This node provides several content extraction algorithms that extract the textual content of web pages and discard irrelevant elements such as navigation, ads, footers, and headers. The extractors are generally optimized for typical news articles and blog posts. Currently, three algorithms are provided:
The Palladian content extractor extracts clean sentences from (English) texts; short phrases are not included in the output. For general content, consider Readability instead. The main difference is that this extractor also finds sentences in the comment sections of web pages.
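To make the idea of sentence-level extraction concrete, here is a minimal Python sketch. It is not Palladian's actual implementation; it only illustrates the behaviour described above under stated assumptions: text inside elements listed in the hypothetical `SKIP_TAGS` set is treated as boilerplate, and a text fragment counts as a "clean sentence" only if it ends with terminal punctuation and has at least `MIN_WORDS` words (an arbitrary threshold chosen for illustration). All names in the sketch are made up for this example.

```python
import re
from html.parser import HTMLParser

# Assumption: text inside these elements is boilerplate and gets discarded.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Elements that end the current block of running text.
BLOCK_TAGS = {"p", "div", "li", "td", "br", "blockquote", "section",
              "article", "h1", "h2", "h3", "h4", "h5", "h6"}
# Hypothetical threshold: fragments with fewer words count as short phrases.
MIN_WORDS = 5


class SentenceCollector(HTMLParser):
    """Collects text blocks outside of the skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.blocks = []      # finished text blocks
        self.current = []     # text pieces of the block being built

    def flush(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS or tag in BLOCK_TAGS:
            self.flush()
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS or tag in BLOCK_TAGS:
            self.flush()
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def extract_sentences(html, min_words=MIN_WORDS):
    """Return sentence-like fragments from the non-boilerplate text of html."""
    parser = SentenceCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    sentences = []
    for block in parser.blocks:
        # Naive split after ., ! or ? followed by whitespace.
        for candidate in re.split(r"(?<=[.!?])\s+", block):
            candidate = candidate.strip()
            long_enough = len(candidate.split()) >= min_words
            if long_enough and candidate.endswith((".", "!", "?")):
                sentences.append(candidate)
    return sentences
```

Calling `extract_sentences(page_html)` on a typical article page would keep full sentences from the article body and from comment paragraphs, while dropping headings, link labels such as "Read more", and one-word comments. A real extractor needs far more robust sentence detection and boilerplate scoring than this sketch provides.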
- Input (Type: Data): (X)HTML documents parsed as DOM/XML.
- Output (Type: Data): Text documents with extracted content.