ContentExtractor

Manipulator

This node provides several content extraction algorithms that extract the textual content of web pages and discard irrelevant elements such as navigation, ads, footers, and headers. The extractors are generally optimized for typical news article and blog post pages. Currently, the following algorithms are provided:

Readability

A port of the JavaScript browser bookmarklet "Readability" by Arc90, a great tool for extracting content from HTML pages. "Readability [...] takes a crack at wiping out all that junk so you can have a more enjoyable reading experience. [...] its success rate is pretty respectable (we'd guess over 90% of web sites are handled properly)". Readability operates on the document's DOM tree. Basically, it assigns every element a score based on its contents. The scoring metrics are the length of the element's text content, the number of commas, and the link density. Class and id names are also taken into consideration; for example, elements with the class name sidebar are unlikely to contain actual content, in contrast to elements with the class name article. Website, JavaScript Source.
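The scoring idea described above can be illustrated with a minimal Python sketch. It is not the Readability source or the Palladian port: the name patterns, the weights, and the helper names (text_of, link_density, score) are assumptions chosen for illustration, and only the four signals named above (text length, comma count, link density, class/id names) are modeled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical name patterns and weights, for illustration only.
POSITIVE = re.compile(r"article|content|post", re.I)
NEGATIVE = re.compile(r"sidebar|footer|nav|banner", re.I)


def text_of(element):
    """Return all text inside an element, including its descendants."""
    return "".join(element.itertext())


def link_density(element):
    """Fraction of the element's text that sits inside <a> elements."""
    total = len(text_of(element))
    if total == 0:
        return 0.0
    linked = sum(len(text_of(a)) for a in element.iter("a"))
    return linked / total


def score(element):
    """Score an element by text length, commas, names, and link density."""
    text = text_of(element)
    value = 1.0
    value += text.count(",")             # commas hint at running prose
    value += min(len(text) // 100, 3)    # longer text scores higher, capped
    names = element.get("class", "") + " " + element.get("id", "")
    if POSITIVE.search(names):
        value += 25
    if NEGATIVE.search(names):
        value -= 25
    # Elements that consist mostly of links are scaled down.
    return value * (1.0 - link_density(element))


doc = ET.fromstring(
    "<body>"
    "<div class='sidebar'><a href='/a'>Home</a> <a href='/b'>About</a></div>"
    "<div class='article'><p>First, a sentence with commas, and more text.</p></div>"
    "</body>"
)
best = max(doc.iter("div"), key=score)
print(best.get("class"))  # article
```

In this sketch the element with the class name article wins because it has commas, a positive class name, and no links, while the sidebar is penalized for its class name and its text consists almost entirely of links.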

Palladian

The Palladian content extractor extracts clean sentences from (English) texts; short phrases are not included in the output. For general content, consider Readability instead. The main difference is that the Palladian extractor also finds sentences in the comment sections of web pages.

Input Ports

  1. Type: Data. Input with (X)HTML documents parsed as DOM/XML.

Output Ports

  1. Type: Data. Text documents with extracted content.

Find here

Community Nodes > Palladian

Make sure to have this extension installed: