Web Text Scraper

Use this component to extract the main text from a web page. It relies on BoilerPipe (boilerpipe-web.appspot.com), a Java library that detects and removes boilerplate text from a web page and keeps only the main textual content. The library takes a heuristic approach, described in this research paper: l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf

Before starting the analysis, the component automatically downloads the Java library into your workflow from: storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/boilerpipe/boilerpipe-1.2.0-bin.tar.gz

When processing a long list of URLs, the library may get stuck on an address whose page is faulty (the response is not a valid HTML body) or very large. In those cases the component may fail or take too long, so where possible remove such addresses from the input table beforehand.
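The advice about removing faulty addresses beforehand could be approximated with a small pre-check, for example inside a Java Snippet node or as a standalone program. The following is only a sketch and not part of the component: the class name UrlPreFilter, the methods looksScrapable and probe, the 5 MB size limit, and the 5 second timeouts are all assumptions chosen for illustration. It sends a HEAD request and keeps only addresses that report an HTML content type and an acceptable size.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Web Text Scraper component.
class UrlPreFilter {
    // Assumed size limit (5 MB); tune it to your own data.
    public static final long MAX_BYTES = 5_000_000L;

    /** Pure check on response headers: HTML content type and acceptable size. */
    public static boolean looksScrapable(String contentType, long contentLength, long maxBytes) {
        if (contentType == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        boolean isHtml = type.startsWith("text/html")
                || type.startsWith("application/xhtml+xml");
        // contentLength is -1 when the server reports no size; such pages are accepted.
        return isHtml && contentLength <= maxBytes;
    }

    /** Sends a HEAD request; any error or non-2xx status counts as not scrapable. */
    public static boolean probe(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeout in milliseconds
            conn.setReadTimeout(5000);
            try {
                int status = conn.getResponseCode();
                return status >= 200 && status < 300
                        && looksScrapable(conn.getContentType(),
                                conn.getContentLengthLong(), MAX_BYTES);
            } finally {
                conn.disconnect();
            }
        } catch (Exception e) {
            // Malformed address, non-HTTP scheme, timeout, or network failure.
            return false;
        }
    }

    /** Prints the addresses given on the command line that pass the check. */
    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        for (String address : args) {
            if (probe(address)) {
                kept.add(address);
            }
        }
        kept.forEach(System.out::println);
    }
}
```

Some servers reject HEAD requests or report a misleading content type, so a check like this can drop pages that would have worked and cannot guarantee that the remaining ones parse cleanly. It only reduces the chance of the component stalling on an obviously unsuitable address.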

Component details

Input ports

Output ports

Used extensions & nodes

KNIME Base nodes

KNIME Javasnippet

KNIME Quick Forms

Legal
