- Type: Table. Web Addresses: web addresses as input.
Use this component to extract the main text from any web page. The component uses a Java-based library called BoilerPipe (boilerpipe-web.appspot.com) to detect and remove boilerplate text from a web page and extract only the main textual content. The library takes a heuristic approach described in this research paper: l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
Before starting the analysis, the component automatically downloads the Java library into your workflow from: storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/boilerpipe/boilerpipe-1.2.0-bin.tar.gz
When processing a long list of URLs, the Java library might get stuck on a web address whose page is faulty (for example, the body is not valid HTML) or very large. In those exceptional cases the component might fail or take too long, so if possible remove the faulty web addresses from the input table beforehand.
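One way to remove faulty web addresses beforehand is a small pre-filter, for example in a Python scripting step ahead of the component. The sketch below is not part of the component. The function names, the 5 MB size limit, and the 10 second timeout are illustrative assumptions. It first drops addresses that are not syntactically plausible http(s) URLs, and offers an optional network probe that checks the reported content type and size.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def is_plausible_web_address(url):
    """Cheap offline check: a string with an http(s) scheme and a host."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host
        return False
    return parts.scheme in ("http", "https") and bool(host)


def filter_web_addresses(urls):
    """Split a list into (kept, dropped) using the offline check only."""
    kept, dropped = [], []
    for url in urls:
        (kept if is_plausible_web_address(url) else dropped).append(url)
    return kept, dropped


def looks_like_small_html(url, max_bytes=5_000_000, timeout=10):
    """Optional network probe (assumed limits): HEAD request, then check
    the Content-Type and, if the server reports it, the Content-Length."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            if resp.headers.get_content_type() != "text/html":
                return False
            length = resp.headers.get("Content-Length")
            return length is None or int(length) <= max_bytes
    except (OSError, ValueError):
        # Covers connection errors, HTTP errors, timeouts, bad headers
        return False


if __name__ == "__main__":
    candidates = ["https://example.com/a", "ftp://example.com/b", "not a url"]
    kept, dropped = filter_web_addresses(candidates)
    print("kept:", kept)
    print("dropped:", dropped)
```

The offline check only catches malformed addresses. The network probe is a heuristic: some servers reject HEAD requests or omit Content-Length, so a page that passes may still be large or malformed, and a page that fails may in fact be fine.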
- Type: Table. Table: outputs an additional column with the extracted texts.
Created with KNIME Analytics Platform version 4.4.0