Use this component to extract meaningful text from any web page.
This component uses a Java-based library called BoilerPipe (boilerpipe-web.appspot.com) to detect and remove boilerplate text from a web page, keeping only the main textual content. The library takes a heuristic approach described in this research paper:
l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
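To give a feel for what a heuristic approach to boilerplate removal can look like, here is a minimal toy sketch in Python. It is not BoilerPipe's actual rule set: the TextBlock type, the two features (word count and link density) and both thresholds are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A chunk of page text (hypothetical type, for illustration only)."""
    text: str
    linked_words: int = 0  # number of words that sit inside link anchors


def word_count(block: TextBlock) -> int:
    return len(block.text.split())


def link_density(block: TextBlock) -> float:
    """Share of the block's words that are link text (0.0 for empty blocks)."""
    n = word_count(block)
    return block.linked_words / n if n else 0.0


def is_content(block: TextBlock,
               min_words: int = 15,              # assumed threshold
               max_link_density: float = 0.33    # assumed threshold
               ) -> bool:
    """Treat long blocks with few links as content, everything else as boilerplate."""
    return (word_count(block) >= min_words
            and link_density(block) <= max_link_density)


def extract_main_text(blocks: list[TextBlock]) -> str:
    """Join the text of all blocks classified as content."""
    return "\n".join(b.text for b in blocks if is_content(b))
```

Short, link-heavy blocks such as navigation menus are dropped by this rule, while long paragraphs with few links are kept.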
Before starting the analysis, the component automatically downloads the Java library into your workflow from:
storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/boilerpipe/boilerpipe-1.2.0-bin.tar.gz
When processing a long list of URLs, the Java library might get stuck on a web address whose page is faulty (no valid HTML body) or far too large. In those exceptional cases the component might fail or take too long, so if possible remove the faulty web addresses from the input table beforehand.
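One way to pre-filter the input list is sketched below in Python, using only the standard library. It is a rough sketch, not part of the component: the function names, the 5 MB size cap and the 5 second timeout are assumptions, and some servers do not answer HEAD requests or omit the Content-Length header, so the check can reject usable pages or let oversized ones through.

```python
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request, urlopen

MAX_BYTES = 5_000_000   # assumed size cap, tune to your data
TIMEOUT_SECONDS = 5     # assumed per-URL timeout


def has_valid_syntax(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host part."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def looks_like_html(content_type: Optional[str],
                    content_length: Optional[int],
                    max_bytes: int = MAX_BYTES) -> bool:
    """Judge a page from its response headers: HTML and not too large."""
    if content_type is None or "html" not in content_type.lower():
        return False
    if content_length is not None and content_length > max_bytes:
        return False
    return True


def is_worth_processing(url: str) -> bool:
    """Send a HEAD request and apply the header checks; False on any error."""
    if not has_valid_syntax(url):
        return False
    try:
        request = Request(url.strip(), method="HEAD")
        with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            content_type = response.headers.get("Content-Type")
            raw_length = response.headers.get("Content-Length")
            length = int(raw_length) if raw_length is not None else None
    except (OSError, ValueError):
        # Covers DNS failures, timeouts, HTTP errors and malformed headers.
        return False
    return looks_like_html(content_type, length)
```

Applying `is_worth_processing` to each row and keeping only the rows for which it returns True gives a cleaned list to feed into the component.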
- Type: Table
  Web Addresses
  Web addresses as input.