Hub
  • Software
  • Blog
  • Forum
  • Events
  • Documentation
  • About KNIME
  • KNIME Hub
  • knime
  • Spaces
  • Examples
  • 00_Components
  • Text Processing
  • Web Text Scraper
ComponentComponent

Web Text Scraper

Last edited: 

Drag & drop
Like
Use or download
Copy short link
Use this component to extract meaningful text from any web page. This component uses a Java based library called BoilerPipe (boilerpipe-web.appspot.com) to detect and remove boilerplate text from a web page and only extract the main textual content. The Java library uses a heuristic based approach, based on this research paper: l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf The component, before starting the analysis, automatically downloads inside your workflow the Java library from: storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/boilerpipe/boilerpipe-1.2.0-bin.tar.gz When processing a long list of URLs, the Java library might get stuck on a web address with a faulty (not a valid HTML body) or way too big web page. The component in those exceptional cases might fail or take too long, so if possible remove the faulty web addresses from the input table beforehand.

Component details

Input ports
  1. Type: Table
    Web Addresses
    Web Addresses as input.
Output ports
  1. Type: Table
    Table
    Outputs an additional column with extracted texts.

Used extensions & nodes

Created with KNIME Analytics Platform version 4.4.0
  • Go to item
    KNIME Base nodes Trusted extension

    KNIME AG, Zurich, Switzerland

    Version 4.4.0

  • Go to item
    KNIME Javasnippet Trusted extension

    KNIME AG, Zurich, Switzerland

    Version 4.4.0

  • Go to item
    KNIME Quick Forms Trusted extension

    KNIME AG, Zurich, Switzerland

    Version 4.4.0

  1. Go to item
  2. Go to item
  3. Go to item
  4. Go to item
  5. Go to item
  6. Go to item

Legal

By using or downloading the component, you agree to our terms and conditions.

KNIME
Open for Innovation

KNIME AG
Hardturmstrasse 66
8005 Zurich, Switzerland
  • Software
  • Getting started
  • Documentation
  • E-Learning course
  • Solutions
  • KNIME Hub
  • KNIME Forum
  • Blog
  • Events
  • Partner
  • Developers
  • KNIME Home
  • KNIME Open Source Story
  • Careers
  • Contact us
Download KNIME Analytics Platform Read more on KNIME Server
© 2022 KNIME AG. All rights reserved.
  • Trademarks
  • Imprint
  • Privacy
  • Terms & Conditions
  • Credits