Web Text Scraper

Version v1.0 (latest), created on Oct 20, 2023 1:30 PM
Use this component to extract the main text from web pages. It relies on BoilerPipe (boilerpipe-web.appspot.com), a Java library that detects and removes boilerplate text from a web page and keeps only the main textual content. The library takes a heuristic approach described in this research paper: l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf

Before the analysis starts, the component automatically downloads the Java library into your workflow from: storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/boilerpipe/boilerpipe-1.2.0-bin.tar.gz

When processing a long list of URLs, the Java library can get stuck on a web address whose page is faulty (no valid HTML body) or very large. In those exceptional cases the component may fail or take too long, so remove faulty web addresses from the input table beforehand where possible.
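The advice about faulty addresses can be illustrated with a small sketch. This is not the component's implementation: the class `UrlGuard` and its methods `isPlausibleWebAddress` and `extractWithTimeout` are hypothetical names, and the `extractor` argument is only a placeholder for whatever extraction call is actually used. It assumes that a purely syntactic check plus a per-address time limit is enough to keep one bad address from blocking a long list.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper, not part of the component or of BoilerPipe.
class UrlGuard {

    /** True if the string parses as an absolute http(s) URI with a host. */
    public static boolean isPlausibleWebAddress(String address) {
        if (address == null || address.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(address.trim());
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // e.g. embedded spaces or illegal characters
        }
    }

    /**
     * Runs the extractor on one address and gives up after timeoutMillis.
     * Returns an empty Optional on timeout, on any exception thrown by the
     * extractor, or if the extractor returns null.
     */
    public static Optional<String> extractWithTimeout(
            Function<String, String> extractor, String address, long timeoutMillis) {
        // Daemon thread so a stuck extraction cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
        Callable<String> task = () -> extractor.apply(address);
        Future<String> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // request interruption of the worker
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String[] addresses = {"https://example.com/article", "not a url", "ftp://example.com/file"};
        for (String address : addresses) {
            System.out.println(address + " plausible: " + isPlausibleWebAddress(address));
        }
        // Stand-in extractor; a real one would fetch and process the page.
        Optional<String> text = extractWithTimeout(a -> "text of " + a, addresses[0], 1000);
        System.out.println(text.orElse("(no text)"));
    }
}
```

The syntactic check only catches malformed addresses; it cannot tell whether a page has a valid HTML body or is too large. The timeout covers those cases from the outside, with the caveat that cancellation only interrupts the worker thread, so an extraction that ignores interrupts keeps running in the background until it finishes.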

Component details

Input ports
  1. Type: Table
    Web Addresses
    Table of web addresses (URLs) to extract text from.
Output ports
  1. Type: Table
    Table
    The input table with an additional column containing the extracted text.

Used extensions & nodes

Created with KNIME Analytics Platform version 5.1.0
  • KNIME Base nodes (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 5.1.0
  • KNIME Javasnippet (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 5.1.0
  • KNIME Quick Forms (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 5.1.0


Legal

By using or downloading the component, you agree to our terms and conditions.
