TextKleaner

Text analysis, Text preprocessing, Named entity tagging, Ngrams, Duplicate detection
Angus Veitch

This workflow is designed to help you prepare a textual dataset for a bag-of-words style computational analysis. It assumes that you already have your data in a tabular form, that is, a CSV or KNIME table containing a column of plain text documents along with metadata columns.

The workflow performs four types of operations to prepare your text for analysis. First, it scrubs your text, removing or replacing various characters to ensure that the text is formatted cleanly and consistently. Second, the workflow provides various ways to find and exclude documents that are irrelevant to your study. Third, it helps you to find and remove duplicated text, both in the form of highly similar documents and 'boilerplating' that is repeated at the start of documents. Finally, it allows you to enrich your data by tagging names and ngrams, and to refine your data by filtering out terms that are rare or uninformative, and by standardising plurals and other word variants.

While the loading and scrubbing of texts must be performed first, there is some flexibility around the remaining steps. The filtering operations in Step 2 are entirely optional, and can be performed in any order, although there are benefits to detecting duplicates before filtering documents by topic. Duplicate detection and boilerplate removal (Step 3) are also optional, but are highly recommended if you plan to tag ngrams in Step 4 or use topic modelling in your analysis. Duplicate detection should be performed after document filtering, but boilerplate removal can be performed at any stage before Step 4, and indeed may improve the results of the 'Filter by topic' operation. Tagging and filtering (Step 4) must be run last, as it will convert your documents from plain text strings into a tokenised format for subsequent analysis.

Except for duplicate detection (which saves information in a separate table), each operation in Steps 2 and 3 will overwrite the input data with the filtered data. The excluded documents are saved in a separate file, and can be reviewed or restored at any stage.
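For readers who want a feel for the scrubbing and duplicate-detection steps outside KNIME, here is a minimal Python sketch. It is not the workflow's own implementation: the function names (`scrub`, `shingles`, `jaccard`, `split_duplicates`), the word-trigram Jaccard comparison, and the 0.8 similarity threshold are all assumptions chosen for illustration. It does mirror one idea from the description, namely that excluded documents are kept in a separate collection rather than discarded.

```python
import re
import unicodedata

# Hypothetical mapping of curly quotes to straight quotes (illustrative only).
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def scrub(text: str) -> str:
    """Normalise Unicode, straighten curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. no-break space becomes a space
    text = text.translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams (shingles) in a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets are treated as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_duplicates(docs, threshold: float = 0.8):
    """Keep the first document of each near-duplicate group.

    Returns (kept, excluded) so that excluded documents can be reviewed
    or restored later. The threshold is an assumed value, not one taken
    from the workflow.
    """
    kept, excluded, kept_shingles = [], [], []
    for doc in docs:
        s = shingles(doc)
        if any(jaccard(s, k) >= threshold for k in kept_shingles):
            excluded.append(doc)
        else:
            kept.append(doc)
            kept_shingles.append(s)
    return kept, excluded


if __name__ == "__main__":
    raw = [
        "The  council approved the new\u00a0budget on Tuesday evening.",
        "The council approved the new budget on Tuesday evening.",
        "Rainfall totals were well below average this year.",
    ]
    kept, excluded = split_duplicates([scrub(d) for d in raw])
    print(len(kept), "kept;", len(excluded), "excluded")
```

Scrubbing runs before comparison so that differences in whitespace or quote style do not hide otherwise identical documents. The pairwise comparison is quadratic in the number of documents, so a large corpus would need a more scalable approach.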

External resources

  • TextKleaner - a Knime workflow for preparing large text datasets for analysis

Used extensions & nodes

Created with KNIME Analytics Platform version 4.3.0. Note: not all extensions may be displayed.
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Data Generation (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Distance Matrix (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Expressions (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME JavaScript Views (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Javasnippet (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Math Expression (JEP) (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Parallel Chunk Loop Nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Quick Forms (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Textprocessing (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • KNIME Timeseries nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • Vernalis KNIME Nodes (trusted extension), Vernalis Research Ltd, Cambridge, UK, version 1.28.0
