Challenge 37 - Deduplicate Text - Solution

Tags: Text processing, Tika parser, OCR, Justknimeit, Justknimeit-37 (+3 more)
Draft. Latest edits on Mar 5, 2024 12:28 AM
You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text.

In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.

Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
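For readers who want to see the idea outside of KNIME, the following is a minimal Python sketch of one cost-effective deduplication approach: split the extracted text into sentence-like units and keep only the first occurrence of each. It is not the workflow's actual node sequence, which this page does not list. The function name `deduplicate_text`, the splitting rule, and the sample string are assumptions made for illustration.

```python
import re


def deduplicate_text(text: str) -> str:
    """Keep the first occurrence of each sentence-like unit, preserving order.

    Hypothetical helper for illustration; not part of KNIME or Tika.
    """
    # Split after sentence-ending punctuation or on line breaks.
    units = re.split(r"(?<=[.!?])\s+|\n+", text)
    seen = set()
    kept = []
    for unit in units:
        # Normalize whitespace and case so trivially different copies match.
        key = " ".join(unit.split()).casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(unit.strip())
    return " ".join(kept)


# Made-up Swedish sample with repeated sentences.
sample = "Hej världen. Hej världen. Det här är ett test.\nDet här är ett test."
result = deduplicate_text(sample)
print(result)
```

As in the challenge, this aims for a large reduction rather than perfect removal: near-duplicates that differ by more than whitespace or letter case are left in place.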

Used extensions & nodes

Created with KNIME Analytics Platform version 5.2.1
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.1.0
  • KNIME Textprocessing (trusted extension), KNIME AG, Zurich, Switzerland, version 4.5.0

