School of duplicates - and how to deal with them

Tags: Sql, Duplicates, Remove, Row_id
mlauber71
Draft. Latest edits on May 27, 2020 1:49 PM
Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is GROUP BY or DISTINCT: just get rid of them and be done. But as these examples demonstrate, this might not always be the best option. Even if your data provider swears your combined IDs are unique, there might still be some muddy duplicates lurking, especially in Big Data scenarios, and you should still be able to deal with them. You should also be able to bring a messy dataset into a meaningful table with a nice unique ID without losing too much information. This workflow would like to encourage you to think about what to do with your duplicates, so that you do not get caught off guard but take control :-)
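To make the point about DISTINCT concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from the workflow itself: the table `customers`, its columns, and the sample rows are made up for illustration, and SQLite only stands in for whatever database you actually use. It shows that DISTINCT removes exact copies, while rows that share an ID but differ elsewhere survive and have to be found separately.

```python
import sqlite3

# Hypothetical sample data: customer_id is supposed to be unique, but is not.
rows = [
    (1, "Alice", "2020-01-05"),
    (1, "Alice", "2020-01-05"),    # exact duplicate of the row above
    (2, "Bob", "2020-02-01"),
    (2, "Robert", "2020-03-15"),   # same ID, different content
    (3, "Carol", "2020-01-20"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, last_contact TEXT)"
)
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# DISTINCT only collapses rows that are identical in every selected column.
distinct_rows = con.execute(
    """
    SELECT DISTINCT customer_id, name, last_contact
    FROM customers
    ORDER BY customer_id, last_contact
    """
).fetchall()

# GROUP BY ... HAVING lists the IDs that are still ambiguous after DISTINCT.
duplicate_ids = con.execute(
    """
    SELECT customer_id, COUNT(*) AS n_rows
    FROM (SELECT DISTINCT customer_id, name, last_contact FROM customers)
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    ORDER BY customer_id
    """
).fetchall()

print(distinct_rows)   # customer_id 2 still appears twice
print(duplicate_ids)   # the IDs that need a deliberate decision
```

In this toy data, customer 2 is the "muddy" case: both rows look legitimate, so you need a rule to decide which one to keep instead of silently dropping one.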

External resources

  • use H2 to produce a Position / Rank number within a group variable (window functions)
  • School of duplicates - and how to deal with them (corresponding article)
  • School of Hive - with KNIME's local Big Data environment (SQL for Big Data)
  • long forum debate about duplicates
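
The first resource above is about producing a position or rank number within a group using window functions. A rough sketch of that idea follows, again with Python's sqlite3 module rather than H2 or Hive, and with a made-up `customers` table. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The rule "keep the most recent row per ID" is only one possible choice, picked here for illustration.

```python
import sqlite3

# Hypothetical sample data with duplicate customer_id values.
rows = [
    (1, "Alice", "2020-01-05"),
    (1, "Alice", "2020-01-05"),
    (2, "Bob", "2020-02-01"),
    (2, "Robert", "2020-03-15"),
    (3, "Carol", "2020-01-20"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, last_contact TEXT)"
)
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Rank the rows within each customer_id, newest contact first, and keep
# only rank 1. n_rows records how many rows the ID originally had, so the
# information that duplicates existed is not lost.
deduplicated = con.execute(
    """
    SELECT customer_id, name, last_contact, n_rows
    FROM (
        SELECT
            customer_id,
            name,
            last_contact,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY last_contact DESC, name
            ) AS rank_in_group,
            COUNT(*) OVER (PARTITION BY customer_id) AS n_rows
        FROM customers
    )
    WHERE rank_in_group = 1
    ORDER BY customer_id
    """
).fetchall()

for row in deduplicated:
    print(row)
```

The result has exactly one row per `customer_id`, and the ORDER BY inside the window makes the choice of the surviving row explicit and repeatable rather than leaving it to the database.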

Used extensions & nodes

Created with KNIME Analytics Platform version 4.3.2
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.2
  • KNIME Database (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.2
  • KNIME Excel Support (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.2
  • KNIME Extension for Local Big Data Environments (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.1
  • KNIME Javasnippet (trusted extension), KNIME AG, Zurich, Switzerland, version 4.3.0
  • Vernalis KNIME Nodes (trusted extension), Vernalis Research Ltd, Cambridge, UK, version 1.30.1
