Dealing with duplicates is a constant theme for data scientists, and a lot can go wrong. The easiest way to deal with them is GROUP BY or DISTINCT: just get rid of them and be done. But as these examples demonstrate, this is not always the best option. Even if your data provider swears that your combined IDs are unique, some muddy duplicates may still be lurking, especially in Big Data scenarios, and you should be able to deal with them.
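To illustrate why DISTINCT alone may not be enough, here is a minimal sketch. It is not taken from the workflow itself: the table `orders`, its columns, and the sample rows are hypothetical, and it uses Python's built-in SQLite (assuming a version with window function support, 3.25 or newer) as a stand-in for H2. The combined ID is assumed to be (customer_id, order_id), and "keep the most recent row" is just one possible rule for picking a survivor.

```python
import sqlite3

# Hypothetical sample data: (customer_id, order_id) is supposed to be unique.
rows = [
    (1, 100, "2023-01-01", 10.0),
    (1, 100, "2023-01-05", 12.5),  # same combined ID, but different content
    (1, 101, "2023-01-02", 7.0),
    (2, 200, "2023-01-03", 3.0),
    (2, 200, "2023-01-03", 3.0),   # exact duplicate of the row above
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders ("
    "customer_id INTEGER, order_id INTEGER, updated_at TEXT, amount REAL)"
)
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# DISTINCT only collapses rows that are identical in every column,
# so the two conflicting versions of order (1, 100) both survive.
distinct_rows = con.execute(
    """
    SELECT DISTINCT customer_id, order_id, updated_at, amount
    FROM orders
    ORDER BY customer_id, order_id, updated_at
    """
).fetchall()

# A window function lets you choose which row to keep per combined ID:
# here the one with the latest updated_at (an assumed rule).
latest_rows = con.execute(
    """
    SELECT customer_id, order_id, updated_at, amount
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, order_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM orders AS o
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id, order_id
    """
).fetchall()

print(len(distinct_rows), "rows after DISTINCT")
print(len(latest_rows), "rows after ROW_NUMBER filter")
```

With this sample, DISTINCT still leaves two rows for the combined ID (1, 100), while the ROW_NUMBER filter returns exactly one row per ID. Which row should win (latest, first, largest amount) is a decision about your data, not something the database can make for you.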
Workflow
School of duplicates - and how to deal with them - H2 version
External resources
- long forum debate about duplicates
- School of duplicates - and how to deal with them (corresponding article)
- Window functions with new DB drivers
- A meta collection of KNIME and databases (SQL, Big Data/Hive/Impala and Spark/PySpark)
- Example how to use H2 database to create table with upload and from scratch
Used extensions & nodes
Created with KNIME Analytics Platform version 4.7.0