A meta collection of KNIME and databases (SQL, Big Data/Hive/Impala and Spark/PySpark)
KNIME as a powerful data anaylytics platform is able to connecct to all sorts of data bases and supports a lot of functions mainly by SQL and SQL-like languages. KNIME also supports big data platforms and related frameworks - like Spark and PySpark. There are myriads of database related topics and books and online ressources but ...
What I try to do is present a collection of useful links and examples when you want to set out to use databases and KNIME abd hopefully by providing some sampel workflows that one could try out at home ease some quirks especially when it comes to dealing with big data systems and Spark (including PySpark). The 'normal' SQL examples will be with SQLite and H2 so you can easily try this at home (but they will work for other databases like Postgres or MySQL as well). The big data examples are based on KNIME's local big data environment - which is very useful if you happen not to have such a cluster at home.
A good starting point in general is the official "KNIME Database Extension Guide"
https://docs.knime.com/latest/db_extension_guide/index.html?u=mlauber71
=> you can dive right into the examples and will see quick results, but sometimes it helps to read about the concept.
-------------------------------
A - About SQL in general - a few links to refersh and learn
Structured Query Language Basics
https://www.nuwavesolutions.com/structured-query-language-basics/
SQL CASE WHEN - and other code in the sqltutorial
https://www.sqltutorial.org/sql-case/
KNIME courses Lesson 4. Bringing Things Together - Join and Concatenation
https://www.knime.com/self-paced-course/l1-dw-knime-analytics-platform-for-data-wranglers-basics/lesson4#join-concatenation?u=mlauber71
KNIME Nodeguide for Joining and Concatenating
https://www.knime.com/nodeguide/etl-data-manipulation/joining-and-concatenating?u=mlauber71
-------------------------------
B - KNIME and SQL databases - going from basic to advanced
Database - Simple IO (using standalone SQL-databse SQLite)
https://hub.knime.com/knime/spaces/Examples/latest/01_Data_Access/02_Databases/01_Database_Simple_IO_Example?u=mlauber71
KNIME and SQLite - simple example (with the New DB nodes)
https://hub.knime.com/mlauber71/spaces/Public/latest/sqlite_knime_40/t_001_sqlite_knime_40_db_nodes?u=mlauber71
H2 - use a SQL standalone Database from Scratch or from upload
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_h2_create_table_from_scratch?u=mlauber71
H2 - Example of H2 database - handling of database structure and use of Primary Keys - insert only new lines by ID
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_h2_primary_key?u=mlauber71
Databases - Advanced Usage (using SQLite)
https://hub.knime.com/knime/spaces/Examples/latest/01_Data_Access/02_Databases/02_Database_Advanced_Example?u=mlauber71
use H2 to produce a Position / Rank number within a group variable (window functions with new H2 JDBC drivers)
https://forum.knime.com/t/sqlite-and-window-functions/31608/4?u=mlauber71
https://hub.knime.com/mlauber71/spaces/Public/latest/forum/kn_forum_31608_h2_group_rank_window_function?u=mlauber71
-------------------------------
C - KNIME and Big Data
KNIME can also handle Big Data systems and databases. So you can go from handling your local or small SQL-DB right up to big enterprise systems with (eg.) Cloudera.
KNIME Big Data Extensions User Guide
https://docs.knime.com/latest/bigdata_extensions_user_guide/index.html?u=mlauber71
School of Hive – everything you need to know to work with Hive tables on a Big Data system
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_hive_school_of?u=mlauber71
=> if you want to understand the concepts of Big Data tables and partitions by doing it in code steps from scratch
An overview of KNIME based functions to access big data systems (with KNIME's local big data environment)
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_bigdata_nodes/m_020_db_access_local_bigdata_tables?u=mlauber71
=> see for yourself how the big data nodes work on your local computer
KNIME and Hive - load multiple CSV files at once via external table
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_csv_loader/m_001_import_hive_csv?u=mlauber71
-- C+ Bonus Track (some more advanced big data functions with Hive like external tables demonstrated and partitions half-open code)
work with Hive and external tables in CSV and Parquet
KNIME and Hive - load multiple CSV files at once via external table
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_csv_loader/m_001_import_hive_csv?u=mlauber71
KNIME and Hive - load multiple Parquet files at once via external table
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_parquet_loader/m_001_import_hive_parquet?u=mlauber71
Some more functions with Hive like adding fields to Hive table
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_add_column_db_40?u=mlauber71
If you experience slow connection when working with complex (Impala) queries you could check the "Retrieve in configure" settings
https://forum.knime.com/t/db-nodes-in-a-more-complex-flow-executing-very-slow/23649/2?u=mlauber71
https://docs.knime.com/latest/db_extension_guide/index.html#advanced_tab
Hive - how to handle missing tables in Hive? - us a try-if combination (THX to H. Stölting for the inspiration)
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_hive_missing_table_switch?u=mlauber71
D -------------------------------
KNIME and Spark
Being Lazy is Useful — Lazy Evaluation in Spark
https://medium.com/analytics-vidhya/being-lazy-is-useful-lazy-evaluation-in-spark-1f04072a3648
=> you can do without and jump right into it, but it makes sense to understand this basic concept of Spark
Comparison of Hive and Spark SQL - a gentle introductory example
https://hub.knime.com/knime/spaces/Examples/latest/10_Big_Data/02_Spark_Executor/07_SparkSQL_meets_HiveQL?u=mlauber71
An overview of KNIME based functions to access big data systems - use it on your own big data system (including PySpark)
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_bigdata_nodes/m_120_db_access_bigdata_tables?u=mlauber71
=> use the DB, Spark and PySpark nodes on your big data system
-- going further with Spark
Overview of Examples using Spark (and ML) with KNIME
https://hub.knime.com/knime/spaces/Education/latest/Courses/L4-BD%20Introduction%20to%20Big%20Data%20with%20KNIME%20Analytics%20Platform/3_Spark/4_Examples/?u=mlauber71
Local Big Data Irish Meter
https://hub.knime.com/knime/spaces/Examples/latest/10_Big_Data/02_Spark_Executor/09_Big_Data_Irish_Meter_on_Spark_only?u=mlauber71
Cleaning the NYC taxi dataset on Spark
https://hub.knime.com/knime/spaces/Examples/latest/50_Applications/49_NYC_Taxi_Visualization/Data_Preparation?u=mlauber71
E -------------------------------
More database and SQL related stuff
Tobias Kötter's KNIME hub space with a lot of DB related workflows
https://hub.knime.com/tobias.koetter/spaces/Public/latest/DB/?u=mlauber71
If you want to transfer data between data bases without down- and uploading it all
https://hub.knime.com/tobias.koetter/spaces/Public/latest/DB/DBStreamingDataTransfer?u=mlauber71
Microsoft Access is still a thing - and you can access it with KNIME
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_update_merge_ms_access?u=mlauber71
The full SQL can be found in this free eBook (not all SQL commands might work on all databases)
https://goalkicker.com/SQLBook/
External resources
- B - Example of H2 database - handling of database structure and use of Primary Keys - insert only new lines by ID
- [meta] How to learn KNIME online - fast & link to KNIME capabilities
- B - use H2 to produce a Position / Rank number within a group variable (window functions with new H2 JDBC drivers)
- C+ If you experience slow connection when working with complex (Impala) queries you could check the "Retrieve in configure" settings
- E - The full SQL can be found in this free eBook (not all SQL commands might work on all databases)
- E - Microsoft Access is still a thing - and you can access it with KNIME
- E - If you want to transfer data between data bases without down- and uploading it all
- E - Tobias Kötter's KNIME hub space with a lot of DB related workflows
- D - Cleaning the NYC taxi dataset on Spark
- D - Local Big Data Irish Meter
- D - Overview of Examples using Spark (and ML) with KNIME
- D - An overview of KNIME based functions to access big data systems - use it on your own big data system (including PySpark)
- D - Comparison of Hive and Spark SQL - a gentle introductory example
- D - Being Lazy is Useful — Lazy Evaluation in Spark
- C+ - Hive - how to handle missing tables in Hive? - us a try-if combination (THX to H. Stölting for the inspiration)
- C+ - Some more functions with Hive like adding fields to Hive table
- C+ - KNIME and Hive - load multiple Parquet files at once via external table
- C+ - KNIME and Hive - load multiple CSV files at once via external table
- C - KNIME and Hive - load multiple CSV files at once via external table
- C - An overview of KNIME based functions to access big data systems (with KNIME's local big data environment)
- C - School of Hive – everything you need to know to work with Hive tables on a Big Data system
- C - KNIME Big Data Extensions User Guide
- B - Databases - Advanced Usage (using SQLite)
- B - H2 - use a SQL standalone Database from Scratch or from upload
- B - KNIME and SQLite - simple example (with the New DB nodes)
- B - Database - Simple IO (using standalone SQL-databse SQLite)
- A - KNIME Nodeguide for Joining and Concatenating
- A - KNIME courses Lesson 4. Bringing Things Together - Join and Concatenation
- A - SQL CASE WHEN - and other code in the sqltutorial
- A - Structured Query Language Basics
- A good starting point in general is the official "KNIME Database Extension Guide"
Used extensions & nodes
All required extensions are part of the default installation of KNIME Analytics Platform version 4.3.2
Legal
By using or downloading the workflow, you agree to our terms and conditions.