This workflow demonstrates KNIME's capability to connect to Databricks Unity Volumes, part of the Unity Catalog framework, and to read files from and write files to those volumes.
The use case presented here involves writing monthly Excel files containing daily weather information for different locations as Parquet files into a Databricks Unity Volume. The data is then read back, and a simple linear regression model is applied in Spark.
For more information about Databricks Unity Catalog and Databricks Unity Volumes, please refer to the "External resources" links.
You can download the workflow and run it on your local machine; for best results, use the latest version of KNIME Analytics Platform.
Workflow Requirements
To run the workflow locally, you will need:
A Databricks account
An existing Databricks cluster
Workflow Details
Connecting to Databricks Unity Volume
First, we connect to the Databricks Unity Volume where we want to read and write files, using the Databricks Unity File System Connector node.
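For readers who want to see the equivalent connection expressed in code, here is a minimal sketch using the Databricks SDK for Python instead of the KNIME connector node; the workspace URL, access token, and the /Volumes/main/weather/raw path are illustrative placeholders, not values from the workflow.

```python
# Minimal sketch (not part of the KNIME workflow): reaching a Unity Volume
# with the Databricks SDK for Python. Host, token, and the volume path are
# placeholders.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://<your-workspace>.cloud.databricks.com",  # placeholder
    token="<personal-access-token>",                        # placeholder
)

# Unity Volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/...
volume_dir = "/Volumes/main/weather/raw"

# List the files currently stored in the volume directory.
for entry in w.files.list_directory_contents(volume_dir):
    print(entry.path)
```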
Writing Data to Unity Volume
The use case takes thirty generated Excel files containing synthetic weather information for 1000 locations and writes them into the Databricks Unity Volume as Parquet files.
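The sketch below shows one way this Excel-to-Parquet conversion and upload could look in plain Python, assuming pandas and the Databricks SDK are available; the local folder name, file layout, and volume path are assumptions made for illustration.

```python
# Hedged sketch of the conversion step: read each generated Excel file with
# pandas, convert it to Parquet in memory, and upload it to the Unity Volume.
# The local folder, volume path, and file naming are illustrative assumptions.
import io
import pathlib

import pandas as pd
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # host/token taken from the environment or config file

volume_dir = "/Volumes/main/weather/raw"            # assumed volume path
local_dir = pathlib.Path("generated_excel_files")   # assumed local folder

for xlsx in sorted(local_dir.glob("*.xlsx")):
    df = pd.read_excel(xlsx)                         # daily weather per location
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)                  # needs pyarrow or fastparquet
    buf.seek(0)
    w.files.upload(f"{volume_dir}/{xlsx.stem}.parquet", buf, overwrite=True)
```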
Creating a Spark Context
We create a Spark context using the Create Databricks Environment node and read the previously written Parquet files with the Parquet to Spark node, which creates a DataFrame in Spark.
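In code terms, this step corresponds roughly to reading the Parquet files from the volume into a Spark DataFrame, as in the hedged PySpark sketch below; the volume path is again a placeholder.

```python
# Rough PySpark equivalent of this step; on a Databricks cluster the Spark
# session already exists, and the volume path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

weather = spark.read.parquet("/Volumes/main/weather/raw/")  # assumed volume path
weather.printSchema()
```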
Data Manipulation and Modeling
We manipulate the data in the Spark context using the KNIME Extension for Apache Spark nodes. These steps include filtering out missing values, splitting and normalizing the DataFrame, and applying a linear regression model.
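A rough PySpark equivalent of these preparation and modeling steps is sketched below, continuing from the weather DataFrame of the previous sketch; the column names (temperature, humidity, pressure, wind_speed, rainfall) are assumptions, since the workflow's actual schema is not shown here.

```python
# Hedged PySpark sketch of the preparation and modeling steps. The column
# names are assumptions; the actual workflow schema may differ.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.regression import LinearRegression

feature_cols = ["temperature", "humidity", "pressure", "wind_speed"]  # assumed

clean = weather.dropna(subset=feature_cols + ["rainfall"])  # filter missing values
train, test = clean.randomSplit([0.8, 0.2], seed=42)        # split the DataFrame

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")  # normalize
lr = LinearRegression(featuresCol="features", labelCol="rainfall")

model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
predictions = model.transform(test)
```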
Model Evaluation
Finally, we use the Spark Numeric Scorer node to assess how well the linear regression predicts rainfall from the selected features, and then shut down the Spark context.
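For comparison, a similar evaluation could be expressed in PySpark with a RegressionEvaluator, followed by an explicit shutdown of the Spark session; this sketch reuses the hypothetical predictions DataFrame from the previous snippet.

```python
# Sketch of an equivalent evaluation in PySpark, continuing from the
# hypothetical `predictions` DataFrame above, then shutting down Spark.
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="rainfall", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")

spark.stop()  # shut down the Spark context when the analysis is finished
```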