The Cache node materializes and caches the input table in a data processing workflow. It is most useful after a sequence of preprocessing steps, especially steps that transform columns by removing, modifying, or adding them.
In workflows with multiple transformation nodes, each node stores only the data it modifies (e.g., added columns), while unmodified columns reference the input data. Although this approach optimizes execution and data caching for individual nodes, it can produce tables that are composites of multiple nested tables. Iterating over such a composite table may be less efficient than iterating over a single, unified table.
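The layering described above can be illustrated with a small Python sketch. This is a conceptual model only, not the platform's actual storage format or API: the class names `BaseTable` and `DerivedTable` and the `get` method are hypothetical. Each derived table keeps only its own added columns and delegates everything else to its parent, so reading an untouched column walks through every layer.

```python
class BaseTable:
    """Hypothetical fully stored table: every column is held locally."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> list of values

    def column_names(self):
        return list(self.columns)

    def get(self, name):
        # Return the values and the number of table layers visited.
        return self.columns[name], 1


class DerivedTable:
    """Hypothetical node output: stores only added columns and the names
    of removed ones; unmodified columns are read from the parent table."""

    def __init__(self, parent, added=None, removed=()):
        self.parent = parent
        self.added = dict(added or {})
        self.removed = set(removed)

    def column_names(self):
        inherited = [
            n for n in self.parent.column_names()
            if n not in self.removed and n not in self.added
        ]
        return inherited + list(self.added)

    def get(self, name):
        if name in self.added:
            return self.added[name], 1
        if name in self.removed:
            raise KeyError(name)
        values, depth = self.parent.get(name)
        return values, depth + 1  # one more layer traversed


# Two transformation steps stacked on a raw table.
raw = BaseTable({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
step1 = DerivedTable(raw, added={"tax": [1.0, 2.0, 3.0]})
step2 = DerivedTable(step1, added={"total": [11.0, 22.0, 33.0]}, removed=["price"])

id_values, id_depth = step2.get("id")           # delegated down to raw
total_values, total_depth = step2.get("total")  # stored in step2 itself
```

In this model, `id_depth` grows with every additional transformation step, while a column added by the last step is read directly. That growing lookup chain is the cost that a single, unified table avoids.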
The Cache node addresses this by materializing the input data, creating a self-contained table that consolidates all columns. The Cache node is also useful when portions of a workflow are executed in streaming mode, because it allows data to be staged at specific points. Each staged table is a snapshot of the data at that point in the workflow, which facilitates inspection and debugging.
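The staging idea can be sketched in Python with generators standing in for streamed execution. This is an analogy under stated assumptions, not the Cache node's implementation: the functions `read_rows`, `add_double`, `keep_large`, and `cache` are hypothetical names invented for this illustration. A generator pipeline passes rows along without keeping them, so intermediate data cannot be examined afterwards. Consuming the stream into a list at a chosen point produces a self-contained snapshot that can be inspected and reused.

```python
def read_rows():
    """Hypothetical streamed source: yields one row at a time."""
    for i in range(1, 6):
        yield {"id": i, "value": i * 10}


def add_double(rows):
    """Streamed transformation: adds a column to each row as it passes."""
    for row in rows:
        yield {**row, "double": row["value"] * 2}


def keep_large(rows, threshold):
    """Streamed filter: passes on rows whose 'double' meets the threshold."""
    for row in rows:
        if row["double"] >= threshold:
            yield row


def cache(rows):
    """Stage the stream: consume it fully into a self-contained list."""
    return list(rows)


# Stage the data after the first transformation.
staged = cache(add_double(read_rows()))

# The snapshot can be inspected any number of times...
first_row = staged[0]

# ...and still feeds the downstream step, which continues streaming.
result = list(keep_large(iter(staged), threshold=60))
```

The trade-off in this sketch is memory: the staged list holds every row at once, which is the price of being able to look at the data at that point.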