Spark Repartition

This node returns a Spark DataFrame with an increased or decreased partition count. This is useful for dealing with performance issues in the following situations:

  • An uneven distribution of rows over partitions, which causes "straggler" tasks that delay the completion of a stage. A straggler task is a task that takes much longer than the other tasks of the same stage.
  • Too few partitions, which prevents Spark from parallelizing computation.
  • Too many partitions with very little data in each, which causes unnecessary overhead.
  • Spark executors that crash or are very slow because they run out of memory when partitions contain too much data.
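To make the first situation concrete, the following is a minimal sketch of how an uneven row distribution could be quantified. The helper name `partition_skew` and the sample counts are hypothetical and not part of the node; the PySpark snippet in the comment is one possible way to obtain per-partition row counts.

```python
from statistics import mean


def partition_skew(row_counts):
    """Return the ratio of the largest partition to the mean partition size.

    A value close to 1.0 means rows are spread evenly; larger values
    suggest that one partition may produce a straggler task.
    """
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    avg = mean(row_counts)
    if avg == 0:
        return 1.0  # all partitions are empty, so there is no skew
    return max(row_counts) / avg


# In PySpark, per-partition row counts could be collected like this
# (empty partitions do not show up in the grouped result):
#   from pyspark.sql.functions import spark_partition_id
#   rows = df.groupBy(spark_partition_id()).count().collect()
#   counts = [r["count"] for r in rows]

# Hypothetical counts: three similar partitions and one much larger one.
counts = [1000, 1100, 950, 9000]
print(round(partition_skew(counts), 2))
```

What ratio counts as "too skewed" depends on the workload; the sketch only reports the number and leaves the threshold to the reader.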

The following guidelines apply when repartitioning a DataFrame:

  • Before performing computation on a DataFrame (e.g. preprocessing or learning a model), the partition count should be at least a low multiple of the number of available executor cores in the Spark cluster (see the respective option in the "Settings" tab). This ensures that Spark can properly parallelize computation. For very large data sets, high multiples of the available executor cores also make sense, because smaller partitions help avoid memory problems on the Spark executors.
  • Before writing a DataFrame to storage (HDFS, S3, ...), aim for a partition count that yields reasonably sized partitions, e.g. 50 MB to 100 MB each. This ensures fast writing and reading of the DataFrame.
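The two guidelines above boil down to simple arithmetic, sketched below. The function names, the default multiple of 3, and the 100 MiB default are illustrative assumptions rather than settings of this node; in particular, reading the suggested partition size as megabytes is an interpretation.

```python
import math


def partitions_for_compute(executor_cores, multiple=3):
    """Partition count as a low multiple of the available executor cores.

    The default multiple of 3 is an assumption; very large data sets
    may call for a higher one.
    """
    return max(1, executor_cores * multiple)


def partitions_for_write(total_bytes, target_partition_bytes=100 * 1024**2):
    """Partition count so that each partition is at most the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# 16 executor cores with a multiple of 3.
print(partitions_for_compute(16))  # 48

# 10 GiB of data at 100 MiB per partition: 102.4, rounded up.
print(partitions_for_write(10 * 1024**3))  # 103
```

The total data size has to be estimated by the reader, for example from the size of the input files; Spark does not report an exact in-memory size for a DataFrame through a single call.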
Notes:
  • This node shuffles data, which can be expensive. See the "Advanced" tab for options to avoid shuffling where possible.
  • This node requires at least Apache Spark 2.0.
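For readers who know the Spark API, the shuffle trade-off can be sketched as follows. In Spark itself, `DataFrame.repartition(n)` performs a full shuffle, while `DataFrame.coalesce(n)` merges existing partitions and avoids a full shuffle when reducing the count. Whether the "Advanced" tab of this node maps to exactly these calls is an assumption; `choose_method` and `resize` are hypothetical helpers.

```python
def choose_method(current_partitions, target_partitions):
    """Pick the Spark DataFrame method name for a partition count change."""
    if target_partitions < 1:
        raise ValueError("target_partitions must be at least 1")
    if target_partitions < current_partitions:
        # Reducing the count: coalesce merges partitions without a full
        # shuffle, but it does not rebalance unevenly sized partitions.
        return "coalesce"
    # Increasing the count (or rebalancing at the same count) needs a shuffle.
    return "repartition"


def resize(df, target_partitions):
    """Apply the chosen method; df is assumed to be a pyspark.sql.DataFrame."""
    current = df.rdd.getNumPartitions()
    method = choose_method(current, target_partitions)
    return getattr(df, method)(target_partitions)
```

Because `coalesce` does not redistribute rows evenly, it is a poor fit when the goal is to fix straggler tasks caused by skew; in that case the cost of the shuffle is usually the price of an even distribution.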

Node details

Input ports
  1. Type: Spark Data
    Spark data table to repartition
    Spark DataFrame to repartition.
Output ports
  1. Type: Spark Data
    Repartitioned Spark data table
    Repartitioned Spark DataFrame.
