This node returns a Spark DataFrame with an increased or decreased partition count. Repartitioning is useful to address performance issues in the following situations:
- An uneven distribution of rows over partitions, which causes "straggler" tasks that delay the completion of a stage. A straggler task is a task that takes much longer than other tasks of the same stage.
- Too few partitions, which prevents Spark from fully parallelizing computation.
- Too many partitions that each contain very little data, which causes unnecessary scheduling overhead.
- Spark executors that crash or run very slowly because they run out of memory, caused by partitions that contain too much data.
The following guidelines apply when repartitioning a DataFrame:
- Before performing computation on a DataFrame (e.g. preprocessing or learning a model), the partition count should be at least a low multiple of the number of executor cores available in the Spark cluster (see the respective option in the "Settings" tab). This ensures that Spark can properly parallelize computation. For very large data sets, higher multiples of the available executor cores also make sense, in order to avoid memory problems on the Spark executors.
- Before writing a DataFrame to storage (HDFS, S3, ...), it is beneficial to aim for a partition count at which partitions have a reasonable size, e.g. 50 - 100 MB. This ensures fast writing and reading of the DataFrame.
- This node shuffles data, which can be expensive. See the "Advanced" tab for options that avoid shuffling where possible.
- This node requires at least Apache Spark 2.0.
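The sizing guidelines above can be sketched as two small helper calculations. This is a plain-Python illustration; the function names, the 3x core multiple, and the 100 MB default are illustrative assumptions, not settings of this node:

```python
import math

def partitions_for_compute(executor_cores, multiple=3):
    # Guideline: before computation, use at least a low multiple
    # of the executor cores available in the cluster.
    return executor_cores * multiple

def partitions_for_write(total_size_bytes, target_partition_bytes=100 * 1024**2):
    # Guideline: before writing to storage, aim for partitions of
    # roughly 50 - 100 MB each (here: 100 MB per partition).
    return max(1, math.ceil(total_size_bytes / target_partition_bytes))

# A cluster with 16 executor cores -> at least 48 partitions for computation.
print(partitions_for_compute(16))
# A 1 GB DataFrame -> 11 partitions of at most ~100 MB before writing.
print(partitions_for_write(1 * 1024**3))
```

Taking the larger of the two values is a reasonable starting point when the same DataFrame is both processed and written.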