How can a DataFrame be partitioned based on the count of the number of items in a column. Suppose we have a DataFrame with 100 people (columns are first_name
and country
) and we'd like to create a partition for every 10 people in a country.
If our dataset contains 80 people from China, 15 people from France, and 5 people from Cuba, then we'll want 8 partitions for China, 2 partitions for France, and 1 partition for Cuba.
Here is code that will not work:
df.repartition($"country")
: This will create 1 partition for China, one partition for France, and one partition for Cubadf.repartition(8, $"country", rand)
: This will create up to 8 partitions for each country, so it should create 8 partitions for China, but the France & Cuba partitions are unknown. France could be in 8 partitions and Cuba could be in up to 5 partitions. See this answer for more details.
Here's the repartition()
documentation:
When I look at the repartition()
method, I don't even see a method that takes three arguments, so looks like some of this behavior isn't documented.
Is there any way to dynamically set the number of partitions for each column? It would make creating partitioned data sets way easier.
$"country", rand
go together aspartitionExprs
in second invokation – Cousteau