In my R script, I have a SparkDataFrame of two columns (time, value) containing data for four different months. Since I need to apply my function to each month separately, I figured I would repartition it into four partitions, each of them holding the data for a separate month.
I created an additional integer column called partition, with values 0 - 3, and then called the repartition method on that specific column.
Sadly, as described in this topic:
Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?, with the repartition method we are only guaranteed that all the data with the same key will end up in the same partition; data with a different key can also end up in the same partition.
In my case, executing the code below creates 4 partitions but populates only 2 of them with data.
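To illustrate what I think is happening (this is my own experiment, using the repartitionedDf built in the code at the end of this question): as far as I understand, repartition by column places each row in partition pmod(hash(key), numPartitions), so two distinct month keys can collide on the same partition. This sketch compares the predicted placement with the actual one reported by spark_partition_id():

library(SparkR)

# Predicted placement: Spark hash-partitions on pmod(hash(key), numPartitions).
withPred <- withColumn(repartitionedDf, "predictedPart",
                       pmod(hash(repartitionedDf$partition), lit(4L)))

# Actual placement: spark_partition_id() reports the physical partition of each row.
withPid <- withColumn(withPred, "actualPart", spark_partition_id())

# Grouping shows that several keys can share one physical partition.
showDF(count(groupBy(withPid, "partition", "predictedPart", "actualPart")))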
I guess I should be using the partitionBy method instead; however, in the case of SparkR I have no idea how to do that. The official documentation states that this method is applied to something called a WindowSpec, not a DataFrame.
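For reference, here is what that WindowSpec API looks like in SparkR, as far as I can tell (a minimal sketch, assuming the df from the code below): it defines logical groups for window (analytic) functions rather than the physical layout of a DataFrame, which is why I am unsure it helps here.

# partitionBy on a WindowSpec: windowPartitionBy() groups rows logically
# for window functions; it does not repartition the DataFrame itself.
ws <- orderBy(windowPartitionBy("partition"), "time")

# Example use: a running row number within each month.
dfWithRank <- withColumn(df, "rowInMonth", over(row_number(), ws))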
I would really appreciate some help with this matter, as I have no idea how to incorporate this method into my code.
sparkR.session(
  master = "local[*]",
  sparkConfig = list(spark.sql.shuffle.partitions = "4"))

df <- as.DataFrame(inputDat)  # input data with the added partition column
repartitionedDf <- repartition(df, col = df$partition)

# Schema of the data.frame returned by produceHourlyResults()
schema <- structType(
  structField("time", "timestamp"),
  structField("value", "double"),
  structField("partition", "string"))

# Apply the function to each physical partition
processedDf <- dapply(repartitionedDf,
  function(x) { data.frame(produceHourlyResults(x), stringsAsFactors = FALSE) },
  schema)
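For completeness, one alternative I have been looking at is gapply(), which groups rows by key before applying the function, so the result would not depend on the physical partition layout at all. A sketch, assuming produceHourlyResults() accepts one month of data as a plain data.frame just as with dapply above (whether this fits my use case is part of what I am unsure about):

# gapply groups by the "partition" column; func receives the key and the
# rows for that key as an R data.frame, regardless of physical partitioning.
processedDf <- gapply(df, "partition",
  function(key, x) {
    data.frame(produceHourlyResults(x), stringsAsFactors = FALSE)
  },
  schema)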