spark.sql.files.maxPartitionBytes not limiting max size of written partitions

I'm trying to copy parquet data from another S3 bucket to my S3 bucket. I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes would default to 128 MB, but when I look at the partition files in S3 after my copy, I see individual partition files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key to limit the maximum size of my partitions: Limiting maximum size of dataframe partition. However, it doesn't seem to work.

This is the definition of that config key:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

I'm also a bit confused about how this relates to the size of the written parquet files.

For reference, I am running a Glue script on Glue version 1.0 with Spark 2.4, and the script is this:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode

// Point Spark SQL's Hive catalog at the AWS Glue Data Catalog
val conf: SparkConf = new SparkConf()
conf.set("spark.sql.catalogImplementation", "hive")
    .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId)
val spark: SparkContext = new SparkContext(conf)

val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession

val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")
Elson answered 30/6, 2020 at 0:36 Comment(0)

The setting spark.sql.files.maxPartitionBytes does indeed cap the size of the partitions when the data is read into the Spark cluster. If the files you end up writing are too large, I suggest decreasing the value of this setting: the input data will then be spread across more partitions, so more (and smaller) files will be written. This no longer holds, however, if your query contains a shuffle, because the data is then repartitioned into the number of partitions given by the spark.sql.shuffle.partitions setting.
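
As a minimal sketch of tuning both settings on the session from the question (the 64 MB and 400 values are only illustrative assumptions, not recommendations):

// Assumes sparkSession and sqlDF as defined in the question's script.
// Lower the read-side split size so each input partition packs at most ~64 MB.
sparkSession.conf.set("spark.sql.files.maxPartitionBytes", "67108864")  // 64 MB

// If the query contains a shuffle (join, aggregation, ORDER BY, ...), the number
// of partitions at write time is driven by this setting instead.
sparkSession.conf.set("spark.sql.shuffle.partitions", "400")

val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")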

Also, the final size of your files will depend on the file format and compression you use. So if you output the data as, for example, Parquet, the files will be much smaller than if you wrote the same data as CSV or JSON.
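
As a rough illustration (reusing sqlDF from the question; the sub-paths are made up for the comparison):

// Columnar, compressed output -- snappy is already the Parquet default in Spark 2.4,
// shown explicitly here only for clarity.
sqlDF.write.mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet("s3://my-s3-location/parquet/")

// The same rows written as row-oriented JSON text are typically much larger on disk.
sqlDF.write.mode(SaveMode.Overwrite)
  .json("s3://my-s3-location/json/")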

Purine answered 30/6, 2020 at 4:43 Comment(1)
Wait, but why are the parquet file sizes nearly double (230 MB) the default maxPartitionBytes value of 128 MB? When you say the final size of my files will depend on the file format and compression I'm using: I'm reading and writing parquet, but the objects in each column are large nested structs. Is that what you mean by the file format affecting the final size? (Sorry, I'm new to Spark.) – Elson

You could use "spark.sql.files.maxRecordsPerFile" to limit the maximum number of records written to a single parquet file and thereby control the maximum size of the files.
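
A minimal sketch, assuming the sparkSession and sqlDF from the question; the 1,000,000-record cap is just a placeholder you would tune to your average row size:

// Global cap: Spark rolls over to a new output file once a task has written
// this many records, independent of how large the in-memory partition is.
sparkSession.conf.set("spark.sql.files.maxRecordsPerFile", "1000000")

// The same cap can also be applied per write through the writer option.
sqlDF.write
  .mode(SaveMode.Overwrite)
  .option("maxRecordsPerFile", "1000000")
  .parquet("s3://my-s3-location/")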

Ypsilanti answered 22/2, 2022 at 5:34 Comment(0)
