Spark dataframe saveAsTable vs save

I am using Spark 1.6.1 and I am trying to save a DataFrame in ORC format.

The problem I am facing is that the save method is very slow; it takes about 6 minutes for a 50M ORC file on each executor. This is how I am saving the DataFrame:

dt.write.format("orc").mode("append").partitionBy("dt").save(path)

I tried using saveAsTable to a Hive table which also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems - it seems that when a task fails, retries will always fail because the file already exists. This is how I am saving the DataFrame:

dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)

Is there a reason the save method is so slow? Am I doing something wrong?

Dulcle answered 22/7, 2016 at 16:13 Comment(3)
6 minutes is not that slow for writing 50M files. Sounds like a lot of files! How big is each one? How many executors? If it's one file per row then that's way too many. If the files are sized appropriately for your storage system and the number of nodes/executors used in a typical query, then maybe 50M is fine, but I doubt it. If each of those 50M files is 1 GB then that's ~47 petabytes, so I doubt that. If each is 1 MB then it's ~47 terabytes, and I'd suggest the file size is too small to efficiently query the table. What's the total data volume?Stelmach
It is actually a 50 MB file.Dulcle
Like, it's just one 50 MB file? If it's just one small file then there's not much point partitioning it. It's possible that your dt field has way too much cardinality and it ends up creating a partition for each row. E.g. if it's a timestamp/datetime like "2017-01-01 14:52:22" then the partitioning will happen for each second, which would then write an ORC file for each partition. 50 MB might be a small file, but it could be a lot of rows with different timestamps; e.g. if each row is ~8 KB, then that's ~6400 rows, which is a lot of file I/O.Stelmach
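
If dt really is a per-second timestamp, one way to cap the partition count is to partition on a derived date column instead. A rough sketch, assuming the dt DataFrame and path from the question (dt_day is an illustrative column name):

import org.apache.spark.sql.functions.to_date

// Derive a per-day column so partitionBy creates one directory per day,
// not one per distinct timestamp (i.e. per second).
val byDay = dt.withColumn("dt_day", to_date(dt("dt")))
byDay.write.format("orc").mode("append").partitionBy("dt_day").save(path)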

The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data for every value of the partition column. Try saving it without partitionBy; there should be a significant performance difference.
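
For comparison, a minimal sketch of the suggested write without partitionBy, assuming the same dt DataFrame and path as in the question:

// Same write as in the question, minus partitionBy, to measure the difference.
dt.write.format("orc").mode("append").save(path)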

Unpleasant answered 22/7, 2016 at 17:26 Comment(2)
I need to partition the data, so this is not an option.Dulcle
I think this is a valid point. What is dt? Is that a suitable column for partitioning? If the cardinality is very high then it's probably not appropriate. For example if you are using a value that is different for every row of the dataframe then that's going to make too many partitions. The overhead of all that file I/O is not worth it.Stelmach
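
A quick way to check whether dt is suitable, sketched here against the dt DataFrame from the question: count its distinct values, since partitionBy("dt") will create roughly that many directories (and at least that many ORC files).

// Count distinct values of the partition column; a very large number here
// means partitionBy("dt") will produce a huge number of small files.
val partitionCount = dt.select("dt").distinct().count()
println(s"partitionBy on dt would create about $partitionCount partitions")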

See my previous comments above regarding cardinality and partitionBy.

If you really want to partition it, and it's just one 50MB file, then use something like

dt.write.format("orc").mode("append").repartition(4).saveAsTable(tableName)

repartition will create 4 roughly even partitions, rather than partitioning on the dt column, which could end up writing a lot of ORC files.

The choice of 4 partitions is a bit arbitrary. You're not going to get much performance/parallelism benefit from splitting up a file that small, and the overhead of reading more files is not worth it.

Stelmach answered 15/6, 2017 at 7:45 Comment(0)

Use save() to write to a particular location, for example a path on blob storage.

Use saveAsTable() to save the DataFrame as a Spark SQL table.
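
A minimal sketch of the difference, assuming the dt DataFrame from the question and made-up path/table names:

// save() writes ORC files to a path you manage yourself (HDFS, blob storage, ...).
dt.write.format("orc").mode("append").save("/data/events_orc")

// saveAsTable() writes the same data but also registers it as a table
// in the metastore, so it can be queried by name with Spark SQL.
dt.write.format("orc").mode("append").saveAsTable("events_orc_table")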

Annabel answered 3/1, 2023 at 10:35 Comment(0)
