How to keep partition columns when reading in ORC files in Spark
When reading in an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the dataset. For example, if we have

val dfWithColumn = spark.read.orc("/some/path") 

val dfWithoutColumn = spark.read.orc("/some/path/region_partition=1")

then dfWithColumn will have a region_partition column, but dfWithoutColumn will not. How can I specify that I want to include all columns, even if they're partitioned?

I am using Spark 2.2 with Scala.

EDIT: This is a reusable Spark program that takes its arguments from the command line; I want the program to work even if the user passes in a specific partition of a table instead of the whole table. So, using Dataset.filter is not an option.

Recourse answered 12/9, 2018 at 20:23 Comment(2)
If the intention of the second line was only to get the data of that partition, why did you not filter the DataFrame on that column? Since DataFrames are lazily evaluated, the predicate would get pushed down and there is no overhead of reading the whole file. – Semibreve
I interpreted the question differently to the answers. – Rhee
It is the same as with Parquet.

Reference: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery

val df = spark.read
  .option("basePath", "file://foo/bar/")
  .orc("file://foo/bar/partition_column=XXX")

df has a 'partition_column' column.
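
Applied to the paths from the question, a minimal sketch (assuming the table root is /some/path and the partition column is region_partition) would be:

val dfWithoutColumn = spark.read
  .option("basePath", "/some/path")
  .orc("/some/path/region_partition=1")

// The schema now includes region_partition, even though only one
// partition directory was passed in.
dfWithoutColumn.printSchema()

Because basePath tells Spark where the table root is, partition discovery still runs on the directory names below it.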

Pennoncel answered 10/10, 2019 at 2:46 Comment(0)

Instead of encoding the partition columns in the path, add them as filters. Modify your code to:

val dfWithColumn = spark.read.orc("/some/path/").where($"region_partition" === 1)

This will identify the schema properly and will read data only from the "region_partition=1" directory.
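
To confirm that the directory pruning actually happens, one option (a sketch, reusing dfWithColumn from above) is to inspect the physical plan; the predicate should appear as a partition filter rather than a regular data filter:

// Look for the predicate under PartitionFilters in the printed plan,
// e.g. something like: PartitionFilters: [isnotnull(region_partition), (region_partition = 1)]
dfWithColumn.explain()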

Triceratops answered 13/9, 2018 at 7:31 Comment(1)
See my edit; I don't want to only read in a certain partition, I want my program to work even if the user passes in a certain partition instead of the whole table. – Recourse

If the aim is to load one partition rather than the whole dataset, you can take advantage of Spark's lazy evaluation and do the following:

val dfWithColumn = spark.read.orc("/some/path")
  .where($"region_partition" === 1)

This way, you will only read data from the folder:

"/some/path/region_partition=1"

The advantage is that you keep the original structure, with the partition column included in your dataset.

But if your aim is to manipulate the dataset you read in and add a column with a fixed value, I suggest using the withColumn method.
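
For example, a minimal sketch, assuming you read a single partition directory and already know its region_partition value (here 1):

import org.apache.spark.sql.functions.lit

// Reading one partition directory drops region_partition from the schema,
// so re-attach it manually as a literal column.
val dfWithoutColumn = spark.read.orc("/some/path/region_partition=1")
val dfRestored = dfWithoutColumn.withColumn("region_partition", lit(1))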

Monodrama answered 13/9, 2018 at 7:49 Comment(1)
See my edit; I don't want to only read in a certain partition, I want my program to work even if the user passes in a certain partition instead of the whole table. – Recourse
