What is the best practice to load a specific partition of a Delta table in Databricks?
I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta") \
    .option('basePath', '/mnt/raw/mytable/') \
    .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!

Communication answered 12/7, 2021 at 8:37

In the second option, Spark loads only the partitions referenced in the filter condition: internally, Spark performs partition pruning and reads only the relevant data from the source table.

Whereas in the first option, you are directly instructing Spark to load only the partition you specified.

So in both cases, you end up reading only the data for that partition.
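
You can verify the pruning by inspecting the physical plan. A minimal sketch, reusing the path and partition column from the question; explain() prints the query plan, and the scan node should list the partition predicate (e.g. under PartitionFilters) rather than a full-table scan:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

# The scan node in the printed plan should show the ingestdate predicate
# as a partition filter, confirming only that partition's files are read.
df.explain(True)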

Lanai answered 13/7, 2021 at 5:47

If your table is partitioned and you want to read just one partition, you can do it using where:

val partition = "year = '2019'"

val df = spark.read
  .format("delta")
  .load(path)
  .where(partition)
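
The PySpark equivalent, under the same assumptions (a table at path, partitioned by a year column); where() also accepts a SQL predicate string:

# where() is an alias for filter(), so this prunes partitions the same way
df = spark.read.format("delta").load(path).where("year = '2019'")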
Nutgall answered 16/9, 2022 at 13:9

Comment (Exceptive): where is just an alias for filter, so this is option 2.
