What is the best practice to load a specific partition of a Delta table in Databricks?
I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta") \
    .option('basePath', '/mnt/raw/mytable/') \
    .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!

Communication answered 12/7, 2021 at 8:37

In the second option, Spark loads only the partitions referenced in the filter condition: internally, Spark performs partition pruning and reads only the relevant data from the source table.

Whereas in the first option, you are directly instructing Spark to load only the partition you specified.

So in both cases, you end up reading only the data for that partition.
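
You can verify the pruning by inspecting the physical plan. A minimal sketch, reusing the path and partition column from the question; explain() prints the query plan, and the scan node should list the partition predicate (e.g. under PartitionFilters) rather than a full-table scan:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

# The scan node in the printed plan should show the ingestdate predicate
# as a partition filter, confirming only that partition's files are read.
df.explain(True)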

Lanai answered 13/7, 2021 at 5:47

If your table is partitioned and you want to read just one partition, you can do it using where:

val partition = "year = '2019'"

val df = spark.read
  .format("delta")
  .load(path)
  .where(partition)
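
The PySpark equivalent, under the same assumptions (a table at path, partitioned by a year column); where() also accepts a SQL predicate string:

# where() is an alias for filter(), so this prunes partitions the same way
df = spark.read.format("delta").load(path).where("year = '2019'")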
Nutgall answered 16/9, 2022 at 13:9

Comment (Exceptive): where is just an alias for filter, so this is option 2.
