How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes (from Delta Lake tables) that are joined with a left join on an id column.

sd1, sd2

%sql
select
    a.columnA,
    b.columnB
from sd1 a
left outer join sd2 b
on a.id = b.id

The problem is that my query takes a long time. Looking for ways to improve it, I found a YouTube video about OPTIMIZE ZORDER BY.

According to the video, Z-ordering a column seems to be useful when that column is going to be part of the WHERE condition.

But since the two dataframes use id in the join condition, could it be worthwhile to Z-order that column?

spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
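As a side note, on Databricks OPTIMIZE returns a result you can inspect, which helps verify what the command actually did (a sketch; display is the Databricks notebook helper, and sd1_delta_table_path is the variable from above):

# Capture the metrics returned by OPTIMIZE (files added/removed, Z-order
# stats) to confirm the rewrite actually happened.
result = spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
display(result)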

The logic in my head is that if we first order that column, then it will take less time to look up the matches for the join. Is this correct?

Thanks in advance.

Prop answered 16/8, 2022 at 14:40 Comment(0)

OPTIMIZE ZORDER may help a bit by placing related data together, but its usefulness may depend on the data type used for the ID column. OPTIMIZE ZORDER relies on the data-skipping functionality, which only gives you min & max statistics per file, and that may not be useful when you have big ranges of IDs in your joins.

You can also tune file sizes, to avoid scanning too many small files; see the sketch below.
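For example, you can hint a target file size per table and then rewrite existing files toward it (a sketch; delta.targetFileSize is a Databricks-specific Delta table property, the 128mb value is illustrative, and sd1_delta_table_path is the variable from the question):

# Hint a target file size for this table, then rewrite existing files
# toward that size while clustering by id.
spark.sql(f"ALTER TABLE delta.`{sd1_delta_table_path}` SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')")
spark.sql(f"OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)")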

But from my personal experience, for joins, bloom filters give better performance because they allow skipping files more efficiently than min/max data skipping. Just build a bloom filter on the ID column...
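A minimal sketch of that on Databricks, assuming the table path from the question (the fpp and numItems values are illustrative and should be sized to your data):

# Create a Bloom filter index on the join column (Databricks SQL).
# fpp = target false-positive rate; numItems = expected distinct values.
spark.sql(f"""
CREATE BLOOMFILTER INDEX ON TABLE delta.`{sd1_delta_table_path}`
FOR COLUMNS (id OPTIONS (fpp = 0.1, numItems = 50000000))
""")

Note that, as far as I know, the index only applies to files written after it is created, so you may need to rewrite existing data (e.g. with OPTIMIZE) for it to take effect.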

Callum answered 16/8, 2022 at 15:48 Comment(0)

Your assumption may be right: Spark will use a sort-merge join (SMJ) for the left join, and the files will be read out and sorted on disk regardless of their layout, so I don't think this join will benefit from the Z-ordered files. From my understanding, only equality predicates (e.g. in WHERE conditions) benefit from that.
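One way to check which join strategy Spark actually picked is to print the physical plan (a sketch using the query from the question; look for a SortMergeJoin node in the output):

# Print the physical plan; a SortMergeJoin node confirms the SMJ path,
# while BroadcastHashJoin would indicate one side was small enough to broadcast.
spark.sql("""
select a.columnA, b.columnB
from sd1 a
left outer join sd2 b
on a.id = b.id
""").explain()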

Trinitroglycerin answered 29/3, 2023 at 3:9 Comment(0)
