What does "Determining location of DBIO file fragments..." mean, and how do I speed it up?
When running simple SQL commands in Databricks, sometimes I get the message:

Determining location of DBIO file fragments. This operation can take some time.

What does this mean, and how do I prevent it from having to perform this apparently-expensive operation every time? This happens even when all the underlying tables are Delta tables.

Jerold answered 30/11, 2019 at 20:11 Comment(0)

That is a message about the delta cache. It determines which executors have which data cached, so that tasks can be routed for the best cache locality. Optimizing your table more frequently, so there are fewer files, will reduce this time.
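For example, a minimal sketch of compacting a table from a notebook cell (the table name my_events is a placeholder; replace it with your own Delta table):

# Compact many small Delta files into fewer, larger ones, so the cache
# has fewer fragments whose locations need to be determined
spark.sql("OPTIMIZE my_events")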

Kinfolk answered 22/4, 2020 at 23:22 Comment(6)
Also, using fewer machines will cut down on this time as well. – Kinfolk
When you say optimizing the table, is that a Spark config or code that needs to be run on a schedule? I am using delta caching and Runtime 10.3 in Azure, and I keep getting that message. We also have a vacuum job, but even after running it I still get the above message. Any help would be much appreciated. – Hardden
@VitaliDedkov run %sql OPTIMIZE [table name] – Jerold
Hi @DavidMaddox, I actually did create a notebook that does that, and it did help. At the time I was not doing that, and the issue then went away (I am not sure what the underlying problem was). I am not sure whether the Optimize command alone would have solved it, but from what I read (and as others commented) it probably would have helped as well. – Hardden
@VitaliDedkov This is an old thread, but I've encountered the same problem and no solution I found seemed to help. Do you mind sharing what worked for you? – Rockhampton
Hi @user17101610, honestly this issue came up and went away a couple of times. What I did first was run the VACUUM dry-run command to see if there were old files that needed to be vacuumed. After that I ran the OPTIMIZE command on the table, then VACUUM without the dry run. That made it go away sometimes, but I suppose a sure way to never see it again is to avoid the delta-cache-enabled VM types in Databricks. Let me know if that is enough detail to help you. – Hardden
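A sketch of the sequence described in the last comment above, as notebook code (the table name my_events is a placeholder; VACUUM uses its default 7-day retention here):

# 1. Dry run: list the files that would be deleted, without deleting anything
spark.sql("VACUUM my_events DRY RUN")
# 2. Compact small files into fewer, larger ones
spark.sql("OPTIMIZE my_events")
# 3. Actually remove old files no longer referenced by the table
spark.sql("VACUUM my_events")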

That message is related to delta caching: if a cluster is constantly scaling up or down, you can occasionally lose pieces of the delta cache. "Determining location of DBIO file fragments" is the operation that works out on which executors the files were cached.

This can be helped by trying a newer DBR such as 11.3 or 12.x. You could also try turning off the cache by setting the configuration below in the notebook and observing the behaviour:

spark.conf.set("spark.databricks.io.cache.enabled", "false")

Trilemma answered 21/11, 2023 at 18:18 Comment(1)
Did you just copy the answer from here? community.databricks.com/t5/data-engineering/… – Rump
