Not able to cat a DBFS file in a Databricks Community Edition cluster: FileNotFoundError: [Errno 2] No such file or directory

I'm trying to read a Delta log file in a Databricks Community Edition cluster (DBR 7.2):

df=spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")

with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)

Getting file not found error:

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
      2   for l in f:
      3     print(l)

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'

I have tried prefixing the path with /dbfs/ and dbfs:/, but nothing worked out; I'm still getting the same error.

with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)

But using dbutils.fs.head I was able to read the file.

dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json")

'{"commitInfo":{"timestamp":1598224183331,"userId":"284520831744638","userName":"","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"notebook":{"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"1171","numOutputRows":"100"}}}\n{"protocol":{"minReaderVersi...etc

How can we read/cat a DBFS file in Databricks with Python's open method?

Biblio answered 23/8, 2020 at 23:16 Comment(0)

By default, this data is on DBFS, and your code needs to understand how to access it. Python doesn't know about it - that's why it's failing.

But there is a workaround - DBFS is mounted to the nodes at /dbfs, so you just need to prefix your file name with it: instead of /user/delta_test/_delta_log/00000000000000000000.json, use /dbfs/user/delta_test/_delta_log/00000000000000000000.json
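
For example, on a cluster where that mount is available, reading the log file is just a matter of prefixing the path (a minimal sketch, reusing the path from the question; it assumes the /dbfs fuse mount exists on the cluster):

# assumes the /dbfs fuse mount is available (standard clusters; not CE on DBR 7+)
with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
    for line in f:
        print(line)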

Update: on Community Edition with DBR 7+, this mount is disabled. The workaround is to use the dbutils.fs.cp command to copy the file from DBFS to a local directory, such as /tmp or /var/tmp, and then read from it:

dbutils.fs.cp("/file_on_dbfs", "file:///tmp/local_file")

Please note that if you don't specify a URI scheme, the path refers to DBFS by default; to refer to a local file you need to use the file:// prefix (see docs).
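
Putting the two steps together on Community Edition, a rough sketch (the local file name under /tmp is just an example) would be:

# copy the Delta log file from DBFS to the driver's local disk (note the file:// scheme),
# then read it with the regular Python open()
dbutils.fs.cp("/user/delta_test/_delta_log/00000000000000000000.json",
              "file:///tmp/00000000000000000000.json")

with open("/tmp/00000000000000000000.json", "r") as f:
    for line in f:
        print(line)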

Cyclops answered 9/9, 2020 at 14:51 Comment(6)
Thanks @AlexOtt, I have tried using /dbfs with the Python open file API: with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r') as f: for l in f: print(l) - still getting the same error. Do I need to manually create a mount point to access the file before reading it the Python way? – Biblio
It looks like it depends on the DBR version on the Community Edition - it works just fine with DBR 6.6, but /dbfs/ is empty on DBR 7.2. – Cyclops
Hi Alex, could you elaborate how to work around this issue with dbutils.fs.cp? I have tried the following: dbutils.fs.cp("databricks-datasets/README.md", "/tmp/README.md") --> worked; %fs ls /tmp/README.md --> returns the path "dbfs:/tmp/README.md"; f = open("/tmp/README.md", "r") --> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/README.md' – Debera
I've added the code example to the answer. By default, if you don't specify a scheme, all references go to DBFS. To use local files, use file:// – Cyclops
@AlexOtt, Hi Alex, this answer is super helpful. But I have one question: this file:/// directory is not on DBFS, so where is it located? I searched my local machine, but no such directory is being created. I appreciate your help! – Ethylene
It will be located on the driver node. – Cyclops
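
To make that last point concrete, a small sketch (reusing the README.md path from the comment above) shows that the file:// destination ends up on the driver node's local disk, where plain Python I/O can see it:

import os

# copy a sample file from DBFS to the driver's local /tmp (note the file:// scheme on the target)
dbutils.fs.cp("dbfs:/databricks-datasets/README.md", "file:///tmp/README.md")

# the copy lives on the driver node's local filesystem, not on DBFS
print(os.path.exists("/tmp/README.md"))  # True on the driver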

We encountered the same issue and discovered that if you're using Azure Databricks instead of the Community Edition, you simply need to set the cluster to 'No Isolation Shared' access mode. That's all you need to do to fetch from DBFS :)

Vergne answered 7/3 at 19:7 Comment(1)
As it's currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Equipollent
