Pandas: Write to Excel not working in Databricks
I am trying to convert a Parquet file to an Excel file. However, whether I use the openpyxl or the xlsxwriter engine in pandas, the write fails with an "Operation not supported" error. Reading an Excel file with the openpyxl engine works fine in Databricks.

Reading works; the code below runs without problems:

import pandas as pd

xlfile = '/dbfs/mnt/raw/BOMFILE.xlsx'
tmp_csv = '/dbfs/mnt/trusted/BOMFILE.csv'
pdf = pd.read_excel(xlfile, engine='openpyxl')  # already returns a DataFrame
pdf.to_csv(tmp_csv, index=None, header=True)

However, when I try to write with openpyxl as well as with XlsxWriter, it does not work:

parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()
pandas_df.to_excel(final, engine='openpyxl')
#pandas_df.to_excel(final, engine='xlsxwriter')#, sheet_name=tbl)

The error I get:

FileCreateError: [Errno 95] Operation not supported

OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
FileCreateError                           Traceback (most recent call last)
<command-473603709964454> in <module>
     17       final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
     18       print(outfile)
---> 19       pandas_df.to_excel(outfile, engine='openpyxl')
     20       #pandas_df.to_excel(outfile, engine='xlsxwriter')#, sheet_name=tbl)

/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
   2179             startcol=startcol,
   2180             freeze_panes=freeze_panes,
-> 2181             engine=engine,
   2182         )
   2183 

Please suggest.

Deviationism answered 10/5/2021 at 14:04

The problem is that the local file API support in DBFS (the /dbfs FUSE mount) has limitations. In particular, it doesn't support the random writes that writing an Excel file requires. From the documentation:

Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs.

In your case that could look like this:

from shutil import copyfile

parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
temp_file = '/tmp/PRODUCT.xlsx'  # local disk on the driver node
df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()
# Write the Excel file to local disk first, where random writes are supported...
pandas_df.to_excel(temp_file, engine='openpyxl')

# ...then copy the finished file into DBFS.
copyfile(temp_file, final)

P.S. You can also use dbutils.fs.cp to copy the file (doc); that also works on Community Edition, where the /dbfs FUSE mount isn't supported.
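
As a minimal sketch of that variant (reusing the paths from the example above; note that dbutils.fs.cp takes URI-style paths, so the local file gets a file: prefix and the DBFS target a dbfs: prefix):

# Write to the driver's local disk first, where random writes work...
pandas_df.to_excel('/tmp/PRODUCT.xlsx', engine='openpyxl')

# ...then copy the finished file into DBFS with dbutils instead of shutil.
dbutils.fs.cp('file:/tmp/PRODUCT.xlsx', 'dbfs:/mnt/trusted/PRODUCT.xlsx')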

Slogan answered 10/5/2021 at 14:59
Thanks for that. This came up in the XlsxWriter issues as well. Do you know what is meant by "random writes" in those docs? — Gunfire

"Random" means that the data isn't written sequentially: for example, you write some bytes, then rewind to an earlier position and write new bytes. For Excel files that's quite a common pattern. — Slogan

Thanks. That should have been obvious, but I wasn't sure that was what was meant, because XlsxWriter doesn't do any random writes (although it does do random reads of a sort with rewind()). I guess the issue is with zipfile.py, which probably does use random writes. The full stack trace isn't shown above, but in the issue I linked to, the exception is thrown in zipfile.py around a file seek(). Anyway, thanks for the link. It is good to know that the workaround is officially documented. — Gunfire

Yes, writing a zip file usually requires a random seek, because you need to update some offsets after you finish writing the data. It's possible to do it in one pass, but that's use-case specific. — Slogan
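
To make the "random write" pattern from these comments concrete, here is a minimal, hypothetical sketch (not code from zipfile or XlsxWriter) of the reserve-then-patch pattern that zip writers rely on and that the /dbfs FUSE mount rejects:

# A "random write": reserve space, write the data, then seek back and patch.
# On /dbfs, the seek-then-write step fails with
# OSError: [Errno 95] Operation not supported.
payload = b'payload'
with open('/tmp/random_write.bin', 'wb') as f:
    f.write(b'\x00\x00\x00\x00')                 # placeholder for a length field
    f.write(payload)                             # write the actual data
    f.seek(0)                                    # rewind to the placeholder
    f.write(len(payload).to_bytes(4, 'little'))  # patch in the real length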
