Pandas cannot read parquet files created in PySpark

I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")

This creates a folder with multiple files in it.
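
For illustration, the written "file" is actually a directory; its contents look something like this (the part-file names below are illustrative and vary per job):

import os

# Illustrative listing only: Spark writes one part file per partition,
# plus a _SUCCESS marker, with job-specific names.
print(os.listdir("path/myfile.parquet"))
# ['_SUCCESS', 'part-00000-....gz.parquet', 'part-00001-....gz.parquet']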

When I try to read this into pandas, I get the following errors, depending on which engine I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow:

File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status

ArrowIOError: Invalid parquet file. Corrupt footer.

fastparquet:

File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)

PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'

I am using the following versions:

  • Spark 2.4.0
  • Pandas 0.23.4
  • pyarrow 0.10.0
  • fastparquet 0.2.1

I tried gzip as well as snappy compression; neither works. I made sure, of course, that the file is in a location where Python has read/write permissions.

It would already help if somebody were able to reproduce this error.

Brahui answered 15/1, 2019 at 15:20 Comment(0)

Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:

import pandas as pd
import datetime
import os

def read_parquet_folder_as_pandas(path, verbosity=1):
  """Read all parquet part files in a folder into a single pandas DataFrame."""
  # Spark also writes marker files such as _SUCCESS; these are skipped
  # because they do not end with "parquet".
  files = [f for f in os.listdir(path) if f.endswith("parquet")]

  if verbosity > 0:
    print("{} parquet files found. Beginning reading...".format(len(files)), end="")
    start = datetime.datetime.now()

  # Read each part file on its own, then stack them into one DataFrame.
  df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
  df = pd.concat(df_list, ignore_index=True)

  if verbosity > 0:
    end = datetime.datetime.now()
    print(" Finished. Took {}".format(end - start))
  return df


def read_parquet_as_pandas(path, verbosity=1):
  """Workaround for pandas not being able to read folder-style parquet files."""
  if os.path.isdir(path):
    if verbosity > 1:
      print("Parquet file is actually a folder.")
    return read_parquet_folder_as_pandas(path, verbosity)
  else:
    return pd.read_parquet(path)

This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested; happy about feedback in the comments).

The function read_parquet_as_pandas() can be used if it is not known beforehand whether it is a folder or not.
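
For example, reusing the path from the question (this works whether the path is a folder or a single file):

df = read_parquet_as_pandas("path/myfile.parquet")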

Brahui answered 17/9, 2019 at 11:51 Comment(0)

The problem is that Spark partitions the data because of its distributed nature: each executor writes a separate part file inside the directory that receives the given filename. Pandas does not support this layout; it expects a single file, not a directory.

You can circumvent this issue in different ways:

  • Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting that to pandas (I did not test this code):

      import pyarrow.parquet

      arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
      arrow_table = arrow_dataset.read()
      pandas_df = arrow_table.to_pandas()
    
  • Another way is to read the separate fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python (a rough sketch follows below).
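
A rough sketch of that second approach (the folder path is reused from the question; this assumes all fragments share one schema):

import os
import pandas as pd

folder = "path/myfile.parquet"  # the directory written by Spark
# Read each fragment on its own, then concatenate into one DataFrame.
fragments = [f for f in os.listdir(folder) if f.endswith(".parquet")]
df = pd.concat(
    (pd.read_parquet(os.path.join(folder, f)) for f in fragments),
    ignore_index=True,
)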

Hierogram answered 15/1, 2019 at 15:32 Comment(5)
Thank you for your answer. It seems that reading single files (your second bullet point) works. However, the first approach does not work: it looks like pyarrow cannot handle PySpark's footer (see error message in question)Brahui
@Thomas, I am unfortunately not sure about the footer issue.Hierogram
Or you could try calling coalesce on the DataFrame, e.g. coalesce(1), so that all the part files are coalesced into one, and then read from that single file instead of a directory of files?Plagio
@OmkarNeogi: This is only possible if you are the person writing the files, not if you receive them from somebody else...Brahui
I updated this to work with the actual APIs, which is that you create a Dataset, convert it to a Table and then to a Pandas DataFrame.Contrarily

If the parquet file has been created with Spark (so it is a directory), to import it into pandas use:

from pyarrow.parquet import ParquetDataset

dataset = ParquetDataset("file.parquet")
table = dataset.read()
df = table.to_pandas()
Fresh answered 13/7, 2020 at 13:58 Comment(0)

I will refer to this answer, which helped me:

Adding engine='fastparquet' worked for me. Otherwise it defaults to engine='pyarrow', and that seems to make the kernel die.
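
A minimal sketch of that call (the path is reused from the question; the fastparquet package must be installed):

import pandas as pd

# Explicitly select the fastparquet engine instead of the default pyarrow.
df = pd.read_parquet("path/myfile.parquet", engine="fastparquet")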

Levulose answered 25/6, 2024 at 22:34 Comment(0)
