Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.

I need to read these parquet files starting from file1 in order and write them to a single csv file. After writing the contents of file1, the contents of file2 should be appended to the same csv without the header. Note that all files have the same column names and only the data is split across multiple files.

I learned to convert a single parquet file to a csv file using pyarrow with the following code:

import pandas as pd    
df = pd.read_parquet('par_file.parquet')    
df.to_csv('csv_file.csv')

But I couldn't extend this to loop over multiple parquet files and append to a single csv. Is there a method in pandas to do this? Any other way to do this would also be of great help. Thank you.

Legitimist answered 5/8, 2018 at 17:27 Comment(2)
are you running your python code in the same directory as the parquet files?Forecourt
I am thinking of copying the parquet files to a folder on my local machine and running the python code from there. I am entirely new to python and not sure where to run the python code. If the hdfs folder location can be given in the code and the contents copied to a csv file on my local machine, that is also perfectly fine. I am sorry if I did not understand your query properly.Legitimist

If you are going to copy the files over to your local machine and run your code there, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files, and that the files are named as you described above: par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. If you need to search for your files instead, get the file names using glob (a hedged glob variant is sketched after the code below) and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.

import pandas as pd

# Create an empty csv file and write the first parquet file with headers
with open('csv_file.csv','w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet written to csv_file.csv\n')
    # no explicit close() needed; the with block closes the file

# build the remaining file names (par_file2 ... par_file100) to look for in the current directory
files = []
for i in range(2,101):
    files.append(f'par_file{i}.parquet')

# open files and append to csv_file.csv
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv','a') as file:
        df.to_csv(file, header=False, index=False)
        print(f'{f} appended to csv_file.csv\n')

You can remove the print statements if you want.

Tested in python 3.6 using pandas 0.23.3
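
If the parquet files live somewhere else, here is a hedged, untested variant of the file-list step using glob; the numeric sort key keeps the files in the intended order (plain string sorting would put par_file10 before par_file2). The path is a placeholder.

import glob
import re

# collect the parquet files and sort them by the number embedded in the name
files = sorted(glob.glob(r'this\is\your\path\to\par_file*.parquet'),
               key=lambda p: int(re.search(r'par_file(\d+)', p).group(1)))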

Forecourt answered 6/8, 2018 at 13:21 Comment(2)
Thanks a lot, this was what I wanted. I ran this code in IDLE by substituting absolute paths for the parquet and csv file locations. I have one more question here. If I am not copying the files to my local machine and want to read the parquet files directly from the hdfs folder location, then append the contents to a csv file locally, is there a way to give the hdfs path inside this code?Legitimist
@Legitimist this documentation should help: crs4.github.io/pydoop/api_docs/hdfs_api.htmlForecourt
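
For the HDFS follow-up in the comments, here is a hedged, untested sketch that reads the parquet files directly from HDFS with pyarrow instead of pydoop. It assumes libhdfs is available on the machine; the host, port and /hdfs/folder path are placeholders for your cluster settings.

import pandas as pd
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# connect to the cluster and list the parquet files in the HDFS folder
hdfs = pafs.HadoopFileSystem(host='namenode-host', port=8020)
infos = hdfs.get_file_info(pafs.FileSelector('/hdfs/folder'))
parquet_paths = sorted(info.path for info in infos if info.path.endswith('.parquet'))

# read each file from HDFS and append it to a local csv
for i, path in enumerate(parquet_paths):
    df = pq.read_table(path, filesystem=hdfs).to_pandas()
    df.to_csv('csv_file.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)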

I ran into this question while looking to see if pandas can natively read partitioned parquet datasets. I have to say that the existing answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to keep opening and closing file handles and then scanning to the end of them, depending on the file sizes.

A better alternative would be to read all the parquet files into a single DataFrame, and write it once:

from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')

Alternatively, if you really want to just append to the file:

data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0 # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)

A final alternative for appending each file is to open the target CSV file in "a+" mode at the outset, keeping the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):

data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0 # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
Revelation answered 5/9, 2018 at 22:8 Comment(3)
This read-all-and-concat approach will be bounded by memory (8 GB, 16 GB, whatever it is). The open-and-append approach won't.Scotty
@Scotty thanks a lot!Backwards
@Scotty The open-and-append approach will be bounded by memory too, as pd.read_parquet reads a whole parquet file into RAM at once. Therefore, if a parquet file is really large it will lead to a memory error, as the dataframe cannot be stored in RAM. One would need to process each parquet file in parts/chunks as well to make it robust.Antiperistalsis
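
A hedged sketch of that chunked idea, assuming a reasonably recent pyarrow: stream each parquet file in record batches so only one batch is held in RAM at a time (the batch size is illustrative).

from pathlib import Path

import pyarrow.parquet as pq

data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', 'w') as csv_handle:
    for i, parquet_path in enumerate(sorted(data_dir.glob('*.parquet'))):
        pf = pq.ParquetFile(parquet_path)
        for j, batch in enumerate(pf.iter_batches(batch_size=64_000)):
            # write the header only for the very first batch of the very first file
            batch.to_pandas().to_csv(csv_handle, header=(i == 0 and j == 0), index=False)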

I had a similar need, and I read that the current pandas version supports a directory path as the argument for the read_parquet function. So you can read multiple parquet files like this:

import pandas as pd    
df = pd.read_parquet('path/to/the/parquet/files/directory')    

It concatenates everything into a single dataframe, so you can convert it to a csv right after:

df.to_csv('csv_file.csv')

Make sure you have at least one of the following engines installed, according to the docs (a minimal sketch with the engine set explicitly follows this list):

  • pyarrow
  • fastparquet
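
A minimal sketch of the same call with the engine pinned explicitly, assuming pyarrow is the one you have installed (engine='fastparquet' works the same way):

import pandas as pd

# read the whole directory of parquet files with an explicit engine
df = pd.read_parquet('path/to/the/parquet/files/directory', engine='pyarrow')
df.to_csv('csv_file.csv', index=False)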
Stocktonontees answered 8/2, 2021 at 15:28 Comment(1)
Doesn't work like that for me.Swiger

This helped me load all the parquet files into one data frame:

import glob
import pandas as pd

files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f, engine='fastparquet') for f in files]
merged_data = pd.concat(data, ignore_index=True)
Room answered 9/2, 2020 at 20:2 Comment(1)
aren't you bound by memory?Backwards

You can use Dask to read in the multiple Parquet files and write them to a single CSV.

Dask accepts an asterisk (*) as wildcard / glob character to match related filenames.

Make sure to set single_file to True and index to False when writing the CSV file.

import pandas as pd
import numpy as np

# create some dummy dataframes using np.random and write to separate parquet files
rng = np.random.default_rng()

for i in range(3):
    df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list('ABCD'))
    df.to_parquet(f"dummy_df_{i}.parquet")

# load multiple parquet files with Dask
import dask.dataframe as dd
ddf = dd.read_parquet('dummy_df_*.parquet', index=False)

# write to single csv
ddf.to_csv("dummy_df_all.csv", 
           single_file=True, 
           index=False
)

# test to verify
df_test = pd.read_csv("dummy_df_all.csv")

Using Dask for this means you won't have to worry about the size of the resulting file (Dask is a distributed computing framework that processes the data in partitions rather than loading it all into memory at once, while pandas might throw a MemoryError if the resulting DataFrame is too large), and you can easily read from and write to cloud data storage like Amazon S3, as in the sketch below.
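
As an example of the cloud-storage point, here is a hedged sketch that reads the parquet files from S3 and still writes a single local csv. It assumes s3fs is installed, and the bucket and prefix are placeholders.

import dask.dataframe as dd

# read the parquet files straight from S3 (via s3fs) and write one local csv
ddf = dd.read_parquet('s3://my-bucket/parquet-dir/*.parquet', index=False)
ddf.to_csv('combined.csv', single_file=True, index=False)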

Tesch answered 5/1, 2022 at 13:39 Comment(0)

A small change for those trying to read remote files, which helps to read them faster (calling read_parquet directly on the remote files was much slower for me):

import io
import pandas as pd

merged = []
# remote_reader = ... <- init some remote reader, for example AzureDLFileSystem()
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        # download the raw parquet bytes first, then parse them locally
        merged.append(f_reader.read())
merged = pd.concat(pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged)

Adds a little temporary memory overhead though.

Novercal answered 12/5, 2020 at 15:49 Comment(0)

Option for large datasets using Polars

If you are working with data that is too big to fit in memory, then you could try using scan_parquet. Not only does this work for large data, it is also very fast. When saving to csv you might want to play with batch_size. Note that Polars has integrated globbing into scan_parquet, so you can pass a wildcard and file extension as part of the path.

import polars as pl

# pass the glob pattern as part of the path; scan_parquet reads all matches lazily
lf = pl.scan_parquet('path/to/the/parquet/files/directory/*.parquet.gzip')
lf.collect().write_csv('csv_file.csv', batch_size=10000)

The normal rule of thumb is that your RAM needs to be 5-10x your data size, but with LazyFrames that constraint relaxes considerably (keep in mind that collect() still materializes the final DataFrame; see the streaming sketch below if even that is too large). Just make sure you have enough room on your drive for the large csv :)
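
A hedged sketch of that streaming route, assuming a recent Polars version that provides sink_csv: it writes the csv without ever collecting the full DataFrame into memory.

import polars as pl

# stream from the parquet glob straight to a single csv, no collect() needed
lf = pl.scan_parquet('path/to/the/parquet/files/directory/*.parquet.gzip')
lf.sink_csv('csv_file.csv')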

Canter answered 24/7 at 14:51 Comment(0)
