Is there a way to directly insert data from a parquet file into PostgreSQL database?

I'm trying to restore some historic backup files that were saved in parquet format, and I want to read each of them once and write the data into a PostgreSQL database.

I know the backup files were saved using Spark, but I have a strict restriction: I can't install Spark on the DB machine, nor read the parquet files with Spark on a remote machine and write them to the database using spark_df.write.jdbc. Everything needs to happen on the DB machine, without Spark or Hadoop, using only Postgres and Bash scripting.

My file structure is something like:

foo/
    foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    ..
    ..

I expect to read the data and schema from each parquet folder like foo, create a table using that schema, and write the data into that table, using only bash and the Postgres CLI.

Dobrinsky answered 10/11, 2019 at 8:5 Comment(3)
You can try the Parquet Foreign Data Wrapper github.com/adjust/parquet_fdw. You'll have to download the files from HDFS first.Cathrine
@RemusRusanu It's quite interesting, thank you! I'm going to test it, but the commit history shows it is still under heavy development. I'm looking for a solution based on processing the files using bash.Dobrinsky
You have chosen an incorrect answer to your question. Converting to CSV is not the same as "a direct way"Sharilyn

You can use Spark to convert the parquet files to CSV format, then move the files to the DB machine and import them with any tool.

# Step 1 (on a machine that has Spark): convert the parquet files to CSV
spark.read.parquet("...").write.csv("...")

# Step 2 (on the DB machine): load the CSV into Postgres with pandas + SQLAlchemy
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('mypath.csv')
df.columns = [c.lower() for c in df.columns]  # Postgres doesn't like capitals or spaces

engine = create_engine('postgresql://username:password@localhost:5432/dbname')
df.to_sql("my_table_name", engine)
Yungyunick answered 10/11, 2019 at 14:41 Comment(5)
Thanks for your answer! Eventually I decided to convert the parquet files to CSV using Spark on another machine, ship the CSV files to the DB machine, and populate the tables using the SQL COPY foo FROM '/path/to/csv/foo' WITH (FORMAT CSV) statement.Dobrinsky
This is one of the best answers I've seen to the question "easiest way to ingest csv files into Postgres using python"Handiness
Alternatively, you can even skip the whole reading-into-Spark/writing-to-CSV step by using pyarrow.parquet and reading directly into pandas with the ParquetDataset function (see the sketch after these comments) - that could save an entire write and read of the data.Zoogloea
Why not use pd.read_parquet here instead of spark.read.parquet ?Zimmerman
Saving to CSV is a bad idea. CSV is an outdated format that is error-prone (it's quite common to have commas in strings). PyArrow is much better.Donee
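A minimal sketch of the pyarrow route mentioned in the comments above, assuming pyarrow, pandas, and SQLAlchemy are available on the machine doing the load (paths, credentials, and the table name are placeholders):

# Read the whole parquet folder with pyarrow and load it via pandas,
# skipping the Spark -> CSV round trip entirely.
import pandas as pd
import pyarrow.parquet as pq
from sqlalchemy import create_engine

table = pq.ParquetDataset("foo/").read()        # reads every part file in the folder
df = table.to_pandas()
df.columns = [c.lower() for c in df.columns]    # avoid mixed-case identifiers in Postgres

engine = create_engine("postgresql://username:password@localhost:5432/dbname")
df.to_sql("foo", engine, if_exists="replace", index=False)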

I made a library to convert from parquet to Postgres’ binary format: https://github.com/adriangb/pgpq.

It lets you use PyArrow to read the parquet data and your postgres driver of choice to write it out (e.g. psycopg).
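A rough sketch of that workflow, reading the folder with pyarrow.dataset and streaming into Postgres through psycopg's binary COPY (the encoder calls write_header/write_batch/finish follow my reading of the pgpq README and may differ between versions; the connection string is a placeholder and the target table foo must already exist with a compatible schema):

import psycopg
import pyarrow.dataset as ds
from pgpq import ArrowToPostgresBinaryEncoder

dataset = ds.dataset("foo/")                    # all part files in the folder
encoder = ArrowToPostgresBinaryEncoder(dataset.schema)

with psycopg.connect("postgresql://username:password@localhost:5432/dbname") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY foo FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.write(encoder.write_header())          # Postgres binary COPY header
            for batch in dataset.to_batches():
                copy.write(encoder.write_batch(batch))  # each Arrow batch as binary rows
            copy.write(encoder.finish())                # binary COPY trailer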

Update: it's been pointed out to me that DuckDB can do this as well. Here's a benchmark where I show how to do it with both pgpq and DuckDB: permalink. Both perform about the same but DuckDB is of course much more widely used, better tested and more versatile.
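For comparison, a sketch of the DuckDB route using its postgres extension (connection string, schema, and table name are placeholders; this assumes the extension can be installed on the machine running DuckDB):

import duckdb

con = duckdb.connect()   # in-memory DuckDB session
con.sql("INSTALL postgres")
con.sql("LOAD postgres")
con.sql("ATTACH 'dbname=dbname user=username password=password host=localhost' AS pg (TYPE postgres)")
# read every part file in the folder and materialise it as a Postgres table
con.sql("CREATE TABLE pg.public.foo AS SELECT * FROM read_parquet('foo/*.parquet')")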

Sapless answered 14/2, 2023 at 22:14 Comment(0)

This page has been viewed more than 34,000 times, and many of those readers probably don't need a bash-specific solution.

General answer about working with Parquet in PostgreSQL

The best approach is to use only SQL; several tools make this possible. For example, to read a Parquet file on Amazon S3, go through JDBC; for local files, use a foreign data wrapper (FDW) such as parquet_fdw.
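For instance, a hypothetical parquet_fdw setup driven from Python with psycopg (the same statements can be run directly in psql; the column list, server name, and file path are placeholders that must match the real parquet schema, and parquet_fdw has to be built and installed on the Postgres server first -- depending on your setup a user mapping may also be required):

import psycopg

with psycopg.connect("postgresql://username:password@localhost:5432/dbname") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS parquet_fdw")
    conn.execute("CREATE SERVER IF NOT EXISTS parquet_srv FOREIGN DATA WRAPPER parquet_fdw")
    conn.execute("""
        CREATE FOREIGN TABLE IF NOT EXISTS foo (id bigint, value text)
        SERVER parquet_srv
        OPTIONS (filename '/path/to/foo/part-00000.snappy.parquet')
    """)
    # the foreign table can now be queried, or copied into a regular table
    print(conn.execute("SELECT count(*) FROM foo").fetchone())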


If you encounter issues with local files, you can try to work around them by running the pgpq command directly in bash.

Amora answered 10/11, 2019 at 8:5 Comment(0)
