I'm trying to restore some historic backup files that were saved in Parquet format. I want to read them once and write the data into a PostgreSQL database.
I know the backup files were saved using Spark, but I'm under a strict restriction: I can't install Spark on the DB machine, nor can I read the Parquet files with Spark on a remote machine and write them to the database via spark_df.write.jdbc. Everything has to happen on the DB machine itself, without Spark or Hadoop, using only Postgres and Bash scripting.
My file structure looks something like this:
foo/
foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
..
..
I expect to read the data and schema from each Parquet folder such as foo, create a table using that schema, and write the data into the newly created table, using only Bash and the Postgres CLI.
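Conceptually, this is the kind of loop I have in mind (a rough sketch only; the parquet_to_csv step is a placeholder for the conversion I don't know how to do without Spark, and the table name foo and database mydb are just examples):

# Placeholder sketch: "parquet_to_csv" is NOT a real command, it stands in
# for whatever converts a Parquet part file to CSV on this machine.
for f in foo/part-*.snappy.parquet; do
    parquet_to_csv "$f" \
        | psql -d mydb -c "\copy foo FROM STDIN WITH (FORMAT csv, HEADER true)"
done

The CREATE TABLE step would need the schema pulled out of the Parquet files as well, which is the other part I'm stuck on.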