How to copy and convert parquet files to csv

I have access to an HDFS filesystem and can see parquet files with

hadoop fs -ls /user/foo

How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.

Mclaurin asked 9/9, 2016 at 21:29 Comment(0)

Try

df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")

Relevant API documentation: DataFrameReader.parquet and DataFrameWriter.csv.

Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or omit it, since it is usually the default scheme.

You should avoid file://..., because a local path means a different file on every machine in the cluster. Write to HDFS instead, then transfer the result to your local disk from the command line:

hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv

Or display it directly from HDFS:

hdfs dfs -cat /path/to/outfile.csv
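
Note that df.write.csv() actually creates a directory of part files on HDFS rather than a single CSV. A minimal sketch of one way to end up with a single, headered file (the paths are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/infile.parquet")

# coalesce(1) moves all rows into one partition so the output directory
# holds a single part file; only do this when the data fits on one executor.
(df.coalesce(1)
   .write
   .option("header", True)   # write column names as the first line
   .mode("overwrite")
   .csv("/path/to/outfile.csv"))

You can then pull the result down in one step with hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv, which concatenates the part files into a single local file.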
Marsupium answered 11/9, 2016 at 7:36 Comment(2)
Can infile.parquet be a location on the hdfs filesystem?Mclaurin
Yes, infile.parquet should be a location on the hdfs filesystem, and so should outfile.csv. You can specify a path without a scheme, as the default is usually hdfs, or spell out hdfs://... explicitly. Avoid file://..., because a local file means a different file to every machine in the cluster. Output to hdfs instead, then transfer the results to your local disk from the command line if you really have to.Marsupium

If there is a Hive table defined over those parquet files (or if you define such a table yourself), you can run a Hive query on it and save the result into a CSV file. Try something along the lines of:

insert overwrite local directory 'dirname'
  row format delimited fields terminated by ','
  select * from tablename;

Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
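
If no such table exists yet, one way to define it yourself (the prerequisite mentioned above) is through Spark's Hive support. This is only a sketch: the table name, column list and /user/foo location are hypothetical and must match your actual parquet schema:

from pyspark.sql import SparkSession

# Assumes Spark was built with Hive support and can reach the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS foo_parquet (
        id BIGINT,
        name STRING,
        value DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/user/foo'
""")

spark.sql("SELECT * FROM foo_parquet LIMIT 5").show()

Once the table is registered in the metastore, the insert overwrite local directory query above can be run against it from Hive.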

Marsupium answered 10/9, 2016 at 17:9 Comment(1)
Thank you. I have never used hive. I can run hadoop from the command line and there is spark installed too.Mclaurin

A more dynamic snippet, for when you do not know the exact name of your parquet file:

import glob

# sqlContext is available in the pyspark shell; the loop picks up every
# snappy-compressed parquet file in the source directory.
for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)
    # append so later iterations don't fail because the output path exists
    df.write.mode("append").csv("[destination]")
    print("csv generated")
Metabolism answered 4/9, 2017 at 15:21 Comment(0)

Another way is to use sling (https://slingdata.io). See below.

sling run --src-stream file://path/to/file.parquet --tgt-object file://path/to/file.csv

sling run --src-stream file://path/to/parquet_folder/ --tgt-object file://path/to/file.csv

sling run --src-stream file://path/to/folder/*.parquet --tgt-object file://path/to/folder/{stream_file_name}.csv
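
If you want to drive the same conversion from a script, a small Python wrapper around the CLI is one option; this assumes sling is installed and on PATH, and reuses the exact flags from the first command above:

import subprocess

# Run the same parquet-to-csv conversion shown above via the sling CLI.
subprocess.run(
    [
        "sling", "run",
        "--src-stream", "file://path/to/file.parquet",
        "--tgt-object", "file://path/to/file.csv",
    ],
    check=True,  # raise CalledProcessError if sling exits non-zero
)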

You can also use YAML/JSON config files:

source: local
target: local

defaults:
  source_options:
    format: parquet
  target_options:
    format: csv

streams:
  'file://path/to/parquet_folder/':
    object: path/to/file.csv

  'file://path/to/folder/*.parquet':
    object: path/to/folder/{stream_file_name}.csv
Jejune answered 2/7 at 10:44 Comment(0)
