parquet Questions

4

Solved

How do I obtain the number of rows of a ParquetDataset that is structured in the form of a folder containing multiple parquet files? I tried from pyarrow.parquet import ParquetDataset a = Parquet...
Swarey asked 1/4, 2020 at 0:39
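A minimal sketch of one way to do this with pyarrow, summing the row counts stored in each file's footer so no data pages are read; the folder name is a placeholder:

```python
import glob
import pyarrow.parquet as pq

# Each parquet footer records its row count, so the data itself is
# never loaded. "my_dataset/" is a hypothetical folder of parquet files.
total_rows = sum(
    pq.ParquetFile(path).metadata.num_rows
    for path in glob.glob("my_dataset/*.parquet")
)
print(total_rows)
```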

3

Solved

This has a different answer from those given in the post above. I am getting an error that reads pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manua...
Proust asked 2/11, 2018 at 16:54
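This error usually means Spark found no parquet files at the given path to infer a schema from. A sketch of one workaround, supplying the schema explicitly; the column names and path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# With an explicit schema, Spark no longer needs existing files to infer from.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).parquet("hdfs:///path/to/data")
```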

4

On an AWS EMR cluster, I'm trying to write a query result to parquet using PySpark but face the following error: Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...
Wreckful asked 10/1, 2020 at 1:13
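This error typically comes from Hive's parquet writer, which rejects empty maps and structs. A sketch of one common workaround, nulling out empty map values before writing; "attrs", df, and the output path are all placeholders:

```python
from pyspark.sql import functions as F

# Hive's ParquetHiveSerDe rejects empty maps, so replace them with null
# before writing; "attrs" is a hypothetical map-typed column.
df_fixed = df.withColumn(
    "attrs",
    F.when(F.size("attrs") > 0, F.col("attrs")).otherwise(F.lit(None)),
)
df_fixed.write.parquet("s3://bucket/output/")
```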

7

Solved

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read the...
Legitimist asked 5/8, 2018 at 17:27
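A minimal pandas sketch, assuming the files live in folder/ and follow the par_file&lt;N&gt; naming from the question; sorting numerically keeps par_file10 from landing before par_file2:

```python
import glob
import pandas as pd

paths = sorted(
    glob.glob("folder/par_file*"),
    key=lambda p: int(p.rsplit("par_file", 1)[1]),
)
# Concatenate the files in numeric order into a single dataframe.
df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)
```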

10

I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view it in the terminal. But I would like some GUI tool to view Parquet files in a more user-friendly format. Do...
Bethina asked 19/3, 2018 at 16:3

4

Solved

I have access to an HDFS file system and can see parquet files with hadoop fs -ls /user/foo. How can I copy those parquet files to my local system and convert them to CSV so I can use them? The fi...
Mclaurin asked 9/9, 2016 at 21:29
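A sketch of one route, assuming hadoop is on the PATH and pyarrow (or fastparquet) is installed locally; the paths are illustrative:

```python
import subprocess
import pandas as pd

# Copy the folder out of HDFS to the local filesystem.
subprocess.run(["hadoop", "fs", "-get", "/user/foo", "./foo"], check=True)

# pandas (with the pyarrow engine) can read a folder of parquet files.
df = pd.read_parquet("./foo")
df.to_csv("foo.csv", index=False)
```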

4

Solved

Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this: ParquetReader<GenericData.Record> reader = null; Path path = new Path("userd...
Flatwise asked 9/4, 2020 at 17:2

4

Solved

I am writing a parquet file from a Spark DataFrame the following way: df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip") This creates a folder with multiple files in...
Brahui asked 15/1, 2019 at 15:20
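Spark always writes a folder, but coalescing to one partition yields a single part file inside it; a sketch reusing the question's own call:

```python
# One partition -> one part file in the output folder. Note this
# funnels all data through a single task, so it only suits small outputs.
df.coalesce(1).write.parquet(
    "path/myfile.parquet", mode="overwrite", compression="gzip"
)
```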

2

I need to read some 'parquet' files in R. There are a few solutions: sparklyr::spark_read_parquet (which requires Spark), reticulate (which needs Python). Now the problem is I am not allowed...
Dukas asked 14/3, 2019 at 13:34

2

I have two datasets stored as parquet files with schemas as below: Dataset 1: id col1 col2 1 v1 v3 2 v2 v4 Dataset 2: id col3 col4 1 v5 v7 2 v6 v8 I want to join the two dat...
Broch asked 9/4 at 13:0
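A minimal PySpark sketch of one way to join the two datasets on the shared id column; the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/data/dataset1")
df2 = spark.read.parquet("/data/dataset2")

# Inner join on the common "id" column, then write the result back out.
joined = df1.join(df2, on="id", how="inner")
joined.write.parquet("/data/joined")
```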

3

Solved

Does Parquet allow appending to a parquet file periodically? How does appending relate to partitioning, if any? For example, if I was able to identify a column that had low cardinality and partitio...
Clew asked 9/9, 2021 at 20:23
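Individual parquet files are immutable, but a partitioned dataset can grow by adding new files under its partition directories. A PySpark sketch under that assumption, where new_rows, date_col, and the path are hypothetical:

```python
# Appending adds fresh part files under the matching partition
# directories rather than modifying any existing parquet file.
new_rows.write.partitionBy("date_col").mode("append").parquet("/data/events")
```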

4

I use the sqlContext.read.parquet function in PySpark to read parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-0...
Portis asked 28/8, 2019 at 20:54

8

Is there any Python library that can be used to just get the schema of a parquet file? Currently we are loading the parquet file into a dataframe in Spark and getting the schema from the dataframe to dis...
Searby asked 10/1, 2017 at 10:54
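A minimal pyarrow sketch: read_schema touches only the file footer, so no Spark session or data load is needed (the filename is a placeholder):

```python
import pyarrow.parquet as pq

# Reads just the footer metadata, not the data pages.
schema = pq.read_schema("example.parquet")
print(schema)
```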

1

I am getting the below error while using sink_parquet on a LazyFrame. Earlier I was using .collect() on the output of the scan_parquet() to convert the result into a DataFrame but unfortunately it ...
Calling asked 31/1, 2023 at 17:7
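A sketch of the intended streaming pattern in Polars, assuming a version where LazyFrame.sink_parquet is available; the filter and paths are hypothetical. sink_parquet raises for operations the streaming engine does not support, which is one common source of this error:

```python
import polars as pl

# Stream from scan to sink without ever materializing a DataFrame.
(
    pl.scan_parquet("big_input.parquet")
    .filter(pl.col("value") > 0)  # placeholder, streaming-friendly operation
    .sink_parquet("filtered.parquet")
)
```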

4

Is it possible to open parquet files and iterate line by line, using generators? This is to avoid loading the whole parquet file into memory. The content of the file is a pandas DataFrame.
Snuffer asked 8/6, 2018 at 7:32
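A minimal sketch with pyarrow, which can read a file in record batches; wrapping that in a generator yields rows without loading the whole file (the batch size and filename are illustrative):

```python
import pyarrow.parquet as pq

def iter_rows(path, batch_size=1024):
    """Yield rows as dicts, one batch at a time, to bound memory use."""
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()

for row in iter_rows("example.parquet"):
    print(row)
    break
```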

4

I have multiple small parquet files generated as the output of a Hive QL job. I would like to merge the output files into a single parquet file. What is the best way to do it using some HDFS or Linux comman...
Spohr asked 27/7, 2016 at 10:49
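If the merged data fits in memory, a simple pyarrow sketch works without any HDFS-side command; the paths are placeholders:

```python
import pyarrow.parquet as pq

# Read the folder of small files as one table, then rewrite it
# as a single parquet file.
table = pq.read_table("hive_output_dir/")
pq.write_table(table, "merged.parquet")
```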

3

Solved

While reading parquet files in Spark, you may face the problem below. App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 faile...
Tamandua asked 30/10, 2019 at 6:14

4

Is it possible to load a parquet file directly into Snowflake? If yes, how? Thanks.
Darya asked 6/7, 2018 at 17:26
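Yes, via a stage plus COPY INTO. A sketch using the snowflake-connector-python package; the connection details and the names pq_fmt, my_stage, and my_table are placeholders:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="...", password="...", account="...",
    warehouse="...", database="...", schema="...",
)
cur = conn.cursor()

# Define a parquet file format, stage the local file, then copy it in,
# mapping parquet columns to table columns by name.
cur.execute("CREATE OR REPLACE FILE FORMAT pq_fmt TYPE = PARQUET")
cur.execute("CREATE OR REPLACE STAGE my_stage FILE_FORMAT = pq_fmt")
cur.execute("PUT file:///tmp/data.parquet @my_stage")
cur.execute("""
    COPY INTO my_table
    FROM @my_stage
    FILE_FORMAT = (FORMAT_NAME = pq_fmt)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
```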

3

I have a parquet file with 10 row groups: In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups) 10 But when I load it using a Dask DataFrame, it is read into a single pa...
Salop asked 30/1, 2020 at 14:27
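A sketch of one fix, assuming a Dask version where read_parquet still accepts split_row_groups; it maps each row group to its own partition:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("/tmp/test2.parquet", split_row_groups=True)
print(ddf.npartitions)  # expect one partition per row group, i.e. 10
```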

3

I'm trying to save a dataframe into a Hive table. In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore. Here's the code: blocs.toDF().repartition($"col1", $"col2", $"col3", ...
Iceland asked 9/1, 2019 at 14:42

6

Solved

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files crea...
Electrolyte asked 16/7, 2018 at 12:0
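For a quick comparison of your own, pandas can switch engines per call, which makes it easy to time both and check cross-compatibility of the files each one writes; the path is a placeholder:

```python
import pandas as pd

# Same file, two engines: time these calls to compare speeds, and try
# reading each output with the *other* engine to test compatibility.
df_pa = pd.read_parquet("data.parquet", engine="pyarrow")
df_fp = pd.read_parquet("data.parquet", engine="fastparquet")

df_pa.to_parquet("out_pyarrow.parquet", engine="pyarrow")
df_fp.to_parquet("out_fastparquet.parquet", engine="fastparquet")
```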

2

Solved

Currently, the Athena query results are in TSV format in S3. Is there any way to configure Athena queries to return results in Parquet format?
Matronly asked 11/10, 2018 at 14:42
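Athena's CTAS feature can write query results as Parquet. A sketch issuing such a query through boto3; the bucket, database, and table names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# CREATE TABLE AS SELECT with format='PARQUET' writes the result set
# to S3 as parquet files instead of the default CSV/TSV output.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE my_db.parquet_result
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/parquet-results/')
        AS SELECT * FROM my_db.source_table
    """,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-output/"},
)
```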

3

Solved

Usually in Impala, we use the COMPRESSION_CODEC before inserting data into a table for which the underlying files are in Parquet format. Commands used to set COMPRESSION_CODEC: set compression_c...
Bordure asked 20/8, 2019 at 12:16

2

Is there a language-agnostic way of representing a Parquet or Arrow schema in a similar way to Avro? For example, an Avro schema might look like this: { "type": "record", "...
Osmunda asked 4/1 at 23:48
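Arrow has no official JSON schema form like Avro's, but its IPC serialization is language-agnostic: any Arrow implementation can decode it. A pyarrow sketch with hypothetical fields:

```python
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Serialize to the Arrow IPC format, which any Arrow language can read back.
buf = schema.serialize()
roundtrip = pa.ipc.read_schema(buf)
assert roundtrip.equals(schema)
```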

4

Solved

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...
Cuckoopint asked 2/12, 2016 at 7:52
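One common workaround is to pass several paths to a single read, since DataFrameReader.parquet accepts multiple locations; the bucket and dates below are placeholders:

```python
# Read distinct day folders in one call instead of relying on
# wildcard discovery across the date hierarchy.
paths = [
    "s3://bucketName/prefix/2016/12/01/",
    "s3://bucketName/prefix/2016/12/02/",
]
df = spark.read.parquet(*paths)
```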
