parquet Questions

4

Solved

How do I obtain the number of rows of a ParquetDataset that is structured in the form of a folder containing multiple parquet files? I tried from pyarrow.parquet import ParquetDataset a = Parquet...
Swarey asked 1/4, 2020 at 0:39
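A minimal sketch of one way to do this with pyarrow, summing the row counts stored in each file's footer so no data pages are read; the folder name is a placeholder:

```python
import glob
import pyarrow.parquet as pq

# Each parquet footer records its row count, so the data itself is
# never loaded. "my_dataset/" is a hypothetical folder of parquet files.
total_rows = sum(
    pq.ParquetFile(path).metadata.num_rows
    for path in glob.glob("my_dataset/*.parquet")
)
print(total_rows)
```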

3

Solved

This has a different answer from those given in the post above. I am getting an error that reads pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manua...
Proust asked 2/11, 2018 at 16:54
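This error usually means Spark found no parquet files at the given path to infer a schema from. A sketch of one workaround, supplying the schema explicitly; the column names and path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# With an explicit schema, Spark no longer needs existing files to infer from.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).parquet("hdfs:///path/to/data")
```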

4

On an AWS EMR cluster, I'm trying to write a query result to parquet using PySpark but face the following error: Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...
Wreckful asked 10/1, 2020 at 1:13
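This error typically comes from Hive's parquet writer, which rejects empty maps and structs. A sketch of one common workaround, nulling out empty map values before writing; "attrs", df, and the output path are all placeholders:

```python
from pyspark.sql import functions as F

# Hive's ParquetHiveSerDe rejects empty maps, so replace them with null
# before writing; "attrs" is a hypothetical map-typed column.
df_fixed = df.withColumn(
    "attrs",
    F.when(F.size("attrs") > 0, F.col("attrs")).otherwise(F.lit(None)),
)
df_fixed.write.parquet("s3://bucket/output/")
```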

7

Solved

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read the...
Legitimist asked 5/8, 2018 at 17:27
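A minimal pandas sketch, assuming the files live in folder/ and follow the par_file&lt;N&gt; naming from the question; sorting numerically keeps par_file10 from landing before par_file2:

```python
import glob
import pandas as pd

paths = sorted(
    glob.glob("folder/par_file*"),
    key=lambda p: int(p.rsplit("par_file", 1)[1]),
)
# Concatenate the files in numeric order into a single dataframe.
df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)
```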

10

I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view it in the terminal. But I would like some GUI tool to view Parquet files in a more user-friendly format. Do...
Bethina asked 19/3, 2018 at 16:3

4

Solved

I have access to an HDFS file system and can see parquet files with hadoop fs -ls /user/foo. How can I copy those parquet files to my local system and convert them to CSV so I can use them? The fi...
Mclaurin asked 9/9, 2016 at 21:29
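A sketch of one route, assuming hadoop is on the PATH and pyarrow (or fastparquet) is installed locally; the paths are illustrative:

```python
import subprocess
import pandas as pd

# Copy the folder out of HDFS to the local filesystem.
subprocess.run(["hadoop", "fs", "-get", "/user/foo", "./foo"], check=True)

# pandas (with the pyarrow engine) can read a folder of parquet files.
df = pd.read_parquet("./foo")
df.to_csv("foo.csv", index=False)
```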

4

Solved

Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this: ParquetReader<GenericData.Record> reader = null; Path path = new Path("userd...
Flatwise asked 9/4, 2020 at 17:2

4

Solved

I am writing a parquet file from a Spark DataFrame the following way: df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip") This creates a folder with multiple files in...
Brahui asked 15/1, 2019 at 15:20
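Spark always writes a folder, but coalescing to one partition yields a single part file inside it; a sketch reusing the question's own call:

```python
# One partition -> one part file in the output folder. Note this
# funnels all data through a single task, so it only suits small outputs.
df.coalesce(1).write.parquet(
    "path/myfile.parquet", mode="overwrite", compression="gzip"
)
```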

2

I need to read some 'parquet' files in R. There are a few solutions: sparklyr::spark_read_parquet (which requires Spark), reticulate (which needs Python). Now the problem is I am not allowed...
Dukas asked 14/3, 2019 at 13:34

2

I have two datasets stored as parquet files with schemas as below: Dataset 1: id col1 col2 1 v1 v3 2 v2 v4 Dataset 2: id col3 col4 1 v5 v7 2 v6 v8 I want to join the two dat...
Broch asked 9/4 at 13:0
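A minimal PySpark sketch of one way to join the two datasets on the shared id column; the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/data/dataset1")
df2 = spark.read.parquet("/data/dataset2")

# Inner join on the common "id" column, then write the result back out.
joined = df1.join(df2, on="id", how="inner")
joined.write.parquet("/data/joined")
```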

3

Solved

Does Parquet allow appending to a parquet file periodically? How does appending relate to partitioning, if any? For example, if I was able to identify a column that had low cardinality and partitio...
Clew asked 9/9, 2021 at 20:23
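Individual parquet files are immutable, but a partitioned dataset can grow by adding new files under its partition directories. A PySpark sketch under that assumption, where new_rows, date_col, and the path are hypothetical:

```python
# Appending adds fresh part files under the matching partition
# directories rather than modifying any existing parquet file.
new_rows.write.partitionBy("date_col").mode("append").parquet("/data/events")
```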

4

I use the sqlContext.read.parquet function in PySpark to read parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-0...
Portis asked 28/8, 2019 at 20:54

8

Is there any Python library that can be used to just get the schema of a parquet file? Currently we are loading the parquet file into a dataframe in Spark and getting the schema from the dataframe to dis...
Searby asked 10/1, 2017 at 10:54
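A minimal pyarrow sketch: read_schema touches only the file footer, so no Spark session or data load is needed (the filename is a placeholder):

```python
import pyarrow.parquet as pq

# Reads just the footer metadata, not the data pages.
schema = pq.read_schema("example.parquet")
print(schema)
```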

1

I am getting the below error while using sink_parquet on a LazyFrame. Earlier I was using .collect() on the output of the scan_parquet() to convert the result into a DataFrame but unfortunately it ...
Calling asked 31/1, 2023 at 17:7
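A sketch of the intended streaming pattern in Polars, assuming a version where LazyFrame.sink_parquet is available; the filter and paths are hypothetical. sink_parquet raises for operations the streaming engine does not support, which is one common source of this error:

```python
import polars as pl

# Stream from scan to sink without ever materializing a DataFrame.
(
    pl.scan_parquet("big_input.parquet")
    .filter(pl.col("value") > 0)  # placeholder, streaming-friendly operation
    .sink_parquet("filtered.parquet")
)
```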

4

Is it possible to open parquet files and iterate line by line, using generators? This is to avoid loading the whole parquet file into memory. The content of the file is a pandas DataFrame.
Snuffer asked 8/6, 2018 at 7:32
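A minimal sketch with pyarrow, which can read a file in record batches; wrapping that in a generator yields rows without loading the whole file (the batch size and filename are illustrative):

```python
import pyarrow.parquet as pq

def iter_rows(path, batch_size=1024):
    """Yield rows as dicts, one batch at a time, to bound memory use."""
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()

for row in iter_rows("example.parquet"):
    print(row)
    break
```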

4

I have multiple small parquet files generated as the output of a Hive QL job. I would like to merge the output files into a single parquet file. What is the best way to do it using some HDFS or Linux comman...
Spohr asked 27/7, 2016 at 10:49
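If the merged data fits in memory, a simple pyarrow sketch works without any HDFS-side command; the paths are placeholders:

```python
import pyarrow.parquet as pq

# Read the folder of small files as one table, then rewrite it
# as a single parquet file.
table = pq.read_table("hive_output_dir/")
pq.write_table(table, "merged.parquet")
```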

3

Solved

While reading parquet files in Spark, you may face the problem below. App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 faile...
Tamandua asked 30/10, 2019 at 6:14

4

Is it possible to load a parquet file directly into Snowflake? If yes, how? Thanks.
Darya asked 6/7, 2018 at 17:26
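Yes, via a stage plus COPY INTO. A sketch using the snowflake-connector-python package; the connection details and the names pq_fmt, my_stage, and my_table are placeholders:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="...", password="...", account="...",
    warehouse="...", database="...", schema="...",
)
cur = conn.cursor()

# Define a parquet file format, stage the local file, then copy it in,
# mapping parquet columns to table columns by name.
cur.execute("CREATE OR REPLACE FILE FORMAT pq_fmt TYPE = PARQUET")
cur.execute("CREATE OR REPLACE STAGE my_stage FILE_FORMAT = pq_fmt")
cur.execute("PUT file:///tmp/data.parquet @my_stage")
cur.execute("""
    COPY INTO my_table
    FROM @my_stage
    FILE_FORMAT = (FORMAT_NAME = pq_fmt)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
```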

3

I have a parquet file with 10 row groups: In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups) 10 But when I load it using a Dask DataFrame, it is read into a single pa...
Salop asked 30/1, 2020 at 14:27
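A sketch of one fix, assuming a Dask version where read_parquet still accepts split_row_groups; it maps each row group to its own partition:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("/tmp/test2.parquet", split_row_groups=True)
print(ddf.npartitions)  # expect one partition per row group, i.e. 10
```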

3

I'm trying to save a dataframe into a Hive table. In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore. Here's the code: blocs.toDF().repartition($"col1", $"col2", $"col3", ...
Iceland asked 9/1, 2019 at 14:42

6

Solved

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files crea...
Electrolyte asked 16/7, 2018 at 12:0
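For a quick comparison of your own, pandas can switch engines per call, which makes it easy to time both and check cross-compatibility of the files each one writes; the path is a placeholder:

```python
import pandas as pd

# Same file, two engines: time these calls to compare speeds, and try
# reading each output with the *other* engine to test compatibility.
df_pa = pd.read_parquet("data.parquet", engine="pyarrow")
df_fp = pd.read_parquet("data.parquet", engine="fastparquet")

df_pa.to_parquet("out_pyarrow.parquet", engine="pyarrow")
df_fp.to_parquet("out_fastparquet.parquet", engine="fastparquet")
```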

2

Solved

Currently, the Athena query results are in TSV format in S3. Is there any way to configure Athena queries to return results in Parquet format?
Matronly asked 11/10, 2018 at 14:42
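Athena's CTAS feature can write query results as Parquet. A sketch issuing such a query through boto3; the bucket, database, and table names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# CREATE TABLE AS SELECT with format='PARQUET' writes the result set
# to S3 as parquet files instead of the default CSV/TSV output.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE my_db.parquet_result
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/parquet-results/')
        AS SELECT * FROM my_db.source_table
    """,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-output/"},
)
```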

3

Solved

Usually in Impala, we use the COMPRESSION_CODEC before inserting data into a table for which the underlying files are in Parquet format. Commands used to set COMPRESSION_CODEC: set compression_c...
Bordure asked 20/8, 2019 at 12:16

2

Is there a language-agnostic way of representing a Parquet or Arrow schema in a similar way to Avro? For example, an Avro schema might look like this: { "type": "record", "...
Osmunda asked 4/1 at 23:48
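Arrow has no official JSON schema form like Avro's, but its IPC serialization is language-agnostic: any Arrow implementation can decode it. A pyarrow sketch with hypothetical fields:

```python
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Serialize to the Arrow IPC format, which any Arrow language can read back.
buf = schema.serialize()
roundtrip = pa.ipc.read_schema(buf)
assert roundtrip.equals(schema)
```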

4

Solved

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...
Cuckoopint asked 2/12, 2016 at 7:52
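One common workaround is to pass several paths to a single read, since DataFrameReader.parquet accepts multiple locations; the bucket and dates below are placeholders:

```python
# Read distinct day folders in one call instead of relying on
# wildcard discovery across the date hierarchy.
paths = [
    "s3://bucketName/prefix/2016/12/01/",
    "s3://bucketName/prefix/2016/12/02/",
]
df = spark.read.parquet(*paths)
```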
