Parquet Questions
4
Background:
DuckDB allows direct querying of Parquet files, e.g. con.execute("SELECT * FROM 'Hierarchy.parquet'")
Parquet allows files to be partitioned by column values. When a parquet ...
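A minimal sketch of both behaviours with the Python duckdb client; the file name comes from the question, and the partitioned dataset path is hypothetical:

import duckdb

con = duckdb.connect()

# Query a single Parquet file directly by path.
rows = con.execute("SELECT * FROM 'Hierarchy.parquet'").fetchall()

# Hive-style partitioned datasets (key=value directories) can be read with
# read_parquet and hive_partitioning enabled; the glob path is hypothetical.
df = con.execute(
    "SELECT * FROM read_parquet('dataset/*/*.parquet', hive_partitioning=true)"
).df()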
2
Solved
I would like to convert this code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.exa...
Karlotta asked 27/9, 2019 at 20:56
2
Solved
I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code).
val sqlContext = new org.apache.spark.sql.SQLContext(sc) ...
Laryngeal asked 22/3, 2016 at 1:1
2
Solved
I recently had a requirement where I needed to generate Parquet files that could be read by Apache Spark using only Java (Using no additional software installations such as: Apache Drill, Hive, Spa...
Wells asked 17/11, 2017 at 16:21
3
For the parsing of a larger file, I need to write in a loop to a large number of parquet files successively. However, it appears that the memory consumed by this task increases over each iteration,...
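One common pattern for this kind of loop, sketched with pyarrow under the assumption that each iteration produces an independent table: write and drop each table inside the loop so Arrow buffers do not accumulate across iterations. The data and file names are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

for i in range(1000):
    table = pa.table({"x": [i] * 10_000})  # stand-in for the real per-iteration data
    pq.write_table(table, f"part_{i}.parquet")
    del table  # drop the reference so the Arrow buffers can be reclaimed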
6
I am reading text files and converting them to parquet files. I am doing it using Spark code. But when I try to run the code I get the following exception:
org.apache.spark.SparkException: Job aborted ...
Revolt asked 16/3, 2016 at 11:52
4
I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the co...
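A small sketch of the setup described above; the bucket path is hypothetical. Note that to_parquet with partition_cols adds new part files on each run rather than replacing the partition contents, so repeated runs accumulate data unless the target partitions are cleared first.

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Writes one subdirectory per partition value, e.g. .../key=a/ and .../key=b/.
df.to_parquet("gs://bucket/path", partition_cols=["key"])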
3
I am working with Python in a Jupyter notebook.
I am trying to access several parquet files from an aws s3 bucket and convert them all into one json file. I know I have access to the data, but I am...
Wiper asked 16/7, 2020 at 17:33
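A minimal sketch of one way to do this, assuming s3fs is installed and using a hypothetical bucket and prefix:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
paths = fs.glob("my-bucket/prefix/*.parquet")  # hypothetical location

# Read each object through s3fs and concatenate into one frame.
frames = [pd.read_parquet(f"s3://{p}") for p in paths]
combined = pd.concat(frames, ignore_index=True)
combined.to_json("combined.json", orient="records")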
0
The pyarrow documentation repeatedly builds a custom UUID type like this:
import pyarrow as pa
class UuidType(pa.PyExtensionType):
def __init__(self):
pa.PyExtensionType.__init__(self, pa.binary(...
Cigarillo asked 5/9, 2023 at 22:16
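For reference, the pattern from the pyarrow documentation that the question quotes looks roughly like this (note that PyExtensionType has since been deprecated in favour of pa.ExtensionType in newer pyarrow releases):

import pyarrow as pa

class UuidType(pa.PyExtensionType):
    def __init__(self):
        # A UUID is 16 raw bytes, so the storage type is fixed-width binary.
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        # Required so the type can be pickled and reconstructed on read.
        return UuidType, ()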
5
I have a Spark project running on a Cloudera VM. In my project I load the data from a parquet file and then process that data. Everything works fine, but the problem is that I need to run this proj...
Bullock asked 15/1, 2016 at 15:9
4
I need to read .parquet files into a Pandas DataFrame in Python on my local machine without downloading the files. The parquet files are stored on Azure blobs with hierarchical directory structure....
Superfluid asked 11/8, 2020 at 4:24
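A minimal sketch, assuming the adlfs fsspec backend is installed; the account, container, and path are hypothetical. pandas streams the blob directly rather than requiring a prior download:

import pandas as pd

df = pd.read_parquet(
    "abfs://mycontainer/folder/data.parquet",
    storage_options={"account_name": "myaccount", "account_key": "<key>"},
)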
4
Solved
Trying to use a specific AWS profile when using Apache Pyarrow. The documentation shows no option to pass a profile name when instantiating S3FileSystem using pyarrow fs [https://arrow.apache.org/do...
Ketcham asked 22/6, 2022 at 16:50
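One workaround sketch: since pyarrow's S3FileSystem takes explicit credentials rather than a profile name, the profile can be resolved with boto3 first. The profile name is hypothetical.

import boto3
from pyarrow import fs

session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials().get_frozen_credentials()

s3 = fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region=session.region_name,
)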
4
Solved
Trying to read a Parquet file in PySpark but getting Py4JJavaError. I even tried reading it from the spark-shell and was able to do so. I cannot understand what I am doing wrong here in terms of th...
Lilley asked 5/7, 2018 at 9:31
0
I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method and ParquetFile.iter_batches.
Using pyarrow.dataset
>>> ...
Ting asked 4/8, 2023 at 1:11
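For context, the two APIs being compared look like this; 'data.parquet' is a hypothetical file and the batch size is illustrative:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Dataset API: scans through the generic dataset machinery.
dataset = ds.dataset("data.parquet", format="parquet")
for batch in dataset.to_batches(batch_size=65_536):
    pass  # process each pyarrow.RecordBatch

# ParquetFile API: iterates directly over the file's row groups.
pf = pq.ParquetFile("data.parquet")
for batch in pf.iter_batches(batch_size=65_536):
    pass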
1
Solved
I am starting to work with the parquet file format.
The official Apache site recommends large row groups of 512MB to 1GB (here).
Several online sources (e.g. this one) suggest that the default row g...
Romanfleuve asked 27/7, 2023 at 17:6
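Whatever the default turns out to be, the row-group size can be set explicitly at write time; a sketch with pyarrow, where the value (counted in rows, not bytes) is illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# row_group_size is in rows; choose it so each group lands in the
# on-disk size range you are targeting.
pq.write_table(table, "out.parquet", row_group_size=128 * 1024)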
3
Solved
I realise parquet is a column format, but with large files, sometimes you don't want to read it all into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't...
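The R arrow package wraps the same engine as pyarrow, so the idea can be sketched in Python terms: stream record batches and stop after the first ~1000 rows instead of materialising the whole file. The file name is hypothetical.

import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")
first_batch = next(pf.iter_batches(batch_size=1000))
head = first_batch.to_pandas()  # only ~1000 rows ever leave the file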
7
I am trying to write a pandas dataframe to parquet file format (introduced in most recent pandas version 0.21.0) in append mode. However, instead of appending to the existing file, the file is over...
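pandas' to_parquet has no append mode itself; one common workaround is to keep a pyarrow ParquetWriter open and write each chunk as an additional row group in the same file. A sketch with illustrative data:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunks = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

writer = None
for chunk in chunks:
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        writer = pq.ParquetWriter("out.parquet", table.schema)
    writer.write_table(table)  # appends a new row group to the same file
writer.close()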
10
I am trying to convert a .csv file to a .parquet file.
The csv file (Temp.csv) has the following format
1,Jon,Doe,Denver
I am using the following python code to convert it into parquet
from py...
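Assuming the truncated import was pyarrow, a minimal sketch for a headerless CSV like the Temp.csv sample; the column names are hypothetical:

import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv(
    "Temp.csv",
    read_options=pv.ReadOptions(column_names=["id", "first", "last", "city"]),
)
pq.write_table(table, "Temp.parquet")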
2
I have a parquet file stored in an S3 bucket. I want to get the list of all columns of the parquet file. I am using S3 Select, but it just gives me a list of all rows without any column headers.
Is ther...
Slider asked 11/8, 2019 at 16:4
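The column list lives in the file's footer metadata, so it can be fetched without scanning any rows; a sketch assuming s3fs, with a hypothetical bucket and key:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
schema = pq.read_schema("my-bucket/path/file.parquet", filesystem=fs)
print(schema.names)  # just the column names, no row data read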
2
I am using the Parquet framework to write parquet files.
I create the parquet writer with this constructor:
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {
p...
Connel asked 13/10, 2014 at 6:7
3
Solved
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect()
I also know I can read a parquet file using pyarrow.parquet's read_table()
However, read_table() accepts a filepat...
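read_table also accepts a filesystem (or an open file object), so the HDFS connection can be reused; a sketch using the newer fs.HadoopFileSystem API with a hypothetical namenode and path:

import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", port=8020)
table = pq.read_table("/data/file.parquet", filesystem=hdfs)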
3
Solved
Reading the Interactive Analysis of Web-Scale Datasets paper, I bumped into the concept of repetition and definition levels.
While I understand the need for these two, to be able to disambiguate occurr...
Metz asked 23/4, 2017 at 6:35
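As a worked illustration of the bookkeeping (the levels below are what Parquet would store, shown as comments rather than runnable logic): take a nullable list column of required strings, so the maximum definition level is 2 and the maximum repetition level is 1.

rows = [["a", "b"], [], None]

# value   rep  def   meaning
# "a"      0    2    first value of a new row, fully defined
# "b"      1    2    repeats at the list level within the same row
# (none)   0    1    the list exists but is empty
# (none)   0    0    the list itself is null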
3
Solved
The Parquet documentation describes a few different encodings here.
Does the encoding change somehow inside the file during read/write, or can I set it?
There is nothing about it in the Spark documentation; I only found slides from sp...
Male asked 3/8, 2017 at 15:11
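Spark mostly picks encodings automatically, but the low-level writers do expose knobs; a sketch with pyarrow's write_table (per its documentation, column_encoding requires dictionary encoding to be disabled for those columns):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000)), "label": ["x"] * 1000})

pq.write_table(
    table,
    "encoded.parquet",
    use_dictionary=False,
    column_encoding={"id": "DELTA_BINARY_PACKED", "label": "PLAIN"},
)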
5
Solved
I have a pandas dataframe. I want to write this dataframe to a parquet file in S3.
I need sample code for this. I tried to google it, but I could not find a working sample.
Arose asked 21/11, 2018 at 16:13
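A minimal working sample, assuming s3fs is installed so pandas can write straight to the bucket; the bucket and key are hypothetical:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_parquet("s3://my-bucket/data/df.parquet", index=False)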
2
Solved
I have a tool that uses a org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to parquet data files.
I can write basic primitive types just fine (INT32, DOUBLE, BINARY string).
I ne...
Mercaptide asked 19/3, 2019 at 18:25