Parquet Questions
4
Background:
DuckDB allows direct querying of Parquet files, e.g. con.execute("SELECT * FROM 'Hierarchy.parquet'")
Parquet allows files to be partitioned by column values. When a parquet ...
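A minimal sketch of both behaviours with the Python duckdb client; the file name comes from the question, and the partitioned dataset path is hypothetical:

import duckdb

con = duckdb.connect()

# Query a single Parquet file directly by path.
rows = con.execute("SELECT * FROM 'Hierarchy.parquet'").fetchall()

# Hive-style partitioned datasets (key=value directories) can be read with
# read_parquet and hive_partitioning enabled; the glob path is hypothetical.
df = con.execute(
    "SELECT * FROM read_parquet('dataset/*/*.parquet', hive_partitioning=true)"
).df()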
2
Solved
I would like to convert this code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.exa...
Karlotta asked 27/9, 2019 at 20:56
2
Solved
I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code).
val sqlContext = new org.apache.spark.sql.SQLContext(sc) ...
Laryngeal asked 22/3, 2016 at 1:1
2
Solved
I recently had a requirement where I needed to generate Parquet files that could be read by Apache Spark using only Java (Using no additional software installations such as: Apache Drill, Hive, Spa...
Wells asked 17/11, 2017 at 16:21
3
For the parsing of a larger file, I need to write in a loop to a large number of parquet files successively. However, it appears that the memory consumed by this task increases over each iteration,...
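One common pattern for this kind of loop, sketched with pyarrow under the assumption that each iteration produces an independent table: write and drop each table inside the loop so Arrow buffers do not accumulate across iterations. The data and file names are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

for i in range(1000):
    table = pa.table({"x": [i] * 10_000})  # stand-in for the real per-iteration data
    pq.write_table(table, f"part_{i}.parquet")
    del table  # drop the reference so the Arrow buffers can be reclaimed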
6
I am reading text files and converting them to parquet files. I am doing it using Spark code. But when I try to run the code I get the following exception:
org.apache.spark.SparkException: Job aborted ...
Revolt asked 16/3, 2016 at 11:52
4
I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the co...
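A small sketch of the setup described above; the bucket path is hypothetical. Note that to_parquet with partition_cols adds new part files on each run rather than replacing the partition contents, so repeated runs accumulate data unless the target partitions are cleared first.

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Writes one subdirectory per partition value, e.g. .../key=a/ and .../key=b/.
df.to_parquet("gs://bucket/path", partition_cols=["key"])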
3
I am working with Python in a Jupyter notebook.
I am trying to access several parquet files from an aws s3 bucket and convert them all into one json file. I know I have access to the data, but I am...
Wiper asked 16/7, 2020 at 17:33
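A minimal sketch of one way to do this, assuming s3fs is installed and using a hypothetical bucket and prefix:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
paths = fs.glob("my-bucket/prefix/*.parquet")  # hypothetical location

# Read each object through s3fs and concatenate into one frame.
frames = [pd.read_parquet(f"s3://{p}") for p in paths]
combined = pd.concat(frames, ignore_index=True)
combined.to_json("combined.json", orient="records")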
0
The pyarrow documentation repeatedly builds a custom UUID type like this:
import pyarrow as pa
class UuidType(pa.PyExtensionType):
def __init__(self):
pa.PyExtensionType.__init__(self, pa.binary(...
Cigarillo asked 5/9, 2023 at 22:16
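For reference, the pattern from the pyarrow documentation that the question quotes looks roughly like this (note that PyExtensionType has since been deprecated in favour of pa.ExtensionType in newer pyarrow releases):

import pyarrow as pa

class UuidType(pa.PyExtensionType):
    def __init__(self):
        # A UUID is 16 raw bytes, so the storage type is fixed-width binary.
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        # Required so the type can be pickled and reconstructed on read.
        return UuidType, ()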
5
I have a Spark project running on a Cloudera VM. In my project I load the data from a parquet file and then process that data. Everything works fine, but the problem is that I need to run this proj...
Bullock asked 15/1, 2016 at 15:9
4
I need to read .parquet files into a Pandas DataFrame in Python on my local machine without downloading the files. The parquet files are stored on Azure blobs with hierarchical directory structure....
Superfluid asked 11/8, 2020 at 4:24
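A minimal sketch, assuming the adlfs fsspec backend is installed; the account, container, and path are hypothetical. pandas streams the blob directly rather than requiring a prior download:

import pandas as pd

df = pd.read_parquet(
    "abfs://mycontainer/folder/data.parquet",
    storage_options={"account_name": "myaccount", "account_key": "<key>"},
)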
4
Solved
Trying to use a specific AWS profile when using Apache Pyarrow. The documentation shows no option to pass a profile name when instantiating S3FileSystem using pyarrow fs [https://arrow.apache.org/do...
Ketcham asked 22/6, 2022 at 16:50
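One workaround sketch: since pyarrow's S3FileSystem takes explicit credentials rather than a profile name, the profile can be resolved with boto3 first. The profile name is hypothetical.

import boto3
from pyarrow import fs

session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials().get_frozen_credentials()

s3 = fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region=session.region_name,
)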
4
Solved
Trying to read a Parquet file in PySpark but getting Py4JJavaError. I even tried reading it from the spark-shell and was able to do so. I cannot understand what I am doing wrong here in terms of th...
Lilley asked 5/7, 2018 at 9:31
0
I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method and ParquetFile.iter_batches.
Using pyarrow.dataset
>>> ...
Ting asked 4/8, 2023 at 1:11
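For context, the two APIs being compared look like this; 'data.parquet' is a hypothetical file and the batch size is illustrative:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Dataset API: scans through the generic dataset machinery.
dataset = ds.dataset("data.parquet", format="parquet")
for batch in dataset.to_batches(batch_size=65_536):
    pass  # process each pyarrow.RecordBatch

# ParquetFile API: iterates directly over the file's row groups.
pf = pq.ParquetFile("data.parquet")
for batch in pf.iter_batches(batch_size=65_536):
    pass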
1
Solved
I am starting to work with the parquet file format.
The official Apache site recommends large row groups of 512MB to 1GB (here).
Several online sources (e.g. this one) suggest that the default row g...
Romanfleuve asked 27/7, 2023 at 17:6
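Whatever the default turns out to be, the row-group size can be set explicitly at write time; a sketch with pyarrow, where the value (counted in rows, not bytes) is illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# row_group_size is in rows; choose it so each group lands in the
# on-disk size range you are targeting.
pq.write_table(table, "out.parquet", row_group_size=128 * 1024)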
3
Solved
I realise parquet is a column format, but with large files, sometimes you don't want to read it all into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't...
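The R arrow package wraps the same engine as pyarrow, so the idea can be sketched in Python terms: stream record batches and stop after the first ~1000 rows instead of materialising the whole file. The file name is hypothetical.

import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")
first_batch = next(pf.iter_batches(batch_size=1000))
head = first_batch.to_pandas()  # only ~1000 rows ever leave the file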
7
I am trying to write a pandas dataframe to parquet file format (introduced in most recent pandas version 0.21.0) in append mode. However, instead of appending to the existing file, the file is over...
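pandas' to_parquet has no append mode itself; one common workaround is to keep a pyarrow ParquetWriter open and write each chunk as an additional row group in the same file. A sketch with illustrative data:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunks = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

writer = None
for chunk in chunks:
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        writer = pq.ParquetWriter("out.parquet", table.schema)
    writer.write_table(table)  # appends a new row group to the same file
writer.close()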
10
I am trying to convert a .csv file to a .parquet file.
The csv file (Temp.csv) has the following format
1,Jon,Doe,Denver
I am using the following python code to convert it into parquet
from py...
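Assuming the truncated import was pyarrow, a minimal sketch for a headerless CSV like the Temp.csv sample; the column names are hypothetical:

import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv(
    "Temp.csv",
    read_options=pv.ReadOptions(column_names=["id", "first", "last", "city"]),
)
pq.write_table(table, "Temp.parquet")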
2
I have a parquet file stored in an S3 bucket. I want to get the list of all columns of the parquet file. I am using S3 Select, but it just gives me a list of all rows without any column headers.
Is ther...
Slider asked 11/8, 2019 at 16:4
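The column list lives in the file's footer metadata, so it can be fetched without scanning any rows; a sketch assuming s3fs, with a hypothetical bucket and key:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
schema = pq.read_schema("my-bucket/path/file.parquet", filesystem=fs)
print(schema.names)  # just the column names, no row data read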
2
I am using the Parquet framework to write parquet files.
I create the parquet writer with this constructor:
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {
p...
Connel asked 13/10, 2014 at 6:7
3
Solved
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect()
I also know I can read a parquet file using pyarrow.parquet's read_table()
However, read_table() accepts a filepat...
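read_table also accepts a filesystem (or an open file object), so the HDFS connection can be reused; a sketch using the newer fs.HadoopFileSystem API with a hypothetical namenode and path:

import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", port=8020)
table = pq.read_table("/data/file.parquet", filesystem=hdfs)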
3
Solved
Reading the Interactive Analysis of Web-Scale Datasets paper, I bumped into the concept of repetition and definition levels.
While I understand the need for these two, to be able to disambiguate occurr...
Metz asked 23/4, 2017 at 6:35
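As a worked illustration of the bookkeeping (the levels below are what Parquet would store, shown as comments rather than runnable logic): take a nullable list column of required strings, so the maximum definition level is 2 and the maximum repetition level is 1.

rows = [["a", "b"], [], None]

# value   rep  def   meaning
# "a"      0    2    first value of a new row, fully defined
# "b"      1    2    repeats at the list level within the same row
# (none)   0    1    the list exists but is empty
# (none)   0    0    the list itself is null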
3
Solved
The Parquet documentation describes a few different encodings here.
Does the encoding change somehow inside the file during read/write, or can I set it?
There is nothing about it in the Spark documentation; I only found slides from sp...
Male asked 3/8, 2017 at 15:11
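Spark mostly picks encodings automatically, but the low-level writers do expose knobs; a sketch with pyarrow's write_table (per its documentation, column_encoding requires dictionary encoding to be disabled for those columns):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000)), "label": ["x"] * 1000})

pq.write_table(
    table,
    "encoded.parquet",
    use_dictionary=False,
    column_encoding={"id": "DELTA_BINARY_PACKED", "label": "PLAIN"},
)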
5
Solved
I have a pandas dataframe. I want to write this dataframe to a parquet file in S3.
I need sample code for this. I tried to google it, but I could not find a working sample.
Arose asked 21/11, 2018 at 16:13
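A minimal working sample, assuming s3fs is installed so pandas can write straight to the bucket; the bucket and key are hypothetical:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_parquet("s3://my-bucket/data/df.parquet", index=False)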
2
Solved
I have a tool that uses a org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to parquet data files.
I can write basic primitive types just fine (INT32, DOUBLE, BINARY string).
I ne...
Mercaptide asked 19/3, 2019 at 18:25