pyarrow Questions

2

Solved

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: ...
Cran asked 14/7, 2019 at 2:06
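A minimal sketch of reading such a file with Dask's pyarrow engine (the file path is a placeholder); nested values typically surface as plain Python objects (lists of dicts/tuples) once materialized:

```python
# Sketch, assuming a PySpark-written file at "data.parquet" (placeholder path).
import dask.dataframe as dd

df = dd.read_parquet("data.parquet", engine="pyarrow")
# Cells of the nested column arrive as Python structures (e.g. lists of dicts),
# so downstream code can treat each cell as ordinary Python data.
print(df.head())
```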

1

Solved

I am learning about the Parquet file format using Python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190 MB CSV file which ends up as a single 3 MB file when saved as sna...
Hippy asked 13/10, 2019 at 5:34
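A quick way to see the compression effect is to write the same frame with several codecs and compare file sizes. A sketch, with "data.csv" standing in for the 190 MB input (brotli requires the codec to be installed):

```python
# Sketch: compare on-disk Parquet sizes across compression codecs.
import os
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder for the 190 MB input
for codec in ("snappy", "gzip", "brotli"):
    path = f"data_{codec}.parquet"
    df.to_parquet(path, engine="pyarrow", compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```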

2

Solved

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in...
Lion asked 14/9, 2019 at 20:37
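One straightforward pattern is to unpack the tuples into per-field Python lists, then hand the columns to pyarrow in one call. A sketch with illustrative names and types:

```python
# Sketch: unpack a dict of tuples into columns and build a pyarrow Table.
import pyarrow as pa

data = {"a": (1, "x", 2.5), "b": (2, "y", 3.5)}  # illustrative values

keys, ints, strs, floats = [], [], [], []
for k, (i, s, f) in data.items():
    keys.append(k)
    ints.append(i)
    strs.append(s)
    floats.append(f)

table = pa.table({"key": keys, "int_col": ints, "str_col": strs, "float_col": floats})
print(table.schema)
```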

2

Solved

I am having what I believe is a common issue with mock patching: I cannot figure out the right thing to patch. I have two questions that I am hoping for help with. Thoughts on ho...
Sulemasulf asked 13/9, 2019 at 18:41
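The usual rule is to patch the name where it is looked up, not where it is defined. A sketch with hypothetical module and function names (`mymodule`, `save_table`):

```python
# Sketch: patch the name where it is *looked up*. If mymodule.py does
# `import pyarrow.parquet as pq` and calls pq.write_table, patch
# "mymodule.pq.write_table", not "pyarrow.parquet.write_table".
from unittest import mock

import mymodule  # hypothetical module under test

def test_save_table():
    with mock.patch("mymodule.pq.write_table") as fake_write:
        mymodule.save_table("out.parquet")  # hypothetical function under test
        fake_write.assert_called_once()
```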

1

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (th...
Crassus asked 17/9, 2019 at 0:54
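One small, concrete instance of the zero-copy claim: a numeric pyarrow array can be viewed as a NumPy array without copying its buffer.

```python
# Sketch: zero-copy view of an Arrow buffer as a NumPy array.
import pyarrow as pa

arr = pa.array([1, 2, 3, 4], type=pa.int64())
np_view = arr.to_numpy(zero_copy_only=True)  # raises if a copy would be required
print(np_view)
```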

2

Solved

I am converting large CSV files into Parquet files for further analysis. I read the CSV data into Pandas and specify the column dtypes as follows: _dtype = {"column_1": "float64", "column_2": "...
Mumps asked 17/2, 2019 at 8:23
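A sketch of the round trip with explicit dtypes, so the column types survive the CSV-to-Parquet conversion (column and file names are placeholders):

```python
# Sketch: read a CSV with an explicit dtype map, then write Parquet.
import pandas as pd

_dtype = {"column_1": "float64", "column_2": "object"}  # placeholder columns
df = pd.read_csv("input.csv", dtype=_dtype)
df.to_parquet("output.parquet", engine="pyarrow")
```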

2

Solved

I am running into a problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everythin...
Lengthwise asked 1/8, 2019 at 18:28
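A common culprit for Arrow failures on Spark 2.x is the Arrow IPC format change introduced in pyarrow 0.15; the workaround documented by Spark is to set ARROW_PRE_0_15_IPC_FORMAT=1 (on driver and executors) or pin an older pyarrow. Whether this applies depends on the exact error, so treat the sketch below as a diagnostic step, not a definitive fix:

```python
# Sketch: known Spark 2.x / pyarrow >= 0.15 compatibility workaround.
# The variable must also reach executor environments (e.g. via spark-env).
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```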

1

Solved

When I execute the code below, I get the following error: ValueError: Table schema does not match schema used to create file. import pandas as pd import pyarrow as pa import pyarrow.parquet as pq fields ...
Drink asked 11/7, 2019 at 4:30
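pq.ParquetWriter requires every table it writes to match the schema it was opened with; casting each table to that schema before write_table avoids the mismatch. A sketch with placeholder fields:

```python
# Sketch: keep every batch consistent with the writer's schema via cast().
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])  # placeholder schema
with pq.ParquetWriter("out.parquet", schema) as writer:
    table = pa.table({"a": [1, 2], "b": ["x", "y"]})
    writer.write_table(table.cast(schema))  # cast prevents the ValueError
```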

1

Solved

A Pandas dataframe is heavyweight, so I want to avoid that. But I want to construct a PyArrow Table in order to store the data in Parquet format. I searched and read the documentation and tried to use t...
Walston asked 17/6, 2019 at 21:24
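A sketch of building a Table directly from Python lists, with no pandas involved, and writing it as Parquet:

```python
# Sketch: pyarrow Table from plain Python lists, then Parquet output.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],  # placeholder column names
)
pq.write_table(table, "data.parquet")
```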

1

Solved

I am converting data from CSV to Parquet using Python (Pandas) to later load it into Google BigQuery. I have some integer columns that contain missing values and since Pandas 0.24.0 I can store the...
Predator asked 3/6, 2019 at 14:26
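With a recent pandas/pyarrow pairing, the nullable integer dtype ("Int64", capital I) keeps missing values as integers instead of forcing float64, and it round-trips to Parquet as a nullable INT64 column. A sketch:

```python
# Sketch: nullable integers surviving the CSV -> pandas -> Parquet path.
import pandas as pd

df = pd.DataFrame({"n": [1, None, 3]})
df["n"] = df["n"].astype("Int64")  # nullable integer dtype (pandas >= 0.24)
df.to_parquet("ints.parquet", engine="pyarrow")
```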

1

Solved

I get this error whenever I try to install pyarrow on my PC. It is 64-bit, so I don't understand it: raise RuntimeError('Not supported on 32-bit Windows') RuntimeError: Not supported on 32-bit Wind...
G asked 24/4, 2019 at 16:12
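This error usually means the Python interpreter itself is 32-bit, even on a 64-bit OS. A one-liner to check the interpreter's pointer width:

```python
# Sketch: report whether the running Python is 32- or 64-bit.
import struct
print(struct.calcsize("P") * 8)  # 64 -> 64-bit Python; 32 -> install 64-bit Python
```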

1

Solved

I can't seem to write a pandas dataframe containing timedeltas to a parquet file through pyarrow. The pyarrow documentation specifies that it can handle NumPy timedelta64 with ms precision. Howev...
Ilyssa asked 13/7, 2018 at 19:29
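A common workaround when the Parquet writer rejects timedeltas is to store the deltas as int64 milliseconds and reconstruct them on read. A sketch (column names are illustrative):

```python
# Sketch: persist timedeltas as int64 milliseconds, since older pyarrow
# Parquet writers have no timedelta logical type.
import pandas as pd

df = pd.DataFrame({"dt": pd.to_timedelta([1, 2, 3], unit="s")})
df["dt_ms"] = (df["dt"].dt.total_seconds() * 1000).astype("int64")
df.drop(columns=["dt"]).to_parquet("deltas.parquet", engine="pyarrow")
```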

1

Solved

I'm trying to use Pandas UDFs (a.k.a. Vectorized UDFs) in Apache Spark 2.4.0 on macOS 10.14.3 (macOS Mojave). I installed pandas and pyarrow using pip (and later pip3). Whenever I execute the sam...
Ghost asked 27/3, 2019 at 14:09
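A minimal scalar Pandas UDF in the Spark 2.4 style is a useful smoke test that pandas and pyarrow are visible to the same interpreter Spark is using:

```python
# Sketch: minimal scalar Pandas UDF (Spark 2.4 API).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(5).select(plus_one("id")).show()
```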

2

Solved

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting wit...
Terbia asked 10/6, 2018 at 9:23
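One way around size-limit failures is to stream the frame through a single ParquetWriter in slices, so no single Arrow chunk has to hold the whole dataset. A sketch, assuming a helper named write_in_chunks:

```python
# Sketch: write a huge DataFrame in row slices through one ParquetWriter.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_in_chunks(df: pd.DataFrame, path: str, rows_per_chunk: int = 1_000_000):
    schema = pa.Table.from_pandas(df.iloc[:1], preserve_index=False).schema
    with pq.ParquetWriter(path, schema) as writer:
        for start in range(0, len(df), rows_per_chunk):
            chunk = df.iloc[start:start + rows_per_chunk]
            writer.write_table(
                pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
            )
```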

4

Solved

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas...
Rusty asked 20/11, 2017 at 13:19
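Arrow-accelerated conversion is the usual first step before sizes in the 20 GB range force a rethink. A sketch of both directions:

```python
# Sketch: Arrow-backed conversion between pandas and Spark dataframes.
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"x": range(10)})
sdf = spark.createDataFrame(pdf)   # pandas -> Spark via Arrow
back = sdf.toPandas()              # Spark -> pandas via Arrow
```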

1

Solved

I would like to give read-only access to a shared DataFrame to multiple worker processes created by multiprocessing.Pool.map(). I would like to avoid copying and pickling. I understood that pyarrow...
Macrae asked 7/2, 2019 at 20:51
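One copy-free pattern (an alternative to the Plasma store) is to write the table once as an Arrow IPC file and have each worker memory-map it; the mapping is zero-copy and nothing is pickled. A sketch with a placeholder path:

```python
# Sketch: share a table read-only across Pool workers via a memory-mapped
# Arrow IPC file; each worker maps the same file without copying the data.
import pyarrow as pa
from multiprocessing import Pool

PATH = "shared.arrow"  # placeholder path

def write_shared(table: pa.Table):
    with pa.OSFile(PATH, "wb") as f:
        with pa.RecordBatchFileWriter(f, table.schema) as w:
            w.write_table(table)

def worker(_):
    source = pa.memory_map(PATH, "r")                 # zero-copy mapping
    table = pa.RecordBatchFileReader(source).read_all()
    return table.num_rows

if __name__ == "__main__":
    write_shared(pa.table({"x": list(range(1000))}))
    with Pool(4) as p:
        print(p.map(worker, range(4)))
```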

2

I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I check...
Shorn asked 19/9, 2018 at 19:53
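pyarrow can do this conversion without pandas: newer versions expose a streaming CSV reader, and pyarrow keeps NULLs as real nulls rather than coercing to NaN. A sketch, batch by batch, so nothing near the full file sits in memory:

```python
# Sketch: stream a large CSV into Parquet with pyarrow (newer versions
# provide pyarrow.csv.open_csv for batch-wise reading).
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

reader = pacsv.open_csv("big.csv")  # placeholder path
writer = None
for batch in reader:
    if writer is None:
        writer = pq.ParquetWriter("big.parquet", batch.schema)
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()
```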

2

I use toPandas() on a DataFrame which is not very large, but I get the following exception: 18/10/31 19:13:19 ERROR Executor: Exception in task 127.2 in stage 13.0 (TID 2264) org.apache.spark.api....
Checkered asked 31/10, 2018 at 11:51
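When toPandas() dies mid-collect, a common triage step is to toggle the Arrow path and shrink the Arrow batch size to bound memory per task; collecting a limited slice first isolates whether size is the trigger. A sketch:

```python
# Sketch: bound the collect to triage a failing toPandas().
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

df = spark.range(1_000_000)                 # stand-in for the real DataFrame
pdf = df.limit(100_000).toPandas()          # collect a bounded slice first
```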

1

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and s...
Jeremiad asked 2/1, 2019 at 15:28
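Parquet files are split into row groups, so one memory-bounded approach is to process a row group at a time and down-sample each before accumulating. A sketch:

```python
# Sketch: row-group-at-a-time down-sampling of an out-of-memory Parquet file.
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")  # placeholder path
samples = []
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()
    samples.append(chunk.sample(frac=0.01))  # keep ~1% of each row group

result = pd.concat(samples, ignore_index=True)
```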

2

Solved

I am trying to install pyarrow using pip in my Alpine Docker image, but pip is unable to find the package. I'm using the following Dockerfile: FROM python:3.6-alpine3.7 RUN apk add --no-cache musl...
Watts asked 1/3, 2018 at 22:35

1

Solved

Using AWS Firehose I am converting incoming records to Parquet. In one example, I have 150k identical records enter Firehose, and a single 30 KB Parquet file gets written to S3. Because of how Firehose p...
Eviaevict asked 26/10, 2018 at 16:38
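The small-files problem is often handled with a compaction pass: read the many small Parquet files and rewrite them as one larger file. A sketch, assuming the files share a schema and have been synced to a local directory:

```python
# Sketch: compact many small Parquet files into one larger file.
import glob
import pyarrow as pa
import pyarrow.parquet as pq

paths = glob.glob("firehose_out/*.parquet")      # placeholder directory
tables = [pq.read_table(p) for p in paths]       # assumes matching schemas
pq.write_table(pa.concat_tables(tables), "compacted.parquet")
```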

2

Solved

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parqu...
Unseasonable asked 28/12, 2017 at 5:29
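Newer pyarrow lets you prune partitions declaratively with the filters argument to pq.ParquetDataset. A sketch with a placeholder partition column:

```python
# Sketch: read only matching partitions of a partitioned Parquet dataset.
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    "dataset_root/",                 # placeholder dataset root
    filters=[("year", "=", 2017)],   # prunes non-matching partition directories
)
table = dataset.read()
```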

1

Solved

I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of parquet files that don't need to have the same schema, but if columns match across parquets they must have the same data typ...
Extradite asked 10/9, 2018 at 19:18
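One way to enforce "same name implies same type" across many files is a canonical schema that every table is cast against before writing (requires a newer pyarrow for Table.select; the schema below is a placeholder):

```python
# Sketch: cast each file's shared columns to one canonical type registry.
import pyarrow as pa
import pyarrow.parquet as pq

canonical = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def write_conformed(table: pa.Table, path: str):
    # Keep only columns known to the canonical schema, in canonical order,
    # and cast them so matching names always share a type across files.
    fields = [canonical.field(n) for n in table.schema.names if n in canonical.names]
    table = table.select([f.name for f in fields]).cast(pa.schema(fields))
    pq.write_table(table, path)
```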

1

I want to convert a big Spark dataframe, with more than 1,000,000 rows, to Pandas. I tried to convert a Spark dataframe to a Pandas dataframe using the following code: spark.conf.set("spark.sql.execu...
Trimeter asked 4/7, 2018 at 13:59
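For a large toPandas(), enabling Arrow and capping records per Arrow batch trades some speed for a flatter memory profile. A sketch:

```python
# Sketch: Arrow-enabled toPandas() with a bounded batch size.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "50000")

pdf = spark.range(2_000_000).toPandas()
```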

2

Solved

Consider the following dataframe import pandas as pd import numpy as np import pyarrow.parquet as pq import pyarrow as pa idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000',...
Foamy asked 12/6, 2018 at 19:57
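Pandas timestamps are nanosecond-precision, while older Parquet logical types top out at micro/milliseconds; pq.write_table can downcast on write. A sketch reconstructing a similar frame:

```python
# Sketch: coerce nanosecond timestamps to ms on write for older Parquet readers.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

idx = pd.date_range("2017-01-01 12:00:00.000", "2017-03-01 12:00:00.000", freq="T")
df = pd.DataFrame({"v": range(len(idx))}, index=idx)

table = pa.Table.from_pandas(df)
pq.write_table(table, "ts.parquet",
               coerce_timestamps="ms", allow_truncated_timestamps=True)
```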
