pyarrow Questions

2

Solved

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: ...
Cran asked 14/7, 2019 at 2:06
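A minimal sketch of reading such a file with Dask's pyarrow engine (the file path is a placeholder); nested values typically surface as plain Python objects (lists of dicts/tuples) once materialized:

```python
# Sketch, assuming a PySpark-written file at "data.parquet" (placeholder path).
import dask.dataframe as dd

df = dd.read_parquet("data.parquet", engine="pyarrow")
# Cells of the nested column arrive as Python structures (e.g. lists of dicts),
# so downstream code can treat each cell as ordinary Python data.
print(df.head())
```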

1

Solved

I am learning about the Parquet file format using Python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190 MB CSV file which ends up as a single 3 MB file when saved as sna...
Hippy asked 13/10, 2019 at 5:34
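A quick way to see the compression effect is to write the same frame with several codecs and compare file sizes. A sketch, with "data.csv" standing in for the 190 MB input (brotli requires the codec to be installed):

```python
# Sketch: compare on-disk Parquet sizes across compression codecs.
import os
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder for the 190 MB input
for codec in ("snappy", "gzip", "brotli"):
    path = f"data_{codec}.parquet"
    df.to_parquet(path, engine="pyarrow", compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```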

2

Solved

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in...
Lion asked 14/9, 2019 at 20:37
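One straightforward pattern is to unpack the tuples into per-field Python lists, then hand the columns to pyarrow in one call. A sketch with illustrative names and types:

```python
# Sketch: unpack a dict of tuples into columns and build a pyarrow Table.
import pyarrow as pa

data = {"a": (1, "x", 2.5), "b": (2, "y", 3.5)}  # illustrative values

keys, ints, strs, floats = [], [], [], []
for k, (i, s, f) in data.items():
    keys.append(k)
    ints.append(i)
    strs.append(s)
    floats.append(f)

table = pa.table({"key": keys, "int_col": ints, "str_col": strs, "float_col": floats})
print(table.schema)
```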

2

Solved

I am having what I believe is a common issue with mock patching: I cannot figure out the right thing to patch. I have two questions that I am hoping for help with. Thoughts on ho...
Sulemasulf asked 13/9, 2019 at 18:41
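The usual rule is to patch the name where it is looked up, not where it is defined. A sketch with hypothetical module and function names (`mymodule`, `save_table`):

```python
# Sketch: patch the name where it is *looked up*. If mymodule.py does
# `import pyarrow.parquet as pq` and calls pq.write_table, patch
# "mymodule.pq.write_table", not "pyarrow.parquet.write_table".
from unittest import mock

import mymodule  # hypothetical module under test

def test_save_table():
    with mock.patch("mymodule.pq.write_table") as fake_write:
        mymodule.save_table("out.parquet")  # hypothetical function under test
        fake_write.assert_called_once()
```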

1

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (th...
Crassus asked 17/9, 2019 at 0:54
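One small, concrete instance of the zero-copy claim: a numeric pyarrow array can be viewed as a NumPy array without copying its buffer.

```python
# Sketch: zero-copy view of an Arrow buffer as a NumPy array.
import pyarrow as pa

arr = pa.array([1, 2, 3, 4], type=pa.int64())
np_view = arr.to_numpy(zero_copy_only=True)  # raises if a copy would be required
print(np_view)
```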

2

Solved

I am converting large CSV files into Parquet files for further analysis. I read the CSV data into Pandas and specify the column dtypes as follows: _dtype = {"column_1": "float64", "column_2": "...
Mumps asked 17/2, 2019 at 8:23
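A sketch of the round trip with explicit dtypes, so the column types survive the CSV-to-Parquet conversion (column and file names are placeholders):

```python
# Sketch: read a CSV with an explicit dtype map, then write Parquet.
import pandas as pd

_dtype = {"column_1": "float64", "column_2": "object"}  # placeholder columns
df = pd.read_csv("input.csv", dtype=_dtype)
df.to_parquet("output.parquet", engine="pyarrow")
```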

2

Solved

I am running into a problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everythin...
Lengthwise asked 1/8, 2019 at 18:28
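A common culprit for Arrow failures on Spark 2.x is the Arrow IPC format change introduced in pyarrow 0.15; the workaround documented by Spark is to set ARROW_PRE_0_15_IPC_FORMAT=1 (on driver and executors) or pin an older pyarrow. Whether this applies depends on the exact error, so treat the sketch below as a diagnostic step, not a definitive fix:

```python
# Sketch: known Spark 2.x / pyarrow >= 0.15 compatibility workaround.
# The variable must also reach executor environments (e.g. via spark-env).
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```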

1

Solved

When I execute the code below, I get the following error: ValueError: Table schema does not match schema used to create file. import pandas as pd import pyarrow as pa import pyarrow.parquet as pq fields ...
Drink asked 11/7, 2019 at 4:30
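pq.ParquetWriter requires every table it writes to match the schema it was opened with; casting each table to that schema before write_table avoids the mismatch. A sketch with placeholder fields:

```python
# Sketch: keep every batch consistent with the writer's schema via cast().
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])  # placeholder schema
with pq.ParquetWriter("out.parquet", schema) as writer:
    table = pa.table({"a": [1, 2], "b": ["x", "y"]})
    writer.write_table(table.cast(schema))  # cast prevents the ValueError
```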

1

Solved

A Pandas dataframe is heavyweight, so I want to avoid that. But I want to construct a PyArrow Table in order to store the data in Parquet format. I searched and read the documentation and tried to use t...
Walston asked 17/6, 2019 at 21:24
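A sketch of building a Table directly from Python lists, with no pandas involved, and writing it as Parquet:

```python
# Sketch: pyarrow Table from plain Python lists, then Parquet output.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],  # placeholder column names
)
pq.write_table(table, "data.parquet")
```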

1

Solved

I am converting data from CSV to Parquet using Python (Pandas) to later load it into Google BigQuery. I have some integer columns that contain missing values and since Pandas 0.24.0 I can store the...
Predator asked 3/6, 2019 at 14:26
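With a recent pandas/pyarrow pairing, the nullable integer dtype ("Int64", capital I) keeps missing values as integers instead of forcing float64, and it round-trips to Parquet as a nullable INT64 column. A sketch:

```python
# Sketch: nullable integers surviving the CSV -> pandas -> Parquet path.
import pandas as pd

df = pd.DataFrame({"n": [1, None, 3]})
df["n"] = df["n"].astype("Int64")  # nullable integer dtype (pandas >= 0.24)
df.to_parquet("ints.parquet", engine="pyarrow")
```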

1

Solved

I get this error whenever I try to install pyarrow on my PC. It is 64-bit, so I don't understand it: raise RuntimeError('Not supported on 32-bit Windows') RuntimeError: Not supported on 32-bit Wind...
G asked 24/4, 2019 at 16:12
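This error usually means the Python interpreter itself is 32-bit, even on a 64-bit OS. A one-liner to check the interpreter's pointer width:

```python
# Sketch: report whether the running Python is 32- or 64-bit.
import struct
print(struct.calcsize("P") * 8)  # 64 -> 64-bit Python; 32 -> install 64-bit Python
```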

1

Solved

I can't seem to write a pandas dataframe containing timedeltas to a parquet file through pyarrow. The pyarrow documentation specifies that it can handle NumPy timedelta64 with ms precision. Howev...
Ilyssa asked 13/7, 2018 at 19:29
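A common workaround when the Parquet writer rejects timedeltas is to store the deltas as int64 milliseconds and reconstruct them on read. A sketch (column names are illustrative):

```python
# Sketch: persist timedeltas as int64 milliseconds, since older pyarrow
# Parquet writers have no timedelta logical type.
import pandas as pd

df = pd.DataFrame({"dt": pd.to_timedelta([1, 2, 3], unit="s")})
df["dt_ms"] = (df["dt"].dt.total_seconds() * 1000).astype("int64")
df.drop(columns=["dt"]).to_parquet("deltas.parquet", engine="pyarrow")
```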

1

Solved

I'm trying to use Pandas UDFs (a.k.a. Vectorized UDFs) in Apache Spark 2.4.0 on macOS 10.14.3 (macOS Mojave). I installed pandas and pyarrow using pip (and later pip3). Whenever I execute the sam...
Ghost asked 27/3, 2019 at 14:09
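A minimal scalar Pandas UDF in the Spark 2.4 style is a useful smoke test that pandas and pyarrow are visible to the same interpreter Spark is using:

```python
# Sketch: minimal scalar Pandas UDF (Spark 2.4 API).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(5).select(plus_one("id")).show()
```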

2

Solved

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting wit...
Terbia asked 10/6, 2018 at 9:23
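One way around size-limit failures is to stream the frame through a single ParquetWriter in slices, so no single Arrow chunk has to hold the whole dataset. A sketch, assuming a helper named write_in_chunks:

```python
# Sketch: write a huge DataFrame in row slices through one ParquetWriter.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_in_chunks(df: pd.DataFrame, path: str, rows_per_chunk: int = 1_000_000):
    schema = pa.Table.from_pandas(df.iloc[:1], preserve_index=False).schema
    with pq.ParquetWriter(path, schema) as writer:
        for start in range(0, len(df), rows_per_chunk):
            chunk = df.iloc[start:start + rows_per_chunk]
            writer.write_table(
                pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
            )
```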

4

Solved

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas...
Rusty asked 20/11, 2017 at 13:19
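Arrow-accelerated conversion is the usual first step before sizes in the 20 GB range force a rethink. A sketch of both directions:

```python
# Sketch: Arrow-backed conversion between pandas and Spark dataframes.
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"x": range(10)})
sdf = spark.createDataFrame(pdf)   # pandas -> Spark via Arrow
back = sdf.toPandas()              # Spark -> pandas via Arrow
```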

1

Solved

I would like to give read-only access to a shared DataFrame to multiple worker processes created by multiprocessing.Pool.map(). I would like to avoid copying and pickling. I understood that pyarrow...
Macrae asked 7/2, 2019 at 20:51
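One copy-free pattern (an alternative to the Plasma store) is to write the table once as an Arrow IPC file and have each worker memory-map it; the mapping is zero-copy and nothing is pickled. A sketch with a placeholder path:

```python
# Sketch: share a table read-only across Pool workers via a memory-mapped
# Arrow IPC file; each worker maps the same file without copying the data.
import pyarrow as pa
from multiprocessing import Pool

PATH = "shared.arrow"  # placeholder path

def write_shared(table: pa.Table):
    with pa.OSFile(PATH, "wb") as f:
        with pa.RecordBatchFileWriter(f, table.schema) as w:
            w.write_table(table)

def worker(_):
    source = pa.memory_map(PATH, "r")                 # zero-copy mapping
    table = pa.RecordBatchFileReader(source).read_all()
    return table.num_rows

if __name__ == "__main__":
    write_shared(pa.table({"x": list(range(1000))}))
    with Pool(4) as p:
        print(p.map(worker, range(4)))
```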

2

I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I check...
Shorn asked 19/9, 2018 at 19:53
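pyarrow can do this conversion without pandas: newer versions expose a streaming CSV reader, and pyarrow keeps NULLs as real nulls rather than coercing to NaN. A sketch, batch by batch, so nothing near the full file sits in memory:

```python
# Sketch: stream a large CSV into Parquet with pyarrow (newer versions
# provide pyarrow.csv.open_csv for batch-wise reading).
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

reader = pacsv.open_csv("big.csv")  # placeholder path
writer = None
for batch in reader:
    if writer is None:
        writer = pq.ParquetWriter("big.parquet", batch.schema)
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()
```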

2

I use toPandas() on a DataFrame which is not very large, but I get the following exception: 18/10/31 19:13:19 ERROR Executor: Exception in task 127.2 in stage 13.0 (TID 2264) org.apache.spark.api....
Checkered asked 31/10, 2018 at 11:51
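When toPandas() dies mid-collect, a common triage step is to toggle the Arrow path and shrink the Arrow batch size to bound memory per task; collecting a limited slice first isolates whether size is the trigger. A sketch:

```python
# Sketch: bound the collect to triage a failing toPandas().
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

df = spark.range(1_000_000)                 # stand-in for the real DataFrame
pdf = df.limit(100_000).toPandas()          # collect a bounded slice first
```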

1

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and s...
Jeremiad asked 2/1, 2019 at 15:28
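Parquet files are split into row groups, so one memory-bounded approach is to process a row group at a time and down-sample each before accumulating. A sketch:

```python
# Sketch: row-group-at-a-time down-sampling of an out-of-memory Parquet file.
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")  # placeholder path
samples = []
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()
    samples.append(chunk.sample(frac=0.01))  # keep ~1% of each row group

result = pd.concat(samples, ignore_index=True)
```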

2

Solved

I am trying to install pyarrow using pip in my Alpine Docker image, but pip is unable to find the package. I'm using the following Dockerfile: FROM python:3.6-alpine3.7 RUN apk add --no-cache musl...
Watts asked 1/3, 2018 at 22:35

1

Solved

Using AWS Firehose I am converting incoming records to Parquet. In one example, I have 150k identical records enter Firehose, and a single 30 KB Parquet file gets written to S3. Because of how Firehose p...
Eviaevict asked 26/10, 2018 at 16:38
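The small-files problem is often handled with a compaction pass: read the many small Parquet files and rewrite them as one larger file. A sketch, assuming the files share a schema and have been synced to a local directory:

```python
# Sketch: compact many small Parquet files into one larger file.
import glob
import pyarrow as pa
import pyarrow.parquet as pq

paths = glob.glob("firehose_out/*.parquet")      # placeholder directory
tables = [pq.read_table(p) for p in paths]       # assumes matching schemas
pq.write_table(pa.concat_tables(tables), "compacted.parquet")
```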

2

Solved

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parqu...
Unseasonable asked 28/12, 2017 at 5:29
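Newer pyarrow lets you prune partitions declaratively with the filters argument to pq.ParquetDataset. A sketch with a placeholder partition column:

```python
# Sketch: read only matching partitions of a partitioned Parquet dataset.
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    "dataset_root/",                 # placeholder dataset root
    filters=[("year", "=", 2017)],   # prunes non-matching partition directories
)
table = dataset.read()
```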

1

Solved

I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of parquet files that don't need to have the same schema, but if columns match across parquets they must have the same data typ...
Extradite asked 10/9, 2018 at 19:18
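One way to enforce "same name implies same type" across many files is a canonical schema that every table is cast against before writing (requires a newer pyarrow for Table.select; the schema below is a placeholder):

```python
# Sketch: cast each file's shared columns to one canonical type registry.
import pyarrow as pa
import pyarrow.parquet as pq

canonical = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def write_conformed(table: pa.Table, path: str):
    # Keep only columns known to the canonical schema, in canonical order,
    # and cast them so matching names always share a type across files.
    fields = [canonical.field(n) for n in table.schema.names if n in canonical.names]
    table = table.select([f.name for f in fields]).cast(pa.schema(fields))
    pq.write_table(table, path)
```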

1

I want to convert a big Spark dataframe, with more than 1,000,000 rows, to Pandas. I tried to convert a Spark dataframe to a Pandas dataframe using the following code: spark.conf.set("spark.sql.execu...
Trimeter asked 4/7, 2018 at 13:59
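For a large toPandas(), enabling Arrow and capping records per Arrow batch trades some speed for a flatter memory profile. A sketch:

```python
# Sketch: Arrow-enabled toPandas() with a bounded batch size.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "50000")

pdf = spark.range(2_000_000).toPandas()
```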

2

Solved

Consider the following dataframe import pandas as pd import numpy as np import pyarrow.parquet as pq import pyarrow as pa idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000',...
Foamy asked 12/6, 2018 at 19:57
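Pandas timestamps are nanosecond-precision, while older Parquet logical types top out at micro/milliseconds; pq.write_table can downcast on write. A sketch reconstructing a similar frame:

```python
# Sketch: coerce nanosecond timestamps to ms on write for older Parquet readers.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

idx = pd.date_range("2017-01-01 12:00:00.000", "2017-03-01 12:00:00.000", freq="T")
df = pd.DataFrame({"v": range(len(idx))}, index=idx)

table = pa.Table.from_pandas(df)
pq.write_table(table, "ts.parquet",
               coerce_timestamps="ms", allow_truncated_timestamps=True)
```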
