pyarrow Questions

3

Solved

Use-case I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to reco...
Hydrocortisone asked 6/4, 2019 at 4:52
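
A minimal sketch of one common round-trip for this, assuming shapely and geopandas are available; file name and columns are illustrative:

import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from shapely import wkb
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)])

# Serialize geometries to WKB bytes so the table is plain Parquet
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry"] = gdf.geometry.apply(lambda geom: geom.wkb)
pq.write_table(pa.Table.from_pandas(df), "features.parquet")

# Read back and rebuild the GeoDataFrame from the WKB column
df2 = pq.read_table("features.parquet").to_pandas()
gdf2 = gpd.GeoDataFrame(df2, geometry=df2["geometry"].apply(wkb.loads))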

1

How do I store custom metadata to a ParquetDataset using pyarrow? For example, if I create a Parquet dataset using Dask import dask dask.datasets.timeseries().to_parquet('temp.parq') I can then re...
Huggins asked 10/9, 2021 at 11:10
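
A sketch of the usual plain-pyarrow pattern (the key name here is illustrative): merge your key/value pairs into the schema metadata before writing, keeping in mind that Arrow metadata keys and values are bytes.

import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
custom = {b"my_app_meta": json.dumps({"version": 1}).encode()}
merged = {**(table.schema.metadata or {}), **custom}
table = table.replace_schema_metadata(merged)
pq.write_table(table, "with_meta.parquet")

# The metadata survives the round trip
print(pq.read_schema("with_meta.parquet").metadata[b"my_app_meta"])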

3

Solved

Please consider the following program as a Minimal Reproducible Example (MRE): import pandas as pd import pyarrow from pyarrow import parquet def foo(): print(pyarrow.__file__) print('version:',pyarrow...
Miler asked 22/7, 2021 at 13:52
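
The flattened snippet in the excerpt reconstructs to a short diagnostic: printing the import path and version, which helps catch a stale or shadowed pyarrow install.

import pyarrow
from pyarrow import parquet

print(pyarrow.__file__)                 # which installation was actually imported
print("version:", pyarrow.__version__)  # which release that file belongs to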

1

Solved

I have a <class 'numpy.ndarray'> array that I would like saved to a parquet file to pass to an ML model I'm building. My array contains 159573 arrays, with 1395 entries in each. Here is ...
Disclosure asked 12/8, 2021 at 15:3
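
One way to do this, sketched with a small stand-in array: store each row of the 2-D ndarray as one list value in a single column.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr = np.random.rand(100, 5)  # stand-in for the real 159573 x 1395 array

# Each row becomes one list element of a single "features" column
table = pa.table({"features": pa.array(list(arr))})
pq.write_table(table, "features.parquet")

# Restore the 2-D shape after reading
restored = np.stack(pq.read_table("features.parquet")["features"].to_pylist())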

1

Solved

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machin...
Halliehallman asked 4/8, 2021 at 13:29
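
One memory-bounded approach with a recent pyarrow (file names illustrative): read the CSV in streaming batches and append each batch to a single Parquet file instead of materializing the whole table.

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

reader = csv.open_csv("big.csv")  # streaming reader, one block at a time
writer = None
try:
    for batch in reader:
        if writer is None:
            writer = pq.ParquetWriter("big.parquet", batch.schema)
        writer.write_table(pa.Table.from_batches([batch]))
finally:
    if writer is not None:
        writer.close()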

1

Solved

I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even though increasing the batch si...
Ogee asked 14/7, 2021 at 9:14
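
A sketch of the batched pattern, with hypothetical database, table, and column names: stream rows from SQLAlchemy and write each batch as its own row group, so peak memory stays near the batch size.

import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///data.db")  # hypothetical source database
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with engine.connect() as conn, pq.ParquetWriter("dump.parquet", schema) as writer:
    result = conn.execution_options(stream_results=True).execute(
        sa.text("SELECT id, value FROM measurements"))  # hypothetical query
    while True:
        rows = result.fetchmany(10_000)  # batch size bounds peak memory
        if not rows:
            break
        ids, values = zip(*rows)
        writer.write_table(pa.table({"id": list(ids), "value": list(values)},
                                    schema=schema))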

2

When I enable pyarrow for a Spark session and then run toPandas(), it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true...
Carthusian asked 28/8, 2018 at 6:16
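
The flag in question is a Spark SQL config; a minimal sketch using the Spark 2.x property name (this error typically traces back to a pyarrow version the installed Spark doesn't support):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.execution.arrow.enabled", "true")  # Spark 2.x name
         .getOrCreate())

pdf = spark.range(1000).toPandas()  # Arrow-accelerated when versions match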

2

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use: import pandas as pd import pyarrow as pa Here...
Facet asked 17/4, 2020 at 12:8
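
Parquet has no ObjectId type, so one common workaround is casting such columns to a supported representation before conversion (sketch assumes bson is installed):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from bson import ObjectId

df = pd.DataFrame({"_id": [ObjectId(), ObjectId()], "value": [1, 2]})

# Cast to str (or use .binary for the raw 12-byte form) before writing
df["_id"] = df["_id"].astype(str)
pq.write_table(pa.Table.from_pandas(df), "docs.parquet")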

2

Solved

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer...
Hive asked 7/10, 2019 at 15:51
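
This stack trace is commonly attributed to the Arrow IPC format change in pyarrow 0.15 running against older Spark; the workaround documented in the Spark 2.x docs is an environment variable (pinning pyarrow below 0.15 is the alternative):

import os

# Make pyarrow >= 0.15 emit the legacy IPC format that older Spark expects;
# it must be set on the executors too (e.g. via spark-env / cluster config).
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"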

2

Solved

If I apply what was discussed here to read parquet files in an S3 bucket into a pandas dataframe, particularly: import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() pandas_dataframe = pq.Pa...
Donor asked 20/6, 2021 at 5:41
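
A condensed version of that pattern, with a placeholder bucket path:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()  # credentials picked up from the environment
dataset = pq.ParquetDataset("my-bucket/path/to/data", filesystem=s3)
df = dataset.read().to_pandas()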

3

I'm trying to install pyarrow with pip3 on OSX 11.0.1 and getting error messages. I'm using Python 3.9 and am not sure if that is the problem. Here is the error summary: ERROR: Command errored out w...
Begat asked 21/11, 2020 at 23:57

1

Solved

Problem: I am trying to save a data frame as a parquet file on Databricks, but I am getting an ArrowTypeError. Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12). Log Trace: ArrowTyp...
Brack asked 12/5, 2021 at 11:41
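
An ArrowTypeError during conversion usually means a column mixes incompatible Python types; a small diagnostic sketch (with a toy frame) to locate offending columns before saving:

import pandas as pd

pdf = pd.DataFrame({"a": [1, "two", 3.0], "b": [1, 2, 3]})  # toy example

for col in pdf.columns:
    kinds = pdf[col].map(type).unique()
    if len(kinds) > 1:
        print(col, kinds)  # then cast explicitly, e.g. pdf[col].astype(str)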

1

Solved

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into...
Maudiemaudlin asked 11/5, 2021 at 20:28

1

Is there a Dask equivalent of Spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket but some of th...
Myrwyn asked 1/4, 2021 at 2:1
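
On the pyarrow side, the dataset API accepts an explicit schema, which unifies files whose own schemas differ; Dask's read_parquet can forward options to this backend (the exact kwarg depends on the installed Dask version). A sketch with illustrative path and fields:

import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
dataset = ds.dataset("bucket/path", format="parquet", schema=schema)
table = dataset.to_table()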

1

Solved

I have a dataframe with a structure like this: Column1 Column2 0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,... 1 (0.00015607803652528673, 0.000...
Sciamachy asked 25/3, 2021 at 14:4
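
Tuple-valued columns generally convert to Arrow list columns without extra work; a minimal sketch with toy values standing in for the long float tuples:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "Column1": [(0.0003, 0.0002), (0.0001, 0.0005)],
    "Column2": [(0.0016, 0.0014), (0.0011, 0.0009)],
})
# Tuples are inferred as list<double> columns
pq.write_table(pa.Table.from_pandas(df), "vectors.parquet")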

1

Solved

As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file. I just want to write the file again with the new rows. import pyarrow as pa source =...
Vitovitoria asked 12/3, 2021 at 7:46
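
Memory-mapped Arrow files can't grow in place, so the usual pattern is read, concatenate, rewrite; a sketch assuming an existing IPC file named table.arrow whose schema matches the new rows:

import pyarrow as pa

with pa.memory_map("table.arrow") as source:
    table = pa.ipc.open_file(source).read_all()

new_rows = pa.table({"x": [10, 11]})  # must match the existing schema
combined = pa.concat_tables([table, new_rows])

# Rewrite the whole file with the combined table
with pa.OSFile("table.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, combined.schema) as writer:
        writer.write_table(combined)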

3

I'm trying to write a Pandas dataframe to a partitioned file: df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo']) TypeError: __cinit__() got an unexpected ke...
Quiche asked 22/10, 2018 at 16:56
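
That TypeError typically indicates a pandas/pyarrow combination that predates partition_cols support; upgrading both packages is one fix, and calling pyarrow directly is a workaround:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"partone": [1, 1], "partwo": ["a", "b"], "val": [1.0, 2.0]})

# Writes a directory tree partitioned by the two columns
pq.write_to_dataset(pa.Table.from_pandas(df), root_path="output.parquet",
                    partition_cols=["partone", "partwo"])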

2

I have created a parquet file with three columns (id, author, title) from a database and want to read the parquet file with a condition (title='Learn Python'). Below is the Python code whic...
Sylvie asked 9/2, 2018 at 22:6
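
With a reasonably recent pyarrow, the condition can be pushed down as a filter instead of being applied after a full read (file name illustrative):

import pyarrow.parquet as pq

table = pq.read_table("books.parquet",
                      columns=["id", "author", "title"],
                      filters=[("title", "=", "Learn Python")])
df = table.to_pandas()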

1

Solved

I plan to join, group by, and filter data using pyarrow (I'm new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas. It seems like pyarrow ha...
Chalmer asked 1/1, 2021 at 17:15
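
Recent pyarrow (7.0 and later, as an assumption about the installed version) does expose these as Table operations; a small sketch with toy tables:

import pyarrow as pa
import pyarrow.compute as pc

left = pa.table({"id": [1, 2, 3], "v": [10, 20, 30]})
right = pa.table({"id": [1, 2], "name": ["a", "b"]})

joined = left.join(right, keys="id")                         # join
summary = joined.group_by("name").aggregate([("v", "sum")])  # group by
filtered = joined.filter(pc.greater(joined["v"], 15))        # filter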

1

I tried installing pyarrow and it is failing with the below error. I also tried the option --no-binary :all: and still hit the same problem. Any help resolving this would be appreciated. Python version:...
Browne asked 23/11, 2020 at 8:24

3

Solved

I want to store the following pandas data frame in a parquet file using PyArrow: import pandas as pd df = pd.DataFrame({'field': [[{}, {}]]}) The type of the field column is list of dicts: fie...
Yurt asked 21/2, 2019 at 22:7
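
Empty dicts give Arrow nothing to infer a struct type from; one workaround is serializing the column to JSON strings (an explicit struct schema is the alternative when the dicts have known fields):

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"field": [[{}, {}]]})
df["field"] = df["field"].apply(json.dumps)  # stores '[{}, {}]' as a string
pq.write_table(pa.Table.from_pandas(df), "field.parquet")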

1

Solved

Is there a way to use a pyarrow parquet dataset to read specific columns and, if possible, filter data instead of reading the whole file into a dataframe?
Preciosity asked 10/9, 2019 at 22:7
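
Yes; with the dataset API, both column projection and row filtering are applied before data is materialized (path and field names illustrative):

import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
table = dataset.to_table(columns=["id", "title"],
                         filter=ds.field("title") == "Learn Python")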

1

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): impo...
Satirical asked 14/9, 2020 at 20:11
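
The ParquetWriter approach does work for larger-than-memory output as long as each chunk stays bounded; a sketch of the incremental pattern:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("x", pa.int64())])
with pq.ParquetWriter("big.parquet", schema) as writer:
    for start in range(0, 1_000_000, 100_000):
        chunk = pa.table({"x": list(range(start, start + 100_000))},
                         schema=schema)
        writer.write_table(chunk)  # each call appends new row groups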

4

Solved

I have a parquet dataset stored on S3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do it using only pyarrow. Here's my attem...
Latchkey asked 10/6, 2019 at 8:33
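
With pyarrow alone, the dataset API plus its native S3 filesystem can do the row selection (bucket path and column name are placeholders):

import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()  # credentials and region from the environment
dataset = ds.dataset("my-bucket/dataset/", filesystem=s3, format="parquet")
rows = dataset.to_table(filter=ds.field("id") == 42)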
