pyarrow Questions

3

Solved

Use-case I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to reco...
Hydrocortisone asked 6/4, 2019 at 4:52
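
A minimal sketch of one common round-trip for this, assuming shapely and geopandas are available; file name and columns are illustrative:

import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from shapely import wkb
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)])

# Serialize geometries to WKB bytes so the table is plain Parquet
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry"] = gdf.geometry.apply(lambda geom: geom.wkb)
pq.write_table(pa.Table.from_pandas(df), "features.parquet")

# Read back and rebuild the GeoDataFrame from the WKB column
df2 = pq.read_table("features.parquet").to_pandas()
gdf2 = gpd.GeoDataFrame(df2, geometry=df2["geometry"].apply(wkb.loads))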

1

How do I store custom metadata to a ParquetDataset using pyarrow? For example, if I create a Parquet dataset using Dask import dask dask.datasets.timeseries().to_parquet('temp.parq') I can then re...
Huggins asked 10/9, 2021 at 11:10
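
A sketch of the usual plain-pyarrow pattern (the key name here is illustrative): merge your key/value pairs into the schema metadata before writing, keeping in mind that Arrow metadata keys and values are bytes.

import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
custom = {b"my_app_meta": json.dumps({"version": 1}).encode()}
merged = {**(table.schema.metadata or {}), **custom}
table = table.replace_schema_metadata(merged)
pq.write_table(table, "with_meta.parquet")

# The metadata survives the round trip
print(pq.read_schema("with_meta.parquet").metadata[b"my_app_meta"])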

3

Solved

Please consider the following program as a Minimal Reproducible Example (MRE): import pandas as pd import pyarrow from pyarrow import parquet def foo(): print(pyarrow.__file__) print('version:',pyarrow...
Miler asked 22/7, 2021 at 13:52
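
The flattened snippet in the excerpt reconstructs to a short diagnostic: printing the import path and version, which helps catch a stale or shadowed pyarrow install.

import pyarrow
from pyarrow import parquet

print(pyarrow.__file__)                 # which installation was actually imported
print("version:", pyarrow.__version__)  # which release that file belongs to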

1

Solved

I have a <class 'numpy.ndarray'> array that I would like saved to a parquet file to pass to an ML model I'm building. My array contains 159573 arrays, with 1395 entries in each. Here is ...
Disclosure asked 12/8, 2021 at 15:3
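
One way to do this, sketched with a small stand-in array: store each row of the 2-D ndarray as one list value in a single column.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr = np.random.rand(100, 5)  # stand-in for the real 159573 x 1395 array

# Each row becomes one list element of a single "features" column
table = pa.table({"features": pa.array(list(arr))})
pq.write_table(table, "features.parquet")

# Restore the 2-D shape after reading
restored = np.stack(pq.read_table("features.parquet")["features"].to_pylist())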

1

Solved

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machin...
Halliehallman asked 4/8, 2021 at 13:29
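
One memory-bounded approach with a recent pyarrow (file names illustrative): read the CSV in streaming batches and append each batch to a single Parquet file instead of materializing the whole table.

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

reader = csv.open_csv("big.csv")  # streaming reader, one block at a time
writer = None
try:
    for batch in reader:
        if writer is None:
            writer = pq.ParquetWriter("big.parquet", batch.schema)
        writer.write_table(pa.Table.from_batches([batch]))
finally:
    if writer is not None:
        writer.close()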

1

Solved

I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even though increasing the batch si...
Ogee asked 14/7, 2021 at 9:14
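
A sketch of the batched pattern, with hypothetical database, table, and column names: stream rows from SQLAlchemy and write each batch as its own row group, so peak memory stays near the batch size.

import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///data.db")  # hypothetical source database
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with engine.connect() as conn, pq.ParquetWriter("dump.parquet", schema) as writer:
    result = conn.execution_options(stream_results=True).execute(
        sa.text("SELECT id, value FROM measurements"))  # hypothetical query
    while True:
        rows = result.fetchmany(10_000)  # batch size bounds peak memory
        if not rows:
            break
        ids, values = zip(*rows)
        writer.write_table(pa.table({"id": list(ids), "value": list(values)},
                                    schema=schema))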

2

When I enable pyarrow for a Spark session and then run toPandas(), it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true...
Carthusian asked 28/8, 2018 at 6:16
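
The flag in question is a Spark SQL config; a minimal sketch using the Spark 2.x property name (this error typically traces back to a pyarrow version the installed Spark doesn't support):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.execution.arrow.enabled", "true")  # Spark 2.x name
         .getOrCreate())

pdf = spark.range(1000).toPandas()  # Arrow-accelerated when versions match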

2

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use: import pandas as pd import pyarrow as pa Here...
Facet asked 17/4, 2020 at 12:8
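
Parquet has no ObjectId type, so one common workaround is casting such columns to a supported representation before conversion (sketch assumes bson is installed):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from bson import ObjectId

df = pd.DataFrame({"_id": [ObjectId(), ObjectId()], "value": [1, 2]})

# Cast to str (or use .binary for the raw 12-byte form) before writing
df["_id"] = df["_id"].astype(str)
pq.write_table(pa.Table.from_pandas(df), "docs.parquet")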

2

Solved

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer...
Hive asked 7/10, 2019 at 15:51
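
This stack trace is commonly attributed to the Arrow IPC format change in pyarrow 0.15 running against older Spark; the workaround documented in the Spark 2.x docs is an environment variable (pinning pyarrow below 0.15 is the alternative):

import os

# Make pyarrow >= 0.15 emit the legacy IPC format that older Spark expects;
# it must be set on the executors too (e.g. via spark-env / cluster config).
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"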

2

Solved

If I apply what was discussed here to read parquet files in an S3 bucket into a pandas dataframe, particularly: import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() pandas_dataframe = pq.Pa...
Donor asked 20/6, 2021 at 5:41
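
A condensed version of that pattern, with a placeholder bucket path:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()  # credentials picked up from the environment
dataset = pq.ParquetDataset("my-bucket/path/to/data", filesystem=s3)
df = dataset.read().to_pandas()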

3

I'm trying to install pyarrow with pip3 on OSX 11.0.1 and getting error messages. I'm using Python 3.9 and am not sure if that is the problem. Here is the error summary: ERROR: Command errored out w...
Begat asked 21/11, 2020 at 23:57

1

Solved

Problem: I am trying to save a data frame as a parquet file on Databricks, but I am getting an ArrowTypeError. Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12). Log Trace: ArrowTyp...
Brack asked 12/5, 2021 at 11:41
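
An ArrowTypeError during conversion usually means a column mixes incompatible Python types; a small diagnostic sketch (with a toy frame) to locate offending columns before saving:

import pandas as pd

pdf = pd.DataFrame({"a": [1, "two", 3.0], "b": [1, 2, 3]})  # toy example

for col in pdf.columns:
    kinds = pdf[col].map(type).unique()
    if len(kinds) > 1:
        print(col, kinds)  # then cast explicitly, e.g. pdf[col].astype(str)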

1

Solved

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into...
Maudiemaudlin asked 11/5, 2021 at 20:28

1

Is there a Dask equivalent of Spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket but some of th...
Myrwyn asked 1/4, 2021 at 2:1
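
On the pyarrow side, the dataset API accepts an explicit schema, which unifies files whose own schemas differ; Dask's read_parquet can forward options to this backend (the exact kwarg depends on the installed Dask version). A sketch with illustrative path and fields:

import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
dataset = ds.dataset("bucket/path", format="parquet", schema=schema)
table = dataset.to_table()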

1

Solved

I have a dataframe with a structure like this: Column1 Column2 0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,... 1 (0.00015607803652528673, 0.000...
Sciamachy asked 25/3, 2021 at 14:4
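
Tuple-valued columns generally convert to Arrow list columns without extra work; a minimal sketch with toy values standing in for the long float tuples:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "Column1": [(0.0003, 0.0002), (0.0001, 0.0005)],
    "Column2": [(0.0016, 0.0014), (0.0011, 0.0009)],
})
# Tuples are inferred as list<double> columns
pq.write_table(pa.Table.from_pandas(df), "vectors.parquet")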

1

Solved

As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file. I just want to write the file again with the new rows. import pyarrow as pa source =...
Vitovitoria asked 12/3, 2021 at 7:46
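
Memory-mapped Arrow files can't grow in place, so the usual pattern is read, concatenate, rewrite; a sketch assuming an existing IPC file named table.arrow whose schema matches the new rows:

import pyarrow as pa

with pa.memory_map("table.arrow") as source:
    table = pa.ipc.open_file(source).read_all()

new_rows = pa.table({"x": [10, 11]})  # must match the existing schema
combined = pa.concat_tables([table, new_rows])

# Rewrite the whole file with the combined table
with pa.OSFile("table.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, combined.schema) as writer:
        writer.write_table(combined)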

3

I'm trying to write a Pandas dataframe to a partitioned file: df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo']) TypeError: __cinit__() got an unexpected ke...
Quiche asked 22/10, 2018 at 16:56
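
That TypeError typically indicates a pandas/pyarrow combination that predates partition_cols support; upgrading both packages is one fix, and calling pyarrow directly is a workaround:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"partone": [1, 1], "partwo": ["a", "b"], "val": [1.0, 2.0]})

# Writes a directory tree partitioned by the two columns
pq.write_to_dataset(pa.Table.from_pandas(df), root_path="output.parquet",
                    partition_cols=["partone", "partwo"])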

2

I have created a parquet file with three columns (id, author, title) from a database and want to read the parquet file with a condition (title='Learn Python'). Below is the Python code whic...
Sylvie asked 9/2, 2018 at 22:6
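
With a reasonably recent pyarrow, the condition can be pushed down as a filter instead of being applied after a full read (file name illustrative):

import pyarrow.parquet as pq

table = pq.read_table("books.parquet",
                      columns=["id", "author", "title"],
                      filters=[("title", "=", "Learn Python")])
df = table.to_pandas()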

1

Solved

I plan to join, group by, and filter data using pyarrow (I'm new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas. It seems like pyarrow ha...
Chalmer asked 1/1, 2021 at 17:15
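
Recent pyarrow (7.0 and later, as an assumption about the installed version) does expose these as Table operations; a small sketch with toy tables:

import pyarrow as pa
import pyarrow.compute as pc

left = pa.table({"id": [1, 2, 3], "v": [10, 20, 30]})
right = pa.table({"id": [1, 2], "name": ["a", "b"]})

joined = left.join(right, keys="id")                         # join
summary = joined.group_by("name").aggregate([("v", "sum")])  # group by
filtered = joined.filter(pc.greater(joined["v"], 15))        # filter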

1

I tried installing pyarrow and it is failing with the below error. I also tried the option --no-binary :all: and still hit the same problem. Any help resolving this would be appreciated. Python version:...
Browne asked 23/11, 2020 at 8:24

3

Solved

I want to store the following pandas data frame in a parquet file using PyArrow: import pandas as pd df = pd.DataFrame({'field': [[{}, {}]]}) The type of the field column is list of dicts: fie...
Yurt asked 21/2, 2019 at 22:7
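
Empty dicts give Arrow nothing to infer a struct type from; one workaround is serializing the column to JSON strings (an explicit struct schema is the alternative when the dicts have known fields):

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"field": [[{}, {}]]})
df["field"] = df["field"].apply(json.dumps)  # stores '[{}, {}]' as a string
pq.write_table(pa.Table.from_pandas(df), "field.parquet")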

1

Solved

Is there a way to use a pyarrow parquet dataset to read specific columns and, if possible, filter data instead of reading the whole file into a dataframe?
Preciosity asked 10/9, 2019 at 22:7
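
Yes; with the dataset API, both column projection and row filtering are applied before data is materialized (path and field names illustrative):

import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
table = dataset.to_table(columns=["id", "title"],
                         filter=ds.field("title") == "Learn Python")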

1

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): impo...
Satirical asked 14/9, 2020 at 20:11
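
The ParquetWriter approach does work for larger-than-memory output as long as each chunk stays bounded; a sketch of the incremental pattern:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("x", pa.int64())])
with pq.ParquetWriter("big.parquet", schema) as writer:
    for start in range(0, 1_000_000, 100_000):
        chunk = pa.table({"x": list(range(start, start + 100_000))},
                         schema=schema)
        writer.write_table(chunk)  # each call appends new row groups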

4

Solved

I have a parquet dataset stored on S3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do it using only pyarrow. Here's my attem...
Latchkey asked 10/6, 2019 at 8:33
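
With pyarrow alone, the dataset API plus its native S3 filesystem can do the row selection (bucket path and column name are placeholders):

import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()  # credentials and region from the environment
dataset = ds.dataset("my-bucket/dataset/", filesystem=s3, format="parquet")
rows = dataset.to_table(filter=ds.field("id") == 42)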
