pyarrow Questions
3
Solved
Use-case
I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to reco...
1
How do I store custom metadata to a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then re...
3
Solved
Please consider the following program as a Minimal Reproducible Example (MRE):

import pandas as pd
import pyarrow
from pyarrow import parquet
def foo():
print(pyarrow.__file__)
print('version:',pyarrow...
Miler asked 22/7, 2021 at 13:52
1
Solved
I have a <class 'numpy.ndarray'> array that I would like saved to a parquet file to pass to a ML model I'm building.
My array holds 159573 arrays, each of which has 1395 entries.
Here is ...
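A 2-D ndarray like this can be stored as a fixed-size-list column so the row shape is recoverable; a sketch using a small stand-in for the (159573, 1395) array:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Small stand-in for the (159573, 1395) array from the question.
arr = np.random.rand(10, 4)

# Flatten, then wrap each row as a fixed-size list of length arr.shape[1].
flat = pa.array(arr.reshape(-1))
col = pa.FixedSizeListArray.from_arrays(flat, arr.shape[1])
pq.write_table(pa.table({"features": col}), "features2d.parquet")

# Reassemble the ndarray on read.
back = pq.read_table("features2d.parquet")
restored = np.array(back.column("features").to_pylist())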
1
Solved
What I am trying to do
I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machin...
Halliehallman asked 4/8, 2021 at 13:29
1
Solved
What I am trying to do
I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machin...
Madra asked 28/7, 2021 at 5:54
1
Solved
I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even though increasing the batch si...
2
When I set pyarrow to true while using a Spark session, and then run toPandas(), it throws the error:
"toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true...
2
I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
Here...
2
Solved
I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are
java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer...
Hive asked 7/10, 2019 at 15:51
2
Solved
If I apply what was discussed here to read parquet files in an S3 bucket to a pandas dataframe, particularly:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.Pa...
Donor asked 20/6, 2021 at 5:41
3
I'm trying to install pyarrow with pip3 on OSX 11.0.1, and getting error messages.
I'm using Python 3.9 and not sure if that is the problem.
Here is the error summary:
ERROR: Command errored out w...
1
Solved
Problem
I am trying to save a data frame as a parquet file on Databricks, getting the ArrowTypeError.
Databricks Runtime Version:
7.6 ML (includes Apache Spark 3.0.1, Scala 2.12)
Log Trace
ArrowTyp...
Brack asked 12/5, 2021 at 11:41
1
Solved
What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into...
Maudiemaudlin asked 11/5, 2021 at 20:28
1
Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow?
I have a bunch of parquet files in a bucket but some of th...
Myrwyn asked 1/4, 2021 at 2:1
1
Solved
I have a dataframe with a structure like this:
Coumn1 Coumn2
0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,...
1 (0.00015607803652528673, 0.000...
1
Solved
As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file.
I just want to write the file again with the new rows.
import pyarrow as pa
source =...
Vitovitoria asked 12/3, 2021 at 7:46
3
I'm trying to write a Pandas dataframe to a partitioned file:
df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])
TypeError: __cinit__() got an unexpected ke...
2
I have created a parquet file with three columns (id, author, title) from database and want to read the parquet file with a condition (title='Learn Python').
Below mentioned is the python code whic...
Sylvie asked 9/2, 2018 at 22:6
1
Solved
I plan to:
join
group by
filter
data using pyarrow (new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas.
Seems like pyarrow ha...
Chalmer asked 1/1, 2021 at 17:15
1
I tried installing pyarrow and it's failing with the below error. I also tried the option --no-binary :all: and still the same problem. Any help to resolve this will really help me.
Python version:...
Browne asked 23/11, 2020 at 8:24
3
Solved
I want to store the following pandas data frame in a parquet file using PyArrow:
import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})
The type of the field column is list of dicts:
fie...
1
Solved
Is there a way to use pyarrow parquet dataset to read specific columns and if possible filter data instead of reading a whole file into dataframe?
1
I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC):
impo...
Satirical asked 14/9, 2020 at 20:11
4
Solved
I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that using only pyarrow.
Here's my attem...
© 2022 - 2024 — McMap. All rights reserved.