pyarrow Questions

4

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code import pandas as pd import pyarrow as pa class Player: def __init__(self, name, a...
Baskerville asked 7/1, 2020 at 22:07
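Arrow has no column type for arbitrary Python objects, so a conversion like this generally needs the object's attributes flattened into plain columns first. A minimal sketch (the name/age fields are hypothetical completions of the truncated excerpt):

```python
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age):  # hypothetical fields
        self.name = name
        self.age = age

players = [Player("Ann", 30), Player("Bo", 25)]

# Flatten object attributes into ordinary columns; pa.Table.from_pandas
# cannot serialize a column of raw Player instances.
df = pd.DataFrame({"name": [p.name for p in players],
                   "age": [p.age for p in players]})
table = pa.Table.from_pandas(df)
```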

2

Solved

I am trying to pip install Superset (pip install apache-superset) and am getting the error below: Traceback (most recent call last): File "c:\users\saurav_nimesh\appdata\local\programs\python\python3...
Coxcomb asked 27/2, 2020 at 16:41

1

Solved

I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have them accessible (read-only) from another. So I create a pyarrow....
Silverplate asked 8/2, 2023 at 23:34
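One common zero-copy route for this is writing the table to an Arrow IPC file in the producer process and memory-mapping it in the readers. A sketch (file name hypothetical):

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})

# Producer: persist the table in Arrow IPC format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer (another process): memory-map the file; the read is
# zero-copy and effectively read-only.
with pa.memory_map("shared.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```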

5

Solved

How do you append/update to a parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', '...
Marshy asked 4/11, 2017 at 17:59
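A finished Parquet file can't be appended to in place; the usual pattern is to keep a ParquetWriter open and emit each new table as another row group. A sketch under that assumption:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df1 = pd.DataFrame({"one": [1.0, 2.0], "two": ["foo", "bar"]})
df2 = pd.DataFrame({"one": [3.0, 4.0], "two": ["baz", "qux"]})
t1 = pa.Table.from_pandas(df1)
t2 = pa.Table.from_pandas(df2)

# Each write_table call adds a row group to the same file; once the
# writer closes, the file is final.
with pq.ParquetWriter("out.parquet", t1.schema) as writer:
    writer.write_table(t1)
    writer.write_table(t2)
```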

3

I am trying to read a decently large Parquet file (~2 GB, about 30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarro...
Antacid asked 11/2, 2020 at 3:59
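When the file is bigger than is comfortable for RAM, two things usually help: read only the needed columns, or stream the file in bounded batches. A sketch (file and column names hypothetical):

```python
import pandas as pd
import pyarrow.parquet as pq

# Option 1: column pruning - Parquet is columnar, so untouched columns
# are never read off disk.
df = pd.read_parquet("big.parquet", columns=["col_a", "col_b"])

# Option 2: stream fixed-size record batches instead of one big read.
pf = pq.ParquetFile("big.parquet")
for batch in pf.iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()
    ...  # process the chunk, then let it go out of scope
```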

5

Solved

I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in table2.column_names: if col_name in m...
Undrape asked 22/1, 2021 at 13:01
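Arrow tables are immutable, so the loop can't write into the existing columns; instead, compute a replacement array and swap it in with set_column, which returns a new table. A sketch (the column names and the doubling are stand-ins for the truncated condition):

```python
import pyarrow as pa
import pyarrow.compute as pc

table2 = pa.table({"price": [1.0, 2.5, 4.0], "name": ["a", "b", "c"]})

for col_name in table2.column_names:
    if col_name == "price":  # stand-in for the truncated membership test
        idx = table2.schema.get_field_index(col_name)
        updated = pc.multiply(table2.column(col_name), 2.0)
        # set_column returns a *new* table; rebind the name to keep it.
        table2 = table2.set_column(idx, col_name, updated)
```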

3

I have the below code, which queries a database of about 500k rows, and it throws a SIGKILL when it hits rows = cur.fetchall(). I've tried to iterate through the cursor rather than load it all up in...
Elonore asked 2/9, 2020 at 20:38
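fetchall() materializes all ~500k rows at once, which is what the OOM killer reacts to; fetching in fixed-size chunks with fetchmany keeps memory flat. A sketch using the standard DB-API (sqlite3 stands in for whatever driver the question uses):

```python
import sqlite3

conn = sqlite3.connect("example.db")  # any DB-API connection works alike
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")

# Pull rows in bounded chunks instead of one giant fetchall().
while True:
    rows = cur.fetchmany(10_000)
    if not rows:
        break
    for row in rows:
        ...  # process each row; nothing accumulates in memory
```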

2

It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing i...
Unclasp asked 11/11, 2020 at 17:48
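pyarrow can build the Table straight from Python data, so pandas never enters the picture. A sketch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build the Arrow table directly from columns; no DataFrame involved.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["a", "b", "c"], type=pa.string()),
})
pq.write_table(table, "direct.parquet")
```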

9

Solved

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this: import pyarrow.parquet as pq path = 'par...
Anglicanism asked 11/7, 2017 at 20:01
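With recent pandas/pyarrow/s3fs versions the hack is unnecessary: s3:// URIs are handled directly. A sketch (bucket and key hypothetical):

```python
import pandas as pd
import pyarrow.parquet as pq
import s3fs

# High level: pandas resolves s3:// through s3fs on its own.
df = pd.read_parquet("s3://my-bucket/path/file.parquet")

# Lower level, through pyarrow explicitly:
fs = s3fs.S3FileSystem()
table = pq.read_table("my-bucket/path/file.parquet", filesystem=fs)
```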

5

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.t...
Cocker asked 6/9, 2020 at 20:33
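to_parquet itself always emits a single file, so the usual workaround is to slice the frame and write each slice separately, tuning the row count until files land near the target size. A sketch (the helper is hypothetical):

```python
import numpy as np
import pandas as pd

def write_chunked(df: pd.DataFrame, rows_per_file: int, prefix: str) -> None:
    """Write df as several Parquet files of ~rows_per_file rows each."""
    for i, start in enumerate(range(0, len(df), rows_per_file)):
        df.iloc[start:start + rows_per_file].to_parquet(
            f"{prefix}_{i:04d}.parquet")

# Small stand-in frame; for the question's 100M x 100 frame you would
# raise rows_per_file until each file is roughly the size you want.
df = pd.DataFrame(np.random.rand(10_000, 4), columns=list("abcd"))
write_chunked(df, 2_500, "part")
```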

3

I'm trying to overwrite my parquet files in S3 with pyarrow. I've seen the documentation and I haven't found anything. Here is my code: from s3fs.core import S3FileSystem import pyarrow ...
Trotta asked 30/8, 2018 at 11:22
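S3 objects can't be edited in place, but writing to the same key through an s3fs file handle replaces the object, which amounts to an overwrite. A sketch (key hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs.core import S3FileSystem

table = pa.table({"x": [1, 2, 3]})

fs = S3FileSystem()
# Writing to an existing key replaces the whole object.
with fs.open("my-bucket/path/data.parquet", "wb") as f:
    pq.write_table(table, f)
```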

2

Solved

I created an egg and a whl file of pyarrow and put them on S3, to be called in a Python shell job. I received this message. Job code: import pyarrow. It raises an error (same structure for the whl): Traceback (most...
Minimus asked 3/3, 2020 at 17:47

1

Solved

I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidt...
Honeysuckle asked 25/5, 2022 at 12:15
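One bandwidth-friendly approach is to stream record batches from the S3 object and stop as soon as N rows have arrived, so only a prefix of the file is downloaded. A sketch using pyarrow directly (bucket and key hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

N = 1_000
fs = s3fs.S3FileSystem()
pf = pq.ParquetFile(fs.open("my-bucket/path/big.parquet", "rb"))

batches, rows = [], 0
for batch in pf.iter_batches(batch_size=min(N, 65_536)):
    batches.append(batch)
    rows += batch.num_rows
    if rows >= N:
        break  # stop reading; the rest of the file is never fetched
df = pa.Table.from_batches(batches).to_pandas().head(N)
```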

2

Solved

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a colum...
Cheatham asked 3/1, 2018 at 18:48
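The practical difference shows up at write time: Parquet is a compressed storage format, while Feather is essentially the Arrow IPC layout written to disk, so reads can be near zero-copy at the cost of larger files. A quick illustration:

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# Parquet: compressed, column-chunked, built for long-term storage.
pq.write_table(table, "data.parquet")

# Feather / Arrow IPC: the in-memory layout on disk, optimized for
# fast (near zero-copy) reads rather than small files.
feather.write_feather(table, "data.feather")
```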

5

Solved

I'm looking for ways to read data from multiple partitioned directories from S3 using Python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet data_folder/serial_number=2...
Draughtsman asked 13/7, 2017 at 13:56
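pyarrow's dataset API discovers hive-style key=value directories like these and turns them into queryable columns. A sketch (bucket name hypothetical):

```python
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

# serial_number=.../cur_date=... directory names become columns.
dataset = ds.dataset("my-bucket/data_folder", filesystem=fs,
                     format="parquet", partitioning="hive")

# Partition filters prune whole directories before any file is read.
df = dataset.to_table(filter=ds.field("serial_number") == 1).to_pandas()
```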

0

I have a parquet dataset stored in my S3 bucket with multiple partition files. I want to read it into my pandas dataframe, but am getting this ArrowInvalid error when I didn't before. Occasionally,...
Emanuele asked 28/4, 2022 at 18:09
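An ArrowInvalid that appears only occasionally on a multi-file dataset often means one partition file's schema has drifted from the rest. A diagnostic sketch that diffs each file's schema against the first (the layout is hypothetical):

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
files = fs.glob("my-bucket/dataset/**/*.parquet")  # hypothetical layout

# Compare every file's Arrow schema to the first file's.
reference = pq.ParquetFile(fs.open(files[0], "rb")).schema_arrow
for path in files[1:]:
    schema = pq.ParquetFile(fs.open(path, "rb")).schema_arrow
    if not schema.equals(reference):
        print("schema mismatch:", path)  # the file to repair or rewrite
```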

2

Solved

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries, like this: import pandas as pd df = ...
Latanya asked 5/8, 2020 at 16:42
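When the dicts are ragged (different keys per row), they don't map onto a single Arrow struct type; a common workaround is serializing them to JSON strings before writing. A sketch:

```python
import json
import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "meta": [{"a": 1}, {"b": [2, 3]}]})  # ragged dicts

# JSON strings form a plain string column, which Parquet handles fine;
# apply json.loads after reading to get the dicts back.
df["meta"] = df["meta"].map(json.dumps)
df.to_parquet("with_dicts.parquet")
```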

2

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentat...
Ejective asked 28/7, 2021 at 11:11
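With the new pyarrow.fs interface, one way is to stream the local file into an HDFS output stream, which achieves the same upload as copyFromLocal. A sketch (host, port, and paths all hypothetical):

```python
import shutil
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", port=8020)  # hypothetical

# Stream the local bytes straight into an HDFS output stream.
with open("/tmp/local.csv", "rb") as src, \
        hdfs.open_output_stream("/data/local.csv") as dst:
    shutil.copyfileobj(src, dst)
```

Recent pyarrow versions also expose pyarrow.fs.copy_files, which may cover this in a single call.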

1

Solved

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said Tabl...
Performance asked 10/3, 2022 at 17:58
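The immutability is at the buffer level; "mutating" methods on Table hand back a new Table that shares the old buffers where possible. A quick demonstration:

```python
import pyarrow as pa

t1 = pa.table({"x": [1, 2, 3]})

# append_column returns a new Table; t1 itself never changes, and the
# new table reuses t1's immutable buffers for column "x".
t2 = t1.append_column("y", pa.array([4, 5, 6]))

print(t1.column_names)  # ['x']
print(t2.column_names)  # ['x', 'y']
```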

1

Solved

I have a calculator that iterates over a couple hundred objects and produces an Nx1 array for each of those objects, N here being 1-10m depending on configuration. Right now I am summing over these by ...
Doretha asked 24/2, 2022 at 18:50
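If the intermediate Nx1 arrays don't need to be kept, accumulating into one preallocated buffer avoids holding hundreds of them alive at once. A sketch (the per-object computation is a stand-in, since the calculator isn't shown):

```python
import numpy as np

N = 1_000_000
total = np.zeros(N)

for _ in range(200):               # stand-in for the calculator's loop
    arr = np.random.rand(N)        # hypothetical per-object Nx1 result
    np.add(total, arr, out=total)  # accumulate in place, no intermediates
```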

3

Solved

I have a problem using the pyarrow.orc module in Anaconda on Windows 10. import pyarrow.orc as orc throws an exception: Traceback (most recent call last): File "<stdin>", line 1, in <modu...
Solvable asked 12/11, 2019 at 15:47

2

Solved

I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work. I have a Spark cluster with one master node with 8 cores and 64GB, along with...
Exclave asked 26/12, 2019 at 20:53
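The short version: Spark ships each partition to the Python workers as Arrow record batches, and a pandas UDF receives whole pd.Series rather than one value at a time. A minimal scalar example:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "x")

@pandas_udf("double")
def plus_one(x: pd.Series) -> pd.Series:
    # Called once per Arrow batch with a whole Series, which is why
    # pandas UDFs amortize serialization cost vs. row-at-a-time UDFs.
    return (x + 1).astype("float64")

df.withColumn("y", plus_one("x")).show()
```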

5

Solved

Using Python, Parquet, and Spark and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow...
Squaw asked 2/2, 2021 at 21:19
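The durable fix is reinstalling a pyarrow build with snappy support, but as a stopgap the codec can simply be avoided at write time. A sketch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# Sidestep the missing codec: write uncompressed (or e.g. "gzip").
pq.write_table(table, "out.parquet", compression="none")
```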

2

I'm looking for fast ways to store and retrieve numpy array using pyarrow. I'm pretty satisfied with retrieval. It takes less than 1 second to extract columns from my .arrow file that contains 1.00...
Nunatak asked 9/11, 2021 at 16:44
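For a plain ndarray, Arrow's tensor IPC functions store the buffer plus shape/stride metadata, and a memory-mapped read restores it with (near) zero copies. A sketch:

```python
import numpy as np
import pyarrow as pa

arr = np.random.rand(1_000_000)

# Store: wrap the ndarray as an Arrow tensor and write it out.
tensor = pa.Tensor.from_numpy(arr)
with pa.OSFile("array.arrow", "wb") as sink:
    pa.ipc.write_tensor(tensor, sink)

# Retrieve: memory-map the file and rebuild the ndarray without copying.
with pa.memory_map("array.arrow", "r") as source:
    restored = pa.ipc.read_tensor(source).to_numpy()
```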

1

Solved

In the huggingface library, there is a particular dataset format called an Arrow dataset: https://arrow.apache.org/docs/python/dataset.html https://huggingface.co/datasets/wiki_lingua I have to convert...
Lindly asked 8/11, 2021 at 4:20
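The datasets library can build its Arrow-backed Dataset straight from a DataFrame, which covers most conversions of this kind. A sketch:

```python
import pandas as pd
from datasets import Dataset  # huggingface `datasets` package

df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})

# Creates an Arrow-backed huggingface Dataset from the DataFrame.
ds = Dataset.from_pandas(df)
ds.save_to_disk("my_dataset")  # persists the Arrow files for later reuse
```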
