pyarrow Questions
4
Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code
import pandas as pd
import pyarrow as pa
class Player:
def __init__(self, name, a...
Baskerville asked 7/1, 2020 at 22:7
2
Solved
I am trying to pip install Superset
pip install apache-superset
and am getting the error below:
Traceback (most recent call last):
File "c:\users\saurav_nimesh\appdata\local\programs\python\python3...
Coxcomb asked 27/2, 2020 at 16:41
1
Solved
I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have it accessible (read-only) from another.
So I create a pyarrow....
Silverplate asked 8/2, 2023 at 23:34
5
Solved
How do you append/update to a parquet file with pyarrow?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', '...
3
I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarro...
Antacid asked 11/2, 2020 at 3:59
5
Solved
I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this:
for col_name in table2.column_names:
if col_name in m...
Undrape asked 22/1, 2021 at 13:1
3
I have the code below, which queries a database of about 500k rows, and it throws a SIGKILL when it hits rows = cur.fetchall(). I've tried to iterate through the cursor rather than load it all up in...
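A sketch of the usual fix: replace `fetchall()` with `fetchmany()` in a loop so memory stays bounded. It uses an in-memory SQLite table as a stand-in for the question's database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(500)])

cur.execute("SELECT x FROM t")
total = 0
while True:
    rows = cur.fetchmany(100)  # bounded memory, unlike fetchall()
    if not rows:
        break
    total += len(rows)
```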
2
It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing i...
Unclasp asked 11/11, 2020 at 17:48
9
Solved
I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).
First, I can read a single parquet file locally like this:
import pyarrow.parquet as pq
path = 'par...
5
Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.t...
3
I'm trying to overwrite my parquet files in S3 with pyarrow. I've looked through the documentation and haven't found anything.
Here is my code:
from s3fs.core import S3FileSystem
import pyarrow ...
Trotta asked 30/8, 2018 at 11:22
2
Solved
Created an egg and a whl file of pyarrow and put them on S3, to call them in a pythonshell job. Received this message:
Job code:
import pyarrow
raise
Error, same structure for whl:
Traceback (most...
Minimus asked 3/3, 2020 at 17:47
1
Solved
I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidt...
Honeysuckle asked 25/5, 2022 at 12:15
2
Solved
Both are columnar (disk-)storage formats for use in data analysis systems.
Both are integrated within Apache Arrow (pyarrow package for python) and are
designed to correspond with Arrow as a colum...
5
Solved
I'm looking for ways to read data from multiple partitioned directories from s3 using python.
data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number=2...
Draughtsman asked 13/7, 2017 at 13:56
0
I have a parquet dataset stored in my S3 bucket with multiple partition files. I want to read it into my pandas dataframe, but am getting this ArrowInvalid error when I didn't before.
Occasionally,...
2
Solved
I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as such:
import pandas as pd
df = ...
2
Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)?
I have read the documentat...
Ejective asked 28/7, 2021 at 11:11
1
Solved
I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said
Tabl...
Performance asked 10/3, 2022 at 17:58
1
Solved
I have a calculator that iterates over a couple of hundred objects and produces Nx1 arrays for each of those objects, N here being 1-10m depending on configuration. Right now I am summing over these by ...
Doretha asked 24/2, 2022 at 18:50
3
Solved
I have a problem using the pyarrow.orc module in Anaconda on Windows 10.
import pyarrow.orc as orc
throws an exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <modu...
2
Solved
I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work.
I have a Spark Cluster with one Master node with 8 cores and 64GB, along with...
Exclave asked 26/12, 2019 at 20:53
5
Solved
Using Python, Parquet, and Spark and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow...
Squaw asked 2/2, 2021 at 21:19
2
I'm looking for fast ways to store and retrieve numpy array using pyarrow. I'm pretty satisfied with retrieval. It takes less than 1 second to extract columns from my .arrow file that contains 1.00...
1
Solved
In the huggingface library, there is a particular format of datasets called the Arrow dataset
https://arrow.apache.org/docs/python/dataset.html
https://huggingface.co/datasets/wiki_lingua
I have to convert...
© 2022 - 2024 — McMap. All rights reserved.