pyarrow - McMap

2

Handling UUID values in Arrow with Parquet files

I'm new to Python and pandas - please be gentle! I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then at...

python pandas pyarrow

Amparoampelopsis asked 5/9, 2021 at 22:55

7

import pyarrow not working <- error is "ValueError: The pyarrow library is not installed, please install pyarrow to use the to_arrow() function."

I have tried installing it in the terminal and in Jupyter Lab, and it says that it has been successfully installed, but when I run df = query_job.to_dataframe() I keep getting the error " ValueE...

google-bigquery jupyter pyarrow

Spelter asked 13/12, 2020 at 13:3

7

Solved

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) ) # Wait for the load job to c...

python-3.x google-bigquery google-cloud-functions pyarrow

Doughy asked 10/1, 2020 at 13:38

3

Solved

How can one append to parquet files and how does it affect partitioning?

Does parquet allow appending to a parquet file periodically? How does appending relate to partitioning, if at all? For example, if I was able to identify a column that had low cardinality and partitio...

parquet pyarrow fastparquet

Clew asked 9/9, 2021 at 20:23

3

Solved

Categorical variables of Int/Float types are lost when saving to parquet

I have the following dataframe in pandas that is saved as a parquet import pandas as pd df = pd.DataFrame({"a":['1','2','3']}).astype("category") Upon inspection of the only fi...

pandas pyarrow

Intermarry asked 23/5, 2023 at 13:38

5

Add new column to a HuggingFace dataset

In the dataset I have 5000000 rows. I would like to add a column called 'embeddings' to my dataset. dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap ...

python numpy word-embedding pyarrow huggingface-datasets

Confectioner asked 22/11, 2021 at 10:56

7

ModuleNotFoundError: No module named 'pyarrow'

I am trying to run a simple pandas UDF example on my server. From here I have created a fresh environment just for the purpose of running this code. (PySparkEnv) $ conda list # packages in envir...

python-3.x pyspark pyarrow

Carlton asked 13/9, 2018 at 19:12

3

Having trouble to import the library pandas

"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0)" - every time I execute "import pandas as pd" in PyCharm I get this error. Give me solutio...

python pandas import pyarrow

Indecisive asked 29/1, 2024 at 18:0

6

Solved

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files crea...

python parquet dask pyarrow fastparquet

Electrolyte asked 16/7, 2018 at 12:0

6

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install pyarrow on a master instance of my EMR cluster, however I'm always receiving this error. [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting p...

python-3.x cmake pip amazon-emr pyarrow

Willman asked 5/9, 2018 at 9:12

1

Pandas 2.0 pyarrow backend datetime operation

I have the following pandas dataframe object using the pyarrow back end: crsp_m.info(verbose = True) out: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4921811 entries, 0 to 4921810 Data...

python pandas jupyter-notebook pyarrow

Friend asked 19/4, 2023 at 17:50

3

Memory leak from pyarrow?

For the parsing of a larger file, I need to write in a loop to a large number of parquet files successively. However, it appears that the memory consumed by this task increases over each iteration,...

python pandas parquet pyarrow

Buddie asked 26/10, 2018 at 22:1

7

Solved

How to set/get Pandas dataframes into Redis using pyarrow

Using dd = {'ID': ['H576','H577','H578','H600', 'H700'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE']} df = pd.DataFrame(dd) Before pandas 0.25, the code below worked. set: redisConn.se...

python pandas redis pyarrow py-redis

Orelee asked 16/9, 2019 at 2:54

4

Error Loading DataFrame to BigQuery Table (pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int)

I have a CSV stored in GCS that I want to load into a BigQuery table. But I need to do some preprocessing first, so I load it into a DataFrame and later load it into the BigQuery table import pandas as pd import j...

python pandas numpy google-bigquery pyarrow

Lobectomy asked 21/2, 2022 at 8:47

0

Specifying logical types (in particular, UUID) when writing parquet files from pyarrow

The pyarrow documentation builds a custom UUID type many times like this: import pyarrow as pa class UuidType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.binary(...

parquet pyarrow apache-arrow

Cigarillo asked 5/9, 2023 at 22:16

3

Solved

Connect python-polars to SQL server (no support currently)

How can I directly connect MS SQL Server to polars? The documentation does not list any supported connections but recommends the use of pandas. Update: SQL Server Authentication works per answer, b...

sql-server sqlalchemy pyarrow python-polars

Dicotyledon asked 31/12, 2022 at 4:48

4

Solved

Using aws profile with fs S3Filesystem

Trying to use a specific AWS profile when using Apache PyArrow. The documentation shows no option to pass a profile name when instantiating S3FileSystem using pyarrow fs [https://arrow.apache.org/do...

amazon-web-services amazon-s3 parquet pyarrow

Ketcham asked 22/6, 2022 at 16:50

0

pyarrow memory consumption difference between Dataset.to_batches and ParquetFile.iter_batches

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method compared to ParquetFile.iter_batches. Using pyarrow.dataset >>> ...

parquet pyarrow apache-arrow

Ting asked 4/8, 2023 at 1:11

1

Solved

What is actually meant when referring to parquet row-group size?

I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row g...

parquet pyarrow apache-arrow

Romanfleuve asked 27/7, 2023 at 17:6

3

Solved

read a parquet files from HDFS using PyArrow

I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a filepat...

hdfs parquet pyarrow

Liddy asked 22/11, 2017 at 20:10

1

Solved

How to use categorical data type with pyarrow dtypes?

I'm working with the arrow dtypes with pandas, and my dataframe has a variable that should be categorical, but I can't figure out how to transform it into pyarrow data type for categorical data (di...

python pandas types pyarrow dtype

Donnelly asked 10/5, 2023 at 19:34

4

How to read feather/arrow file natively?

I have feather format file sales.feather that I am using for exchanging data between python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_fra...

apache-spark pyspark pyarrow apache-arrow feather

Constrained asked 1/12, 2018 at 9:49

1

Solved

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Let's say I have the following dataframe: Code Price AA1 10 AA1 20 BB2 30 And I want to perform the following operation on it: df.groupby("code").aggregate({ "price&...

python pandas group-by pyarrow apache-arrow

Tnt asked 3/4, 2023 at 9:6

2

Datatypes issue when convert parquet data to pandas dataframe

I have a problem with filetypes when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() rea...

pandas parquet pyarrow apache-arrow

Urbannal asked 25/2, 2019 at 12:45

3

Solved

Occur "Could NOT find Arrow" error when using pip_pypy3 to install pyarrow

I am trying to use pypy3 to install pyarrow, but some errors occur. Basic information is below: macOS 10.15.7 Xcode 12.3 python version 3.7.9 pypy3 version 7.3.3 pyarrow version 0.17.1 cmd is 'pip_...

python cmake pypy pyarrow

Saracen asked 10/1, 2021 at 13:43
