pyarrow Questions

2

I'm new to Python and pandas - please be gentle! I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then at...
Amparoampelopsis asked 5/9, 2021 at 22:55

7

I have tried installing it in the terminal and in Jupyter Lab, and it says that it has been successfully installed, but when I run df = query_job.to_dataframe() I keep getting the error " ValueE...
Spelter asked 13/12, 2020 at 13:3

7

Solved

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) ) # Wait for the load job to c...

3

Solved

Does Parquet allow appending to a Parquet file periodically? How does appending relate to partitioning, if at all? For example, if I were able to identify a column that had low cardinality and partitio...
Clew asked 9/9, 2021 at 20:23

3

Solved

I have the following dataframe in pandas that is saved as a parquet import pandas as pd df = pd.DataFrame({"a":['1','2','3']}).astype("category") Upon inspection of the only fi...
Intermarry asked 23/5, 2023 at 13:38

5

In the dataset I have 5000000 rows. I would like to add a column called 'embeddings' to my dataset. dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap ...
Confectioner asked 22/11, 2021 at 10:56

7

I am trying to run a simple pandas UDF example on my server. From here I have created a fresh environment just for the purpose of running this code. (PySparkEnv) $ conda list # packages in envir...
Carlton asked 13/9, 2018 at 19:12

3

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0) - every time I execute "import pandas as pd" in PyCharm I get this error. Give me solutio...
Indecisive asked 29/1 at 18:0

6

Solved

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files crea...
Electrolyte asked 16/7, 2018 at 12:0

6

I'm trying to install pyarrow on a master instance of my EMR cluster, but I always receive this error. [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting p...
Willman asked 5/9, 2018 at 9:12

1

I have the following pandas dataframe object using the pyarrow back end: crsp_m.info(verbose = True) out: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4921811 entries, 0 to 4921810 Data...
Friend asked 19/4, 2023 at 17:50

3

For the parsing of a larger file, I need to write in a loop to a large number of parquet files successively. However, it appears that the memory consumed by this task increases over each iteration,...
Buddie asked 26/10, 2018 at 22:1

7

Solved

Using dd = {'ID': ['H576','H577','H578','H600', 'H700'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE']} df = pd.DataFrame(dd) Before pandas 0.25, the code below worked. set: redisConn.se...
Orelee asked 16/9, 2019 at 2:54

4

I have a CSV stored in GCS that I want to load into a BigQuery table. But I need to do some preprocessing first, so I load it into a DataFrame and later load that into the BigQuery table import pandas as pd import j...
Lobectomy asked 21/2, 2022 at 8:47

0

The pyarrow documentation builds a custom UUID type many times like this: import pyarrow as pa class UuidType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.binary(...
Cigarillo asked 5/9, 2023 at 22:16

3

Solved

How can I directly connect MS SQL Server to polars? The documentation does not list any supported connections but recommends the use of pandas. Update: SQL Server Authentication works per answer, b...
Dicotyledon asked 31/12, 2022 at 4:48

4

Solved

Trying to use a specific AWS profile when using Apache Pyarrow. The documentation shows no option to pass a profile name when instantiating S3FileSystem using pyarrow fs [https://arrow.apache.org/do...
Ketcham asked 22/6, 2022 at 16:50

0

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method compared to ParquetFile.iter_batches. Using pyarrow.dataset >>> ...
Ting asked 4/8, 2023 at 1:11

1

Solved

I am starting to work with the Parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row g...
Romanfleuve asked 27/7, 2023 at 17:6

3

Solved

I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect() I also know I can read a parquet file using pyarrow.parquet's read_table() However, read_table() accepts a filepat...
Liddy asked 22/11, 2017 at 20:10

1

Solved

I'm working with the arrow dtypes with pandas, and my dataframe has a variable that should be categorical, but I can't figure out how to transform it into pyarrow data type for categorical data (di...
Donnelly asked 10/5, 2023 at 19:34

4

I have a Feather-format file, sales.feather, that I am using for exchanging data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_fra...
Constrained asked 1/12, 2018 at 9:49

1

Solved

Let's say I have the following dataframe: Code Price AA1 10 AA1 20 BB2 30 And I want to perform the following operation on it: df.groupby("code").aggregate({ "price"...
Tnt asked 3/4, 2023 at 9:6

2

I have a problem with filetypes when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() rea...
Urbannal asked 25/2, 2019 at 12:45

3

Solved

I am trying to use pypy3 to install pyarrow, but some errors occur. Basic information is below: macOS 10.15.7 Xcode 12.3 python version 3.7.9 pypy3 version 7.3.3 pyarrow version 0.17.1 cmd is 'pip_...
Saracen asked 10/1, 2021 at 13:43
