pyarrow Questions

0

I have some Spark (Scala) DataFrames/tables with timestamps that come from our DWH and sometimes use high-watermark values. I want to work with this data in Python with pandas so ...
Augmenter asked 9/10, 2020 at 8:55

1

Solved

From searching the issues in the Feather GitHub repository, as well as Stack Overflow questions such as What are the differences between feather and parquet?, I understand that the Feather format...
Marciano asked 27/9, 2020 at 14:42

2

I am trying to enable Apache Arrow for conversion to Pandas. I am using: pyspark 2.4.4, pyarrow 0.15.0, pandas 0.25.1, numpy 1.17.2. This is the example code: spark.conf.set("spark.sql.execution.arro...
Orthopedics asked 7/10, 2019 at 11:58

1

Solved

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing. Using mmap, process B...
Yvetteyvon asked 18/9, 2020 at 22:56
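A minimal single-file sketch of the pattern described above, with the two phases standing in for process A and process B (real cross-process use would open the same file from two separate programs; the filename and the 8-byte payload are illustrative assumptions):

```python
import mmap
import struct

path = "shared.bin"

# Pre-size the backing file: mmap cannot extend an empty file.
with open(path, "wb") as f:
    f.write(b"\x00" * 8)

# "Process A": map the file and write through the mapping.
with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 8)
    m[:8] = struct.pack("<q", 42)
    m.flush()   # push the written page back to the file
    m.close()

# "Process B": map the same file read-only and see A's bytes.
with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 8, access=mmap.ACCESS_READ)
    value = struct.unpack("<q", m[:8])[0]
    m.close()
```

Synchronizing *when* B may read (i.e. knowing A has finished) still needs a separate signal, e.g. a lock file or a semaphore.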

2

Solved

I have a very wide data frame (20,000 columns) that is mainly made up of float64 columns in Pandas. I want to cast these columns to float32 and write to Parquet format. I am doing this because the ...
Yoko asked 17/10, 2018 at 8:42

2

I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\p...
Desalvo asked 13/3, 2019 at 16:58

2

Solved

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet...
Busman asked 31/8, 2018 at 21:15

1

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.e...
Acrylonitrile asked 31/10, 2018 at 16:11

1

Solved

Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file? If I save lists or lists of dictionaries as a string, I normally have t...
Try asked 24/8, 2020 at 1:44

1

Solved

I have a pyarrow table named final_table of shape (6132, 7). I want to add a column to this table: list_ = ['IT'] * 6132 final_table.append_column('COUNTRY_ID', list_) but I am getting the following error A...
Aggrade asked 11/8, 2020 at 3:44

1

Solved

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schem...
Madigan asked 6/7, 2020 at 6:41

1

Solved

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow a...
Ratchet asked 15/6, 2018 at 13:21

2

Solved

Does anyone have experience using pandas UDFs on a local pyspark session running on Windows? I've used them on linux with good results, but I've been unsuccessful on my Windows machine. Environmen...
Narcose asked 19/2, 2020 at 18:05

4

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is: https://github.com/lambci/docker-lambda as a container to mock the Amazon environm...
Sixteenth asked 26/12, 2017 at 22:22

1

Solved

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression. I have three columns with string values, one column called "key" with int values and on...
Leonardo asked 22/5, 2020 at 6:09

1

Solved

When I save a parquet file in R or Python (using pyarrow), I get an arrow schema string saved in the metadata. How do I read the metadata? Is it Flatbuffer-encoded data? Where is the definition for...
Haematozoon asked 10/5, 2020 at 4:26

4

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore t...
Hedwig asked 13/4, 2020 at 20:24

1

Solved

I got the following error when I upload numeric data (int64 or float64) from a Pandas dataframe to a "Numeric" Google BigQuery Data Type: pyarrow.lib.ArrowInvalid: Got bytestring of leng...
Gujarati asked 25/4, 2020 at 6:13

3

Solved

I am using Pyarrow library for optimal storage of Pandas DataFrame. I need to process pyarrow Table row by row as fast as possible without converting it to pandas DataFrame (it won't fit in memory)...
Monochord asked 5/11, 2018 at 15:37

1

Solved

I am trying Pandas UDF and facing the IllegalArgumentException. I also tried replicating examples from PySpark Documentation GroupedData to check but still getting the error. Following is the envi...
Ophidian asked 14/4, 2020 at 6:41

2

I am running Python 3.7.2 and using Miniconda3 to create a new environment named test-env. I have installed the pyarrow package from the default channel into this environment; however, when I try a...
Bioenergetics asked 7/3, 2019 at 19:27

2

I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a Split-Apply-Combine strateg...
Limonene asked 27/5, 2019 at 15:45

2

I started playing around with spark locally and found this weird issue 1) pip install pyspark==2.3.1 2) pyspark> import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUD...
Riesman asked 6/8, 2018 at 18:33

1

Solved

I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level optimised mathematical package such as numpy or arrow. I am excluding pandas for performa...
Cesya asked 26/1, 2020 at 12:45
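A minimal numpy-only sketch of a secondary index: the index is just the permutation that sorts the column, after which equality lookups become two binary searches. Names and data are illustrative assumptions:

```python
import numpy as np

# The indexed column; row ids are implicit array positions 0..3.
values = np.array([30, 10, 20, 10])

# Build the index once: the stable sort permutation plus the sorted view.
order = np.argsort(values, kind="stable")
sorted_vals = values[order]

def lookup(key):
    """Return the row ids whose value equals `key`, via binary search."""
    lo = np.searchsorted(sorted_vals, key, side="left")
    hi = np.searchsorted(sorted_vals, key, side="right")
    return order[lo:hi]
```

Construction is O(n log n) and each lookup is O(log n), with no per-row Python objects, which is the main overhead this approach avoids relative to a dict-based index.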

2

Is it possible to read and write parquet files from one folder to another folder in S3, without converting to pandas, using pyarrow? Here is my code: import pyarrow.parquet as pq import pyarrow a...
Mortician asked 27/3, 2018 at 12:42

© 2022 - 2024 — McMap. All rights reserved.