pyarrow Questions

0

I have some Spark (Scala) DataFrames/tables with timestamps that come from our DWH and sometimes use high-watermark values. I want to work with this data in Python with pandas so ...
Augmenter asked 9/10, 2020 at 8:55

1

Solved

From searching the issues in the Feather GitHub repository, as well as Stack Overflow questions such as What are the differences between feather and parquet?, I understand that the Feather format...
Marciano asked 27/9, 2020 at 14:42

2

I am trying to enable Apache Arrow for conversion to Pandas. I am using: pyspark 2.4.4, pyarrow 0.15.0, pandas 0.25.1, numpy 1.17.2. This is the example code: spark.conf.set("spark.sql.execution.arro...
Orthopedics asked 7/10, 2019 at 11:58

1

Solved

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing. Using mmap, process B...
Yvetteyvon asked 18/9, 2020 at 22:56
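A minimal single-file sketch of the pattern described above, with the two phases standing in for process A and process B (real cross-process use would open the same file from two separate programs; the filename and the 8-byte payload are illustrative assumptions):

```python
import mmap
import struct

path = "shared.bin"

# Pre-size the backing file: mmap cannot extend an empty file.
with open(path, "wb") as f:
    f.write(b"\x00" * 8)

# "Process A": map the file and write through the mapping.
with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 8)
    m[:8] = struct.pack("<q", 42)
    m.flush()   # push the written page back to the file
    m.close()

# "Process B": map the same file read-only and see A's bytes.
with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 8, access=mmap.ACCESS_READ)
    value = struct.unpack("<q", m[:8])[0]
    m.close()
```

Synchronizing *when* B may read (i.e. knowing A has finished) still needs a separate signal, e.g. a lock file or a semaphore.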

2

Solved

I have a very wide data frame (20,000 columns) that is mainly made up of float64 columns in Pandas. I want to cast these columns to float32 and write to Parquet format. I am doing this because the ...
Yoko asked 17/10, 2018 at 8:42

2

I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\p...
Desalvo asked 13/3, 2019 at 16:58

2

Solved

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet...
Busman asked 31/8, 2018 at 21:15

1

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.e...
Acrylonitrile asked 31/10, 2018 at 16:11

1

Solved

Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file? If I save lists or lists of dictionaries as a string, I normally have t...
Try asked 24/8, 2020 at 1:44

1

Solved

I have a pyarrow table named final_table of shape (6132, 7). I want to add a column to this table: list_ = ['IT'] * 6132 final_table.append_column('COUNTRY_ID', list_) but I am getting the following error A...
Aggrade asked 11/8, 2020 at 3:44

1

Solved

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schem...
Madigan asked 6/7, 2020 at 6:41

1

Solved

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow a...
Ratchet asked 15/6, 2018 at 13:21

2

Solved

Does anyone have experience using pandas UDFs on a local pyspark session running on Windows? I've used them on linux with good results, but I've been unsuccessful on my Windows machine. Environmen...
Narcose asked 19/2, 2020 at 18:05

4

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is: https://github.com/lambci/docker-lambda as a container to mock the Amazon environm...
Sixteenth asked 26/12, 2017 at 22:22

1

Solved

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression. I have three columns with string values, one column called "key" with int values and on...
Leonardo asked 22/5, 2020 at 6:09

1

Solved

When I save a parquet file in R or Python (using pyarrow), I get an arrow schema string saved in the metadata. How do I read the metadata? Is it Flatbuffer-encoded data? Where is the definition for...
Haematozoon asked 10/5, 2020 at 4:26

4

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore t...
Hedwig asked 13/4, 2020 at 20:24

1

Solved

I got the following error when I upload numeric data (int64 or float64) from a Pandas dataframe to a "Numeric" Google BigQuery Data Type: pyarrow.lib.ArrowInvalid: Got bytestring of leng...
Gujarati asked 25/4, 2020 at 6:13

3

Solved

I am using Pyarrow library for optimal storage of Pandas DataFrame. I need to process pyarrow Table row by row as fast as possible without converting it to pandas DataFrame (it won't fit in memory)...
Monochord asked 5/11, 2018 at 15:37

1

Solved

I am trying Pandas UDF and facing the IllegalArgumentException. I also tried replicating examples from PySpark Documentation GroupedData to check but still getting the error. Following is the envi...
Ophidian asked 14/4, 2020 at 6:41

2

I am running Python 3.7.2 and using Miniconda3 to create a new environment named test-env. I have installed the pyarrow package from the default channel into this environment; however, when I try a...
Bioenergetics asked 7/3, 2019 at 19:27

2

I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a Split-Apply-Combine strateg...
Limonene asked 27/5, 2019 at 15:45

2

I started playing around with spark locally and found this weird issue 1) pip install pyspark==2.3.1 2) pyspark> import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUD...
Riesman asked 6/8, 2018 at 18:33

1

Solved

I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level optimised mathematical package such as numpy or arrow. I am excluding pandas for performa...
Cesya asked 26/1, 2020 at 12:45
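A minimal numpy-only sketch of a secondary index: the index is just the permutation that sorts the column, after which equality lookups become two binary searches. Names and data are illustrative assumptions:

```python
import numpy as np

# The indexed column; row ids are implicit array positions 0..3.
values = np.array([30, 10, 20, 10])

# Build the index once: the stable sort permutation plus the sorted view.
order = np.argsort(values, kind="stable")
sorted_vals = values[order]

def lookup(key):
    """Return the row ids whose value equals `key`, via binary search."""
    lo = np.searchsorted(sorted_vals, key, side="left")
    hi = np.searchsorted(sorted_vals, key, side="right")
    return order[lo:hi]
```

Construction is O(n log n) and each lookup is O(log n), with no per-row Python objects, which is the main overhead this approach avoids relative to a dict-based index.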

2

Is it possible to read and write parquet files from one folder to another folder in S3, without converting to pandas, using pyarrow? Here is my code: import pyarrow.parquet as pq import pyarrow a...
Mortician asked 27/3, 2018 at 12:42

© 2022 - 2024 — McMap. All rights reserved.