apache-arrow Questions

1

Solved

What I am trying to do: I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machin...
Madra asked 28/7, 2021 at 5:54

1

Solved

What is the difference between Arrow IPC and Feather? The official Arrow documentation has this to say about Feather: Version 2 (V2), the default version, which is exactly represented as the Arrow...
Yurik asked 9/6, 2021 at 19:31

1

Solved

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into...
Maudiemaudlin asked 11/5, 2021 at 20:28

2

Solved

I have been using the latest R arrow package (arrow_2.0.0.20201106) that supports reading and writing from AWS S3 directly (which is awesome). I don't seem to have issues when I write and read my o...
Chilli asked 20/11, 2020 at 22:02

1

Solved

I'm trying to implement a vector UDF in C# Spark. I have created a .NET Spark environment by following Spark .NET. The vector UDF (both Apache Arrow and Microsoft.Data.Analysis) worked for me for IntegerTy...

1

Solved

As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file. I just want to write the file again with the new rows. import pyarrow as pa source =...
Vitovitoria asked 12/3, 2021 at 7:46

2

I'm doing very basic experiments with Apache Arrow, mostly with regard to passing some data between Java, C++, and Python using Arrow's IPC format (to file), Parquet format (to file) and IPC format (str...
Eisk asked 16/7, 2020 at 15:33

1

Solved

Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are, and which library is good for which situations.
Stier asked 7/3, 2021 at 20:40

1

Solved

I plan to: join, group by, and filter data using pyarrow (new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas. Seems like pyarrow ha...
Chalmer asked 1/1, 2021 at 17:15

2

I am looking for useful documentation or examples for the Apache Arrow API. Can anyone point to some useful resources? I was only able to find some blogs and the Java documentation (which doesn'...
Benge asked 21/6, 2017 at 11:27

0

I see that parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation: Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICT...
Carnal asked 29/10, 2020 at 23:18

1

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): impo...
Satirical asked 14/9, 2020 at 20:11

1

Solved

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region when process A has finished writing. Using mmap, process B...
Yvetteyvon asked 18/9, 2020 at 22:56

1

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.e...
Acrylonitrile asked 31/10, 2018 at 16:11

4

Solved

I'm running a job in pyspark where at one point I use a grouped aggregate Pandas UDF. This results in the following (here abbreviated) error: org.apache.arrow.vector.util.OversizedAllocationExceptio...
Dominicadominical asked 7/10, 2019 at 12:29

1

I have some Parquet files that I've written in Python using PyArrow (Apache Arrow): pyarrow.parquet.write_table(table, "example.parquet") Now I want to read these files (and preferably g...
Weatherly asked 27/5, 2020 at 15:42

1

Solved

When I save a parquet file in R and Python (using pyarrow) I get an Arrow schema string saved in the metadata. How do I read the metadata? Is it Flatbuffer-encoded data? Where is the definition for...
Haematozoon asked 10/5, 2020 at 4:26

1

Solved

I am pretty new to Apache Arrow, so this question may be ignorant. Apache Arrow provides the capability to store data structures like primitive types/struct/array in a standardised memory format; I wo...
Showalter asked 16/12, 2019 at 2:09

1

Solved

Using the IO tools in pandas it is possible to convert a DataFrame to an in-memory feather buffer: import pandas as pd from io import BytesIO df = pd.DataFrame({'a': [1,2], 'b': [3.0,4.0]}) b...
Whitebeam asked 8/6, 2018 at 13:31

1

I have an Apache arrow array that is created by reading a file. std::shared_ptr<arrow::Array> array; PARQUET_THROW_NOT_OK(reader->ReadColumn(0, &array)); Is there a way to convert i...
Tripper asked 17/11, 2018 at 0:21

2

Solved

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in...
Lion asked 14/9, 2019 at 20:37

1

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (th...
Crassus asked 17/9, 2019 at 0:54

2

Solved

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everythin...
Lengthwise asked 1/8, 2019 at 18:28

1

Solved

I'm looking into a way to speed up my memory intensive frontend vis app. I saw some people recommend Apache Arrow, while I'm looking into it, I'm confused about the difference between Parquet and A...
Narvaez asked 6/6, 2019 at 7:25

4

Solved

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas...
Rusty asked 20/11, 2017 at 13:19

© 2022 - 2024 — McMap. All rights reserved.