apache-arrow Questions
1
Solved
What I am trying to do
I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have many columns and a high memory footprint (enough to crash the machin...
Madra asked 28/7, 2021 at 5:54
1
Solved
What is the difference between Arrow IPC and Feather?
The official Arrow documentation has this to say about Feather:
Version 2 (V2), the default version, which is exactly represented as
the Arrow...
Yurik asked 9/6, 2021 at 19:31
1
Solved
What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into...
Maudiemaudlin asked 11/5, 2021 at 20:28
2
Solved
I have been using the latest R arrow package (arrow_2.0.0.20201106) that supports reading and writing from AWS S3 directly (which is awesome).
I don't seem to have issues when I write and read my o...
Chilli asked 20/11, 2020 at 22:2
1
Solved
I'm trying to implement a vector UDF in C# Spark.
I have created a .NET Spark environment by following Spark .NET.
Vector UDFs (both Apache Arrow and Microsoft.Data.Analysis) worked for me for IntegerTy...
Cerberus asked 25/3, 2021 at 7:38
1
Solved
As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file.
I just want to write the file again with the new rows.
import pyarrow as pa
source =...
Vitovitoria asked 12/3, 2021 at 7:46
2
I'm doing very basic experiments with Apache Arrow, mostly passing some data between Java, C++, and Python using Arrow's IPC format (to file), Parquet format (to file), and IPC format (str...
Eisk asked 16/7, 2020 at 15:33
1
Solved
Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are, and which library is suited to which situations.
Stier asked 7/3, 2021 at 20:40
1
Solved
I plan to:
join
group by
filter
data using pyarrow (new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas.
Seems like pyarrow ha...
Chalmer asked 1/1, 2021 at 17:15
2
I am looking for useful documentation or examples for the Apache Arrow API. Can anyone point to some useful resources? I was only able to find some blogs and Java documentation (which doesn'...
Benge asked 21/6, 2017 at 11:27
0
I see that Parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation:
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICT...
Carnal asked 29/10, 2020 at 23:18
1
I'm trying to write a large Parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC):
impo...
Satirical asked 14/9, 2020 at 20:11
1
Solved
I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing.
Using mmap, process B...
Yvetteyvon asked 18/9, 2020 at 22:56
1
I'm trying to connect to HDFS through Pyarrow, but it does not work because libhdfs library cannot be loaded.
libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR.
print(os.e...
Acrylonitrile asked 31/10, 2018 at 16:11
4
Solved
I'm running a job in PySpark where at one point I use a grouped aggregate Pandas UDF. This results in the following (here abbreviated) error:
org.apache.arrow.vector.util.OversizedAllocationExceptio...
Dominicadominical asked 7/10, 2019 at 12:29
1
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably g...
Weatherly asked 27/5, 2020 at 15:42
1
Solved
When I save a Parquet file in R and Python (using pyarrow), I get an Arrow schema string saved in the metadata.
How do I read the metadata? Is it Flatbuffer-encoded data? Where is the definition for...
Haematozoon asked 10/5, 2020 at 4:26
1
Solved
I am pretty new to Apache Arrow, so this question may be naive. Apache Arrow provides the capability to store data structures like primitive types/structs/arrays in a standardised memory format; I wo...
Showalter asked 16/12, 2019 at 2:9
1
Solved
Using the IO tools in pandas it is possible to convert a DataFrame to an in-memory feather buffer:
import pandas as pd
from io import BytesIO
df = pd.DataFrame({'a': [1,2], 'b': [3.0,4.0]})
b...
Whitebeam asked 8/6, 2018 at 13:31
1
I have an Apache Arrow array that is created by reading a file.
std::shared_ptr<arrow::Array> array;
PARQUET_THROW_NOT_OK(reader->ReadColumn(0, &array));
Is there a way to convert i...
Tripper asked 17/11, 2018 at 0:21
2
Solved
I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in...
Lion asked 14/9, 2019 at 20:37
1
I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (th...
Crassus asked 17/9, 2019 at 0:54
2
Solved
I am running into this problem with the Apache Arrow Spark integration.
Using AWS EMR with Spark 2.4.3.
I tested this problem on both a local single-machine Spark instance and a Cloudera cluster, and everythin...
Lengthwise asked 1/8, 2019 at 18:28
1
Solved
I'm looking into ways to speed up my memory-intensive front-end visualization app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and A...
Narvaez asked 6/6, 2019 at 7:25
4
Solved
I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas...
Rusty asked 20/11, 2017 at 13:19
© 2022 - 2024 — McMap. All rights reserved.