apache-arrow Questions

2

Is there a language-agnostic way of representing a Parquet or Arrow schema, similar to Avro? For example, an Avro schema might look like this: { "type": "record", "...
Osmunda asked 4/1 at 23:48
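Arrow has no official text schema language like Avro's JSON, but every implementation can parse the IPC-serialized schema message, which makes it a practical language-agnostic interchange. A minimal pyarrow sketch (the field names are invented for illustration):

```python
import pyarrow as pa

# A schema roughly equivalent to an Avro record definition
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("name", pa.string()),
])

# Serialize to the Arrow IPC schema message: a binary, but
# language-agnostic, representation of the schema
buf = schema.serialize()

# Any Arrow implementation can reconstruct it from the bytes
restored = pa.ipc.read_schema(pa.BufferReader(buf))
assert restored.equals(schema)
```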

0

The pyarrow documentation repeatedly builds a custom UUID type like this: import pyarrow as pa class UuidType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.binary(...
Cigarillo asked 5/9, 2023 at 22:16
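For reference, the docs-style pattern the question quotes looks roughly like the sketch below; note that newer pyarrow releases deprecate PyExtensionType in favor of subclassing pa.ExtensionType, so treat this as the older idiom:

```python
import uuid
import pyarrow as pa

class UuidType(pa.PyExtensionType):
    """A 16-byte UUID stored as fixed-size binary."""

    def __init__(self):
        # Storage type: 16 raw bytes per value
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        # Required so the type survives pickling and IPC round-trips
        return UuidType, ()

storage = pa.array([uuid.uuid4().bytes for _ in range(3)], pa.binary(16))
arr = pa.ExtensionArray.from_storage(UuidType(), storage)
```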

0

What is the best way to pass data in the Apache Arrow format from Node.js to Rust? Storing the data in each language is easy enough, but it's the memory sharing that is giving me trouble. I'm ...
Pamulapan asked 12/8, 2023 at 2:36
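The usual answer for cross-language hand-off is the Arrow IPC stream format rather than shared memory: the bytes are self-describing and can be parsed by Arrow JS on the Node side and arrow-rs on the Rust side. A minimal sketch of the round trip (shown in Python for brevity; the JS and Rust readers consume the same bytes):

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Serialize to the Arrow IPC stream format
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
ipc_bytes = sink.getvalue()

# Deserialize on the other side (here Python; arrow-rs and
# Arrow JS expose equivalent stream readers)
roundtrip = pa.ipc.open_stream(ipc_bytes).read_all()
assert roundtrip.equals(table)
```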

0

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method and ParquetFile.iter_batches. Using pyarrow.dataset >>> ...
Ting asked 4/8, 2023 at 1:11
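Both APIs accept an explicit batch_size, which is the first thing to pin down when comparing their memory behaviour; the dataset scanner also reads ahead across files, which can account for extra resident memory. A sketch with a made-up directory layout:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Dataset API: scan with an explicit batch size
dataset = ds.dataset("data/", format="parquet")
for batch in dataset.to_batches(batch_size=64_000):
    pass  # process one RecordBatch at a time

# ParquetFile API: the same batch size, for a fair comparison
pf = pq.ParquetFile("data/part-0.parquet")
for batch in pf.iter_batches(batch_size=64_000):
    pass
```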

1

Solved

I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512 MB to 1 GB (here). Several online sources (e.g. this one) suggest that the default row g...
Romanfleuve asked 27/7, 2023 at 17:06
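One detail worth knowing when acting on that guidance: pyarrow's row_group_size is counted in rows, not bytes, so a byte target has to be translated through the average row width. A rough sketch, assuming an in-memory table named table (on-disk size will be smaller after encoding and compression):

```python
import pyarrow.parquet as pq

# Translate a ~512 MB byte target into a row count
target_bytes = 512 * 1024 * 1024
bytes_per_row = table.nbytes / table.num_rows  # in-memory estimate
rows_per_group = int(target_bytes / bytes_per_row)

pq.write_table(table, "out.parquet", row_group_size=rows_per_group)
```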

4

I have a Feather-format file, sales.feather, that I am using to exchange data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_fra...
Constrained asked 1/12, 2018 at 9:49
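The Python side of that exchange is symmetric; a minimal sketch with invented sample data:

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"sales": [100, 200], "region": ["N", "S"]})

# Write a Feather (V2) file that R's arrow::read_feather can load
feather.write_feather(df, "sales.feather")

# And read one back into pandas
df2 = feather.read_feather("sales.feather")
```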

1

Solved

Let's say I have the following dataframe, with columns Code and Price and rows AA1 10, AA1 20, BB2 30, and I want to perform the following operation on it: df.groupby("code").aggregate({ "price"...
Tnt asked 3/4, 2023 at 9:06
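The snippet in the question is pandas-style, but newer pyarrow (7.0+) can do the grouping natively via Table.group_by. The exact aggregation the question wants is cut off, so the sketch shows two common choices:

```python
import pyarrow as pa

table = pa.table({"code": ["AA1", "AA1", "BB2"],
                  "price": [10, 20, 30]})

# Collect prices into lists and total them per code
result = table.group_by("code").aggregate([
    ("price", "list"),
    ("price", "sum"),
])
```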

2

I have a problem with file types when converting a parquet file to a dataframe. I do: bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() rea...
Urbannal asked 25/2, 2019 at 12:45
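A minimal version of that read path, for orientation (bucket path as in the question; dtype surprises usually show up in the final to_pandas() step):

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read all parquet files under the prefix into one Arrow table
table = pq.ParquetDataset("s3://some_bucket/test/usages",
                          filesystem=s3).read()
df = table.to_pandas()
```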

6

Solved

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: parquet files were created with a Spark pipeline and saved on S3, to be read with RStudio/RShiny with one column as in...
Bugbear asked 17/10, 2019 at 20:02
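R's arrow::open_dataset wraps the same C++ dataset layer as pyarrow, so the shape of the answer is the same in both languages. A pyarrow sketch with an invented bucket path; "hive" partitioning turns Spark's col=value/ directory names back into columns:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/path/", format="parquet",
                     partitioning="hive")
table = dataset.to_table()
```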

1

Solved

I'm not sure where to begin, so I'm looking for some guidance. I'm looking for a way to create some arrays/tables in one process and have them accessible (read-only) from another. So I create a pyarrow....
Silverplate asked 8/2, 2023 at 23:34
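One well-trodden route for read-only sharing is to write the table as an Arrow IPC file and memory-map it from the other process: mapped reads are zero-copy, so multiple processes share the same physical pages. A sketch with an invented file name:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# Process 1: persist the table as an Arrow IPC file
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Process 2: memory-map it; read_all() is zero-copy over the map
with pa.memory_map("shared.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```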

3

How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.
Nemertean asked 1/11, 2020 at 7:40
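A ".arrow" file in this context is the Arrow IPC file format, which is what Arrow JS consumers such as Arquero read. A minimal sketch with invented data:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Write the Arrow IPC file format under a .arrow extension
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```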

3

I have data that is stored in-memory in a client's browser. For example, let's say the dataset has columns "name" (string), "age" (int32), and "isAdult" (bool), with rows: "Tom", 29, 1; "Tom", 14, 0; "Dina", 20...
Marquittamarr asked 15/6, 2019 at 0:36

2

Solved

I am attempting to use the arrow package's (relatively recently implemented) Dataset API to read a directory of files into memory and leverage the C++ back-end to filter rows and columns. I would...
Precinct asked 28/4, 2021 at 15:26
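The R package drives the same scanner as pyarrow, so the mechanics are easiest to see in a short Python sketch (directory and column names invented): projection and filtering are pushed into the C++ scan, so non-matching data never materializes in the session.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data_dir/", format="parquet")

# Column projection + row filter, both evaluated in the C++ back-end
table = dataset.to_table(
    columns=["id", "value"],
    filter=ds.field("value") > 100,
)
```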

2

Solved

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI but I recently came acro...
Empathize asked 19/3, 2021 at 15:17

0

I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be: library(tidyverse) df = mtcars df |> mutate(row_id = 1:nrow(df)) # any ...
Tease asked 27/6, 2022 at 7:21
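One out-of-core pattern for this is to stream the dataset batch by batch and assign a running index as you re-write it, so only one batch is ever in memory. A pyarrow sketch (paths invented):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset("big_data/", format="parquet")
schema = dataset.schema.append(pa.field("row_id", pa.int64()))

offset = 0
with pq.ParquetWriter("indexed.parquet", schema) as writer:
    for batch in dataset.to_batches():
        # Running index continues across batches
        ids = pa.array(range(offset, offset + batch.num_rows),
                       type=pa.int64())
        t = pa.Table.from_batches([batch]).append_column("row_id", ids)
        writer.write_table(t)
        offset += batch.num_rows
```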

2

Solved

I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error: Error in vecseq(f__...
Sporule asked 7/4, 2020 at 23:59

2

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentat...
Ejective asked 28/7, 2021 at 11:11
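With the pyarrow.fs interface there is no single copyFromLocal call, but streaming between an input and an output stream does the job; there is also a pyarrow.fs.copy_files helper for whole files or trees. A sketch with an invented namenode and paths:

```python
from pyarrow import fs

local = fs.LocalFileSystem()
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Manual copyFromLocal: stream bytes across the two filesystems
with local.open_input_stream("/tmp/data.csv") as src, \
     hdfs.open_output_stream("/data/data.csv") as dst:
    dst.write(src.read())  # for huge files, copy in chunks instead
```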

1

Solved

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said that Tabl...
Performance asked 10/3, 2022 at 17:58
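The short version of how those two statements coexist: mutation-style operations return a new Table that shares the old one's buffers, so nothing is copied and the original is untouched. A small sketch:

```python
import pyarrow as pa

t = pa.table({"a": [1, 2, 3]})

# Returns a NEW table; column "a"'s buffers are shared, not copied
t2 = t.append_column("b", pa.array([1.5, 2.5, 3.5]))

assert t.num_columns == 1   # original unchanged
assert t2.num_columns == 2
```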

1

Solved

{arrow}'s auto-detection of column types is causing me some trouble when opening a large CSV file. In particular, it drops leading zeroes for some identifiers and does some other unfortunate stuff. ...
Attitudinarian asked 1/3, 2022 at 8:04
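The standard cure on the Python side (R's {arrow} exposes the same idea through its schema/col_types arguments) is to pin the fragile columns to string so type inference never touches them. A pyarrow sketch with an invented file and column name:

```python
import pyarrow as pa
import pyarrow.csv as csv

# Force "id" to string so leading zeroes survive type inference
opts = csv.ConvertOptions(column_types={"id": pa.string()})
table = csv.read_csv("big_file.csv", convert_options=opts)
```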

1

Solved

I have a calculator that iterates over a couple of hundred objects and produces an Nx1 array for each of them, N here being 1-10 million depending on configuration. Right now I am summing over these by ...
Doretha asked 24/2, 2022 at 18:50

5

Solved

I am using Python, Parquet, and Spark, and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow...
Squaw asked 2/2, 2021 at 21:19
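A quick way to confirm the diagnosis is to ask the installed build whether the codec was compiled in, and to fall back while the environment is repaired (a clean reinstall of pyarrow usually restores snappy). A sketch, assuming a table named table:

```python
import pyarrow as pa
import pyarrow.parquet as pq

if pa.Codec.is_available("snappy"):
    pq.write_table(table, "out.parquet", compression="snappy")
else:
    # Temporary fallback until the snappy-enabled build is restored
    pq.write_table(table, "out.parquet", compression="none")
```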

1

Solved

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet: Given a data transformation F(a) = b, where F is fully deterministic, and the exact same versions of the entire ...
Heterothallic asked 3/12, 2021 at 21:41
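Byte-for-byte stability of Parquet output is hard to guarantee across writer versions and settings, so a more robust check of F's determinism is logical equality of the decoded tables rather than file hashes. A sketch with invented file names:

```python
import pyarrow.parquet as pq

t1 = pq.read_table("b_run1.parquet")
t2 = pq.read_table("b_run2.parquet")

# Compares schema and values, not the physical file bytes
assert t1.equals(t2)
```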

4

I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary. Recently, however, I’v...
Asben asked 27/7, 2017 at 17:4
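For context, the pandas-mediated route usually looks like the sketch below (assuming an active SparkSession named spark and a DataFrame named spark_df; on Spark 2.x the config key is spark.sql.execution.arrow.enabled instead):

```python
import pyarrow as pa

# Let Spark use Arrow for the columnar transfer in toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = spark_df.toPandas()           # Spark -> pandas via Arrow
table = pa.Table.from_pandas(pdf)   # pandas -> Arrow Table
```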

1

I am trying to read and write a trivial dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt of random Boolean values. The file/folder structure (below) wa...
Linker asked 17/5, 2021 at 17:26

1

Solved

What I am trying to do: I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have many columns and a high memory footprint (enough to crash the machin...
Halliehallman asked 4/8, 2021 at 13:29
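The usual low-memory route here is the streaming CSV reader paired with an incremental Parquet writer, so peak usage stays near one record batch rather than the whole file. A sketch with invented file names:

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

reader = csv.open_csv("big.csv")  # streaming, batch-at-a-time reader

writer = None
try:
    for batch in reader:
        if writer is None:
            # Take the schema from the first decoded batch
            writer = pq.ParquetWriter("big.parquet", batch.schema)
        writer.write_batch(batch)
finally:
    if writer is not None:
        writer.close()
```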
