apache-arrow Questions

2

Is there a language-agnostic way of representing a Parquet or Arrow schema, similar to Avro? For example, an Avro schema might look like this: { "type": "record", "...
Osmunda asked 4/1 at 23:48
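Arrow has no official text schema language like Avro's JSON, but every implementation can parse the IPC-serialized schema message, which makes it a practical language-agnostic interchange. A minimal pyarrow sketch (the field names are invented for illustration):

```python
import pyarrow as pa

# A schema roughly equivalent to an Avro record definition
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("name", pa.string()),
])

# Serialize to the Arrow IPC schema message: a binary, but
# language-agnostic, representation of the schema
buf = schema.serialize()

# Any Arrow implementation can reconstruct it from the bytes
restored = pa.ipc.read_schema(pa.BufferReader(buf))
assert restored.equals(schema)
```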

0

The pyarrow documentation repeatedly builds a custom UUID type like this: import pyarrow as pa class UuidType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.binary(...
Cigarillo asked 5/9, 2023 at 22:16
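For reference, the docs-style pattern the question quotes looks roughly like the sketch below; note that newer pyarrow releases deprecate PyExtensionType in favor of subclassing pa.ExtensionType, so treat this as the older idiom:

```python
import uuid
import pyarrow as pa

class UuidType(pa.PyExtensionType):
    """A 16-byte UUID stored as fixed-size binary."""

    def __init__(self):
        # Storage type: 16 raw bytes per value
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        # Required so the type survives pickling and IPC round-trips
        return UuidType, ()

storage = pa.array([uuid.uuid4().bytes for _ in range(3)], pa.binary(16))
arr = pa.ExtensionArray.from_storage(UuidType(), storage)
```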

0

What is the best way to pass data in the Apache Arrow format from Node.js to Rust? Storing the data in each language is easy enough, but it's the memory sharing that is giving me trouble. I'm ...
Pamulapan asked 12/8, 2023 at 2:36
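The usual answer for cross-language hand-off is the Arrow IPC stream format rather than shared memory: the bytes are self-describing and can be parsed by Arrow JS on the Node side and arrow-rs on the Rust side. A minimal sketch of the round trip (shown in Python for brevity; the JS and Rust readers consume the same bytes):

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Serialize to the Arrow IPC stream format
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
ipc_bytes = sink.getvalue()

# Deserialize on the other side (here Python; arrow-rs and
# Arrow JS expose equivalent stream readers)
roundtrip = pa.ipc.open_stream(ipc_bytes).read_all()
assert roundtrip.equals(table)
```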

0

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method and ParquetFile.iter_batches. Using pyarrow.dataset >>> ...
Ting asked 4/8, 2023 at 1:11
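Both APIs accept an explicit batch_size, which is the first thing to pin down when comparing their memory behaviour; the dataset scanner also reads ahead across files, which can account for extra resident memory. A sketch with a made-up directory layout:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Dataset API: scan with an explicit batch size
dataset = ds.dataset("data/", format="parquet")
for batch in dataset.to_batches(batch_size=64_000):
    pass  # process one RecordBatch at a time

# ParquetFile API: the same batch size, for a fair comparison
pf = pq.ParquetFile("data/part-0.parquet")
for batch in pf.iter_batches(batch_size=64_000):
    pass
```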

1

Solved

I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512 MB to 1 GB (here). Several online sources (e.g. this one) suggest that the default row g...
Romanfleuve asked 27/7, 2023 at 17:06
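One detail worth knowing when acting on that guidance: pyarrow's row_group_size is counted in rows, not bytes, so a byte target has to be translated through the average row width. A rough sketch, assuming an in-memory table named table (on-disk size will be smaller after encoding and compression):

```python
import pyarrow.parquet as pq

# Translate a ~512 MB byte target into a row count
target_bytes = 512 * 1024 * 1024
bytes_per_row = table.nbytes / table.num_rows  # in-memory estimate
rows_per_group = int(target_bytes / bytes_per_row)

pq.write_table(table, "out.parquet", row_group_size=rows_per_group)
```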

4

I have a Feather-format file, sales.feather, that I am using to exchange data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_fra...
Constrained asked 1/12, 2018 at 9:49
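The Python side of that exchange is symmetric; a minimal sketch with invented sample data:

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"sales": [100, 200], "region": ["N", "S"]})

# Write a Feather (V2) file that R's arrow::read_feather can load
feather.write_feather(df, "sales.feather")

# And read one back into pandas
df2 = feather.read_feather("sales.feather")
```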

1

Solved

Let's say I have the following dataframe, with columns Code and Price and rows AA1 10, AA1 20, BB2 30, and I want to perform the following operation on it: df.groupby("code").aggregate({ "price"...
Tnt asked 3/4, 2023 at 9:06
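The snippet in the question is pandas-style, but newer pyarrow (7.0+) can do the grouping natively via Table.group_by. The exact aggregation the question wants is cut off, so the sketch shows two common choices:

```python
import pyarrow as pa

table = pa.table({"code": ["AA1", "AA1", "BB2"],
                  "price": [10, 20, 30]})

# Collect prices into lists and total them per code
result = table.group_by("code").aggregate([
    ("price", "list"),
    ("price", "sum"),
])
```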

2

I have a problem with file types when converting a parquet file to a dataframe. I do: bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() rea...
Urbannal asked 25/2, 2019 at 12:45
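A minimal version of that read path, for orientation (bucket path as in the question; dtype surprises usually show up in the final to_pandas() step):

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read all parquet files under the prefix into one Arrow table
table = pq.ParquetDataset("s3://some_bucket/test/usages",
                          filesystem=s3).read()
df = table.to_pandas()
```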

6

Solved

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: parquet files were created with a Spark pipeline and saved on S3, to be read with RStudio/RShiny with one column as in...
Bugbear asked 17/10, 2019 at 20:02
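R's arrow::open_dataset wraps the same C++ dataset layer as pyarrow, so the shape of the answer is the same in both languages. A pyarrow sketch with an invented bucket path; "hive" partitioning turns Spark's col=value/ directory names back into columns:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/path/", format="parquet",
                     partitioning="hive")
table = dataset.to_table()
```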

1

Solved

I'm not sure where to begin, so I'm looking for some guidance. I'm looking for a way to create some arrays/tables in one process and have them accessible (read-only) from another. So I create a pyarrow....
Silverplate asked 8/2, 2023 at 23:34
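One well-trodden route for read-only sharing is to write the table as an Arrow IPC file and memory-map it from the other process: mapped reads are zero-copy, so multiple processes share the same physical pages. A sketch with an invented file name:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# Process 1: persist the table as an Arrow IPC file
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Process 2: memory-map it; read_all() is zero-copy over the map
with pa.memory_map("shared.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```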

3

How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.
Nemertean asked 1/11, 2020 at 7:40
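A ".arrow" file in this context is the Arrow IPC file format, which is what Arrow JS consumers such as Arquero read. A minimal sketch with invented data:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Write the Arrow IPC file format under a .arrow extension
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```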

3

I have data that is stored in-memory in a client's browser. For example, let's say the dataset has columns "name" (string), "age" (int32), and "isAdult" (bool), with rows: "Tom", 29, 1; "Tom", 14, 0; "Dina", 20...
Marquittamarr asked 15/6, 2019 at 0:36

2

Solved

I am attempting to use the arrow package's (relatively recently implemented) Dataset API to read a directory of files into memory and leverage the C++ back-end to filter rows and columns. I would...
Precinct asked 28/4, 2021 at 15:26
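The R package drives the same scanner as pyarrow, so the mechanics are easiest to see in a short Python sketch (directory and column names invented): projection and filtering are pushed into the C++ scan, so non-matching data never materializes in the session.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data_dir/", format="parquet")

# Column projection + row filter, both evaluated in the C++ back-end
table = dataset.to_table(
    columns=["id", "value"],
    filter=ds.field("value") > 100,
)
```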

2

Solved

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI but I recently came acro...
Empathize asked 19/3, 2021 at 15:17

0

I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be: library(tidyverse) df = mtcars df |> mutate(row_id = 1:nrow(df)) # any ...
Tease asked 27/6, 2022 at 7:21
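One out-of-core pattern for this is to stream the dataset batch by batch and assign a running index as you re-write it, so only one batch is ever in memory. A pyarrow sketch (paths invented):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset("big_data/", format="parquet")
schema = dataset.schema.append(pa.field("row_id", pa.int64()))

offset = 0
with pq.ParquetWriter("indexed.parquet", schema) as writer:
    for batch in dataset.to_batches():
        # Running index continues across batches
        ids = pa.array(range(offset, offset + batch.num_rows),
                       type=pa.int64())
        t = pa.Table.from_batches([batch]).append_column("row_id", ids)
        writer.write_table(t)
        offset += batch.num_rows
```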

2

Solved

I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error: Error in vecseq(f__...
Sporule asked 7/4, 2020 at 23:59

2

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentat...
Ejective asked 28/7, 2021 at 11:11
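With the pyarrow.fs interface there is no single copyFromLocal call, but streaming between an input and an output stream does the job; there is also a pyarrow.fs.copy_files helper for whole files or trees. A sketch with an invented namenode and paths:

```python
from pyarrow import fs

local = fs.LocalFileSystem()
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Manual copyFromLocal: stream bytes across the two filesystems
with local.open_input_stream("/tmp/data.csv") as src, \
     hdfs.open_output_stream("/data/data.csv") as dst:
    dst.write(src.read())  # for huge files, copy in chunks instead
```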

1

Solved

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said that Tabl...
Performance asked 10/3, 2022 at 17:58
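The short version of how those two statements coexist: mutation-style operations return a new Table that shares the old one's buffers, so nothing is copied and the original is untouched. A small sketch:

```python
import pyarrow as pa

t = pa.table({"a": [1, 2, 3]})

# Returns a NEW table; column "a"'s buffers are shared, not copied
t2 = t.append_column("b", pa.array([1.5, 2.5, 3.5]))

assert t.num_columns == 1   # original unchanged
assert t2.num_columns == 2
```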

1

Solved

{arrow}'s auto-detection of column types is causing me some trouble when opening a large CSV file. In particular, it drops leading zeroes for some identifiers and does some other unfortunate stuff. ...
Attitudinarian asked 1/3, 2022 at 8:04
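The standard cure on the Python side (R's {arrow} exposes the same idea through its schema/col_types arguments) is to pin the fragile columns to string so type inference never touches them. A pyarrow sketch with an invented file and column name:

```python
import pyarrow as pa
import pyarrow.csv as csv

# Force "id" to string so leading zeroes survive type inference
opts = csv.ConvertOptions(column_types={"id": pa.string()})
table = csv.read_csv("big_file.csv", convert_options=opts)
```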

1

Solved

I have a calculator that iterates over a couple of hundred objects and produces an Nx1 array for each of them, N here being 1-10 million depending on configuration. Right now I am summing over these by ...
Doretha asked 24/2, 2022 at 18:50

5

Solved

I am using Python, Parquet, and Spark, and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow...
Squaw asked 2/2, 2021 at 21:19
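A quick way to confirm the diagnosis is to ask the installed build whether the codec was compiled in, and to fall back while the environment is repaired (a clean reinstall of pyarrow usually restores snappy). A sketch, assuming a table named table:

```python
import pyarrow as pa
import pyarrow.parquet as pq

if pa.Codec.is_available("snappy"):
    pq.write_table(table, "out.parquet", compression="snappy")
else:
    # Temporary fallback until the snappy-enabled build is restored
    pq.write_table(table, "out.parquet", compression="none")
```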

1

Solved

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet: Given a data transformation F(a) = b, where F is fully deterministic, and the exact same versions of the entire ...
Heterothallic asked 3/12, 2021 at 21:41
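Byte-for-byte stability of Parquet output is hard to guarantee across writer versions and settings, so a more robust check of F's determinism is logical equality of the decoded tables rather than file hashes. A sketch with invented file names:

```python
import pyarrow.parquet as pq

t1 = pq.read_table("b_run1.parquet")
t2 = pq.read_table("b_run2.parquet")

# Compares schema and values, not the physical file bytes
assert t1.equals(t2)
```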

4

I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary. Recently, however, I’v...
Asben asked 27/7, 2017 at 17:4
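For context, the pandas-mediated route usually looks like the sketch below (assuming an active SparkSession named spark and a DataFrame named spark_df; on Spark 2.x the config key is spark.sql.execution.arrow.enabled instead):

```python
import pyarrow as pa

# Let Spark use Arrow for the columnar transfer in toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = spark_df.toPandas()           # Spark -> pandas via Arrow
table = pa.Table.from_pandas(pdf)   # pandas -> Arrow Table
```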

1

I am trying to read and write a trivial dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt of random Boolean values. The file/folder structure (below) wa...
Linker asked 17/5, 2021 at 17:26

1

Solved

What I am trying to do: I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have many columns and a high memory footprint (enough to crash the machin...
Halliehallman asked 4/8, 2021 at 13:29
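The usual low-memory route here is the streaming CSV reader paired with an incremental Parquet writer, so peak usage stays near one record batch rather than the whole file. A sketch with invented file names:

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

reader = csv.open_csv("big.csv")  # streaming, batch-at-a-time reader

writer = None
try:
    for batch in reader:
        if writer is None:
            # Take the schema from the first decoded batch
            writer = pq.ParquetWriter("big.parquet", batch.schema)
        writer.write_batch(batch)
finally:
    if writer is not None:
        writer.close()
```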
