Is it possible to append rows to an existing Arrow (PyArrow) Table?
Asked Answered
P

1

6

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said

Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation.

However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.

I am unsure if this is operation is at all possible/makes sense (in which case I'd like to know how) or if it doesn't (in which case, pyarrow.concat_tables is exactly what I need).

Similar questions:

Performance answered 10/3, 2022 at 17:58 Comment(1)
This should probably be explained more clearly somewhere but effectively Table is a container of pointers to actual data. So you can concatenate two tables, and you'll just be copying the pointers, not the data itself. Hence "If promote==False, a zero-copy concatenation will be performed."Krawczyk
K
11

Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:

  • A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
  • An Array contains 0+ Buffers and imposes some sort of semantics into them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute this to upstream docs. But I hope what I mean here is reasonably clear.)
  • A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
  • A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
  • A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.

Hence, you can concantenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of buffers.

Krawczyk answered 10/3, 2022 at 18:29 Comment(6)
Superb explanation! Thanks for spelling out the relationship between Table -> ChunkedArray -> Array and RecordBatch -> Array. The Python docs mention all the arrays separately but then ChunkedArrays are introduced suddenly when talking about tables, perhaps a better categorization could be done there.Performance
You wrote that "conceptually the "column" is not "split" across multiple buffers" in some sense, but not in another. Would it be correct to say that there is a bijection between elements of the Array and elements of each of the 0+ Buffers? PyArrow documentation is unclear about the relation between an Array and Buffers. It may seem that "Array can hold multiple Buffers" contradicts "Array is contiguous". I guess, there is no contradiction because Buffers represent different aspects of the Array, e.g. one Buffer stores int values, another Buffer stores NaN bitmap. Is this correct?Cheviot
Is there any resource that says why arrow arrays are immutable? Say compared to numpy arrays, there would be no problem modifying values directly there.Booth
Some Arrow implementations allow mutable arrays (e.g. Java), but the C++ one that underlies PyArrow does not. Mutable arrays have their own pitfalls, which Pandas and Arrow-Java users have to deal with.Krawczyk
@Cheviot sorry for the delay, but I think that's correct. Well, with a caveat; StringView arrays can have multiple data buffers. So you could almost zero-copy concatenate two StringView arrays, unlike other arrays, but you'd still have to copy their validity and offset buffers.Krawczyk
And consequently there isn't a bijection anymore, elements in a StringView array will "appear" once in the validity and offset buffers, but will appear in at most 1 of the data buffers.Krawczyk

© 2022 - 2024 — McMap. All rights reserved.