I'm not sure where to begin, so I'm looking for some guidance. I want to create some arrays/tables in one process and have them accessible (read-only) from another.
So I create a pyarrow.Table like this:
a1 = pa.array(list(range(3)))
a2 = pa.array(["foo", "bar", "baz"])
a1
# <pyarrow.lib.Int64Array object at 0x7fd7c4510100>
# [
#   0,
#   1,
#   2
# ]
a2
# <pyarrow.lib.StringArray object at 0x7fd7c5d6fa00>
# [
#   "foo",
#   "bar",
#   "baz"
# ]
tbl = pa.Table.from_arrays([a1, a2], names=["num", "name"])
tbl
# pyarrow.Table
# num: int64
# name: string
# ----
# num: [[0,1,2]]
# name: [["foo","bar","baz"]]
Now how do I read this from a different process? I thought I would use multiprocessing.shared_memory.SharedMemory, but that didn't quite work:
shm = shared_memory.SharedMemory(name='pa_test', create=True, size=tbl.nbytes)
with pa.ipc.new_stream(shm.buf, tbl.schema) as out:
    for batch in tbl.to_batches():
        out.write(batch)
# TypeError: Unable to read from object of type: <class 'memoryview'>
Do I need to wrap the shm.buf with something?
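Something like this is what I have in mind, wrapping the memoryview in an Arrow buffer first. This is an untested sketch; I'm only guessing that pa.py_buffer and pa.FixedSizeBufferWriter are the right wrappers, and I suspect the segment needs to be somewhat larger than tbl.nbytes to leave room for the IPC schema/framing:

import pyarrow as pa
from multiprocessing import shared_memory

# untested sketch: over-allocate a bit for the IPC schema/framing overhead (the 4096 is a guess)
shm = shared_memory.SharedMemory(name='pa_test', create=True, size=tbl.nbytes + 4096)
buf = pa.py_buffer(shm.buf)            # Arrow buffer view over the shared memory
sink = pa.FixedSizeBufferWriter(buf)   # writable NativeFile backed by that buffer
with pa.ipc.new_stream(sink, tbl.schema) as out:
    for batch in tbl.to_batches():
        out.write(batch)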
Even if I get this to work, it seems very fiddly. How would I do this in a robust manner? Do I need something like zmq?
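If something like zmq is the answer, I imagine it would only carry coordination (e.g. the name of the shared memory block), not the table bytes themselves. A rough sketch of what I have in mind (the socket address and message are just made up for illustration):

import zmq

# sketch only: the producer announces the shared-memory segment name;
# the actual table data never goes through the socket
ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("tcp://127.0.0.1:5555")   # hypothetical address
sock.send_string("pa_test")         # consumer uses this name to attach to the segment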
I'm not clear how this is zero-copy, though. When I write the record batches, isn't that serialisation? What am I missing?
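For context, this is roughly how I picture the reading side (again untested; I'm assuming pa.ipc.open_stream can read from a BufferReader over the shared memory). My understanding is that the resulting columns would reference the shared memory directly rather than being copied, but that's exactly the part I'd like confirmed:

from multiprocessing import shared_memory
import pyarrow as pa

# attach to the segment created by the writer process
shm = shared_memory.SharedMemory(name='pa_test')

# untested: read the IPC stream straight out of the shared buffer
reader = pa.ipc.open_stream(pa.BufferReader(pa.py_buffer(shm.buf)))
tbl = reader.read_all()   # if this really is zero-copy, the columns point into shm.buf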
In my real use case, I also want to talk to Julia, but maybe that should be a separate question when I come to it.
PS: I have gone through the docs, but they didn't clarify this part for me.
calculate_ipc_size also serialises the data? That would mean I end up serialising twice. I intend to share a large-ish table, so that might mean a noticeable delay. – Silverplate