I'm looking for fast ways to store and retrieve a numpy array using pyarrow. I'm pretty satisfied with retrieval: it takes less than 1 second to extract columns from my .arrow file, which contains 1,000,000,000 integers of dtype = np.uint16.
import pyarrow as pa
import numpy as np

def write(arr, name):
    # Each row of the 2-D numpy array becomes one Arrow column, named '0', '1', ...
    arrays = [pa.array(col) for col in arr]
    names = [str(i) for i in range(len(arrays))]
    batch = pa.RecordBatch.from_arrays(arrays, names=names)
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
            writer.write_batch(batch)

def read(name):
    # Memory-map the file and yield each column back as a numpy array.
    source = pa.memory_map(name, 'r')
    table = pa.ipc.RecordBatchStreamReader(source).read_all()
    for i in range(table.num_columns):
        yield table.column(str(i)).to_numpy()
arr = np.random.randint(65535, size=(250, 4000000), dtype=np.uint16)
%%timeit -r 1 -n 1
write(arr, 'test.arrow')
>>> 25.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1
for n in read('test.arrow'): n
>>> 901 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
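For reference, a rough lower bound set by disk throughput alone can be measured by dumping the same ~2 GB buffer as raw, headerless bytes (test.bin is just an illustrative path):

%%timeit -r 1 -n 1
arr.tofile('test.bin')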
Can the efficiency of writing to the .arrow format be improved? In addition, I tested np.save:
%%timeit -r 1 -n 1
np.save('test.npy', arr)
>>> 18.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
It looks a little bit faster. Can writing to the .arrow format with Apache Arrow be optimised further?
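One thing that might be worth trying (a sketch only, not a verified speed-up on this data) is to keep the 2-D array as a single contiguous Arrow Tensor rather than converting it into 250 separate columns, using pa.Tensor.from_numpy together with pa.ipc.write_tensor:

import pyarrow as pa

def write_tensor(arr, name):
    # Wrap the whole 2-D array as one Arrow Tensor; no per-column conversion.
    tensor = pa.Tensor.from_numpy(arr)
    with pa.OSFile(name, 'wb') as sink:
        pa.ipc.write_tensor(tensor, sink)

def read_tensor(name):
    # Memory-map the file and rebuild the numpy array without copying.
    source = pa.memory_map(name, 'r')
    return pa.ipc.read_tensor(source).to_numpy()

The write is still bounded by disk speed, but it skips the column-by-column pa.array conversion and writes the buffer in one go.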
Does np.random.randint() return a generator or similarly lazy structure? Might you be timing the random number generation as well as the writes? (When I write to parquet from pandas, it's much faster than that, even on an HDD.) – Charissa

I time write and read as shown in my code. I really wonder what is wrong with my testing. I've done it on Google Colab too, and I got 16 s for writing and 24 ms for reading. I'm going to try different methods as well. – Nunatak
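For completeness, this is roughly the pandas-to-parquet write the first comment refers to (test.parquet is just an illustrative path; no claim about its speed on this data):

import numpy as np
import pandas as pd

arr = np.random.randint(65535, size=(250, 4000000), dtype=np.uint16)
df = pd.DataFrame(arr.T)                # each row of arr becomes a DataFrame column
df.columns = df.columns.astype(str)     # Parquet requires string column names
df.to_parquet('test.parquet')           # uses the pyarrow engine by default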