Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead of time. The keys also need to be stored as a column. I have a method below to construct the table row by row - is there another method that is faster? For context, I want to parse a large dictionary into a pyarrow table to write out to a parquet file. RAM usage is less of a concern than CPU time. I'd prefer not to drop down to the arrow C++ API.

import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
   })

start = time.time()

tables = []
for key, item in large_dict.items():
    val1, val2 = item
    tables.append(
            pa.Table.from_pydict({
                    "key"  : [key],
                    "col1" : [val1],
                    "col2" : [val2]
                }, schema = schema)

            )

table = pa.concat_tables(tables)
end = time.time()
print(end - start) # 22.6 seconds on my machine

Lion answered 14/9, 2019 at 20:37 Comment(0)

Since the schema is known ahead of time, you can build one Python list per column and then construct the table from a dictionary of column-name/column-values pairs.

%%timeit -r 10
import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
  })

keys = []
val1 = []
val2 = []
for k, (v1, v2) in large_dict.items():
  keys.append(k)
  val1.append(v1)
  val2.append(v2)

table = pa.Table.from_pydict(
    dict(
        zip(schema.names, (keys, val1, val2))
    ),
    schema=schema
)

2.92 s ± 236 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
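
Since the stated goal is to write the table out to a Parquet file, the resulting table can then be handed straight to pyarrow.parquet (a minimal sketch; the output file name is just illustrative):

import pyarrow.parquet as pq

# write the assembled table to a Parquet file
pq.write_table(table, "large_dict.parquet")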

Guerin answered 14/9, 2019 at 22:12 Comment(4)
Thanks, this is faster. Would you expect any benefit from using pyarrow arrays instead of lists? I know the number of elements ahead of time, so I could pre-allocate. – Lion
That should give similar performance; pyarrow implicitly converts the native Python list to an array: github.com/apache/arrow/blob/… – Guerin
pyarrow arrays are immutable, so you'll have a hard time appending to them. But you could use a numpy ndarray, which should be faster than Python lists (see the sketch after these comments). – Envenom
If the schema is not known ahead of time, just use pa.Table.from_pydict() without a pa.schema and it will infer the data types. – Pol
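
A minimal sketch of the numpy pre-allocation idea from the comments above (assuming the large_dict and schema from the question; the dtypes are chosen to match the schema, and the string column stays a plain Python list for simplicity):

import numpy as np
import pyarrow as pa

n = len(large_dict)

# pre-allocate the numeric columns as numpy arrays sized to the dict
keys = np.empty(n, dtype=np.uint32)
col1 = np.empty(n, dtype=np.uint8)
col2 = [None] * n  # strings kept as a plain Python list

for i, (k, (v1, v2)) in enumerate(large_dict.items()):
    keys[i] = k
    col1[i] = v1
    col2[i] = v2

# pa.array converts the numpy buffers cheaply, without a per-element Python loop
table = pa.Table.from_arrays(
    [pa.array(keys), pa.array(col1), pa.array(col2, type=pa.string())],
    schema=schema,
)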

I am playing with pyarrow as well. It looks like the data-preparation stage (random, etc.) is the most time-consuming part of your code, so it may help to first convert the data into a dict of arrays and then feed them to an Arrow table.

Below is an example based on your data that %%timeits only the table-population stage, but builds the table from RecordBatch.from_arrays() calls, each fed a list of three arrays.

# get_data(l0, l1_0, l2, i) builds the three column arrays (key, col1, col2)
# for batch i; it is defined in the sample notebook linked below
I = (pa.RecordBatch.from_arrays(get_data(l0, l1_0, l2, i), schema=schema)
     for i in range(1000))

T1 = pa.Table.from_batches(I, schema=schema)

With a static data set of 1000 rows batched 1000 times, the table is populated in an impressive 15 ms :) (maybe due to caching). With the 1000 rows modified each batch (e.g. col1 * i), it takes 33.3 ms, which also looks good.

My sample notebook is here
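
For reference, here is a self-contained sketch of the same batching idea; the get_data helper below is a hypothetical stand-in for the one in the notebook, returning three equally sized pyarrow arrays per batch:

import random
import string

import pyarrow as pa

schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
   })

def get_data(batch_no, batch_size=1000):
    # hypothetical stand-in: build the three columns for one batch
    keys = [batch_no * batch_size + j for j in range(batch_size)]
    col1 = [random.randint(0, 5) for _ in range(batch_size)]
    col2 = [random.choice(string.ascii_letters) for _ in range(batch_size)]
    return [pa.array(keys, type=pa.uint32()),
            pa.array(col1, type=pa.uint8()),
            pa.array(col2, type=pa.string())]

# build 1000 RecordBatches lazily and assemble them into a single Table
batches = (pa.RecordBatch.from_arrays(get_data(i), schema=schema)
           for i in range(1000))
table = pa.Table.from_batches(batches, schema=schema)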

PS: I wondered whether numba JIT would help, but it only seems to make the timing worse here.

Modular answered 18/10, 2019 at 17:51 Comment(0)
