How can I convert a ndarray/multi-dimensional array to a parquet file?

I have a <class 'numpy.ndarray'> array that I would like to save to a parquet file so I can pass it to an ML model I'm building. My array contains 159573 rows, and each row has 1395 values.

Here is a sample of my data:

[[0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 ...
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]]

I tried to convert using this code:

import pyarrow as pa
pa_table = pa.table({"data": Main_x})
pa.parquet.write_table(pa_table, "full_data.parquet")

I get this stacktrace:

5 frames
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.table()

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

I'm wondering if there is a way to save a multi-dimensional array to parquet?

Disclosure asked 12/8, 2021 at 15:3

Parquet/Arrow isn't best suited to saving this type of data. It's better at dealing with tabular data that has a well-defined schema and specific column names and types. In particular, the numpy conversion API only supports one-dimensional data.

That said, you can easily convert your 2-D numpy array to parquet, but you need to massage it first.

Your best option is to save it as a table with n columns of m doubles each.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

matrix = np.random.rand(10, 100)
arrays = [
    pa.array(row)  # Each row of the matrix becomes one Arrow array (one table column)
    for row in matrix
]

table = pa.Table.from_arrays(
    arrays,
    names=[str(i) for i in range(len(arrays))]  # Name each column by its index
)
# Save it:
pq.write_table(table, 'table.pq')

# Read it back as numpy (transpose to restore the original 10 x 100 shape):
table_from_parquet = pq.read_table('table.pq')
matrix_from_parquet = table_from_parquet.to_pandas().T.to_numpy()

The intermediate table has 10 columns and 100 rows, one column per row of the original matrix:

|         0 |          1 |          2 |         3 |          4 |          5 |          6 |         7 |         8 |          9 |
|----------:|-----------:|-----------:|----------:|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| 0.45774   | 0.92753    | 0.252345   | 0.982261  | 0.503732   | 0.543526   | 0.22827    | 0.347948  | 0.654259  | 0.152693   |
| 0.287813  | 0.793067   | 0.972282   | 0.739047  | 0.0689906  | 0.102235   | 0.110273   | 0.166839  | 0.907481  | 0.427729   |
| 0.523928  | 0.511737   | 0.473887   | 0.771607  | 0.707633   | 0.276726   | 0.943073   | 0.788174  | 0.305119  | 0.511876   |
| 0.67563   | 0.947449   | 0.895125   | 0.246979  | 0.703503   | 0.256418   | 0.93113    | 0.116715  | 0.330746  | 0.566704   |
| 0.471526  | 0.45332    | 0.546384   | 0.822873  | 0.333542   | 0.518933   | 0.229525   | 0.381977  | 0.893204  | 0.932781   |
...
Rump answered 12/8, 2021 at 15:28

Comments (3):
Disclosure: Wow, this is so cool, thank you! Just to confirm my understanding: you took every column index from numpy and converted it to a string (since there are no names, the name is a string based on the numerical index), then wrote the table? Thanks for the code example, it is super helpful. I confirmed they match perfectly: (Main_x == matrix_from_parquet).all() resulted in "True".
Rump: Numpy doesn't have a notion of column names, so I just generated names using the index of each column ("0", "1", "2", ...). I also had to create a pyarrow Array for each column. Then I put the arrays and names together to create a table.
Sajovich: What data format is suitable for storing large (~300GB) numpy arrays?
