Data structure for n-dimensional array / tensor such that A[0, :, :] and A[1, :, :] can have different shapes

With Python/Numpy, I'm working with n-dimensional data (ideally in an ndarray) such that:

  • (1): ragged array

    For example, A[0, :, :], ..., A[49, :, :] can be of shape 100x100, while A[50, :, :] could be of shape 10000x10000. I don't want to create an ndarray of shape (..., 10000, 10000) because it would waste space: A[n, :, :] only contains 100x100 coefficients for n = 0 .. 49.

  • (2): labeled indexing instead of positional indexing

    I would also like to be able to work with B[:, :, :, :, 20220623180800] or B[:, :, :, :, 19700101000000], i.e. to index the last dimension by a numerical timestamp in YYYYMMDDhhmmss format, or more generally by an integer label (not restricted to a contiguous 0 .. n-1 range).

  • (3): easy Numpy-like arithmetic

    All of this should preserve, as much as possible, the standard Numpy operations, such as B.mean(axis=4) to average the data over all timestamps.

  • (4): serialization / random access

    It should be possible to save 100 GB of data in such a structure to disk. Then, the next day, if we only want to modify a few values, we should be able to save the changes without rewriting the full 100 GB file:

    x = datastore.open('datastore.dat')                              # open the data store, *without* loading everything in memory
    x[20220624000000, :, :, :] = 0                                   # modify some values
    x[20220510120000, :, :, :] -= x[20220510120000, :, :, :].mean()  # modify other values
    x.close()                                                        # only a few bytes written to disk 
    

What is the right data structure for such an n-dimensional array, with Numpy or Pandas? (NB: I will probably have 5 or 6 dimensions.)

Collative answered 23/6, 2022 at 16:11 Comment(11)
Have you tried with tensorflow tf.constant?Photocathode
Your A is impossible with numpy. Your B doesn't make sense. B[:,:,...1000] means B indexed on the last dimension with 1000.Nankeen
@Nankeen With a standard numpy ndarray, you're right, A is impossible; my goal is to see if there are solutions "close" in spirit to how ndarrays work. B is an example of a 5D array whose last dimension is indexed by (big) numbers. Of course, for B to make sense we need a sparse structure; otherwise it would uselessly require petabytes of data :)Collative
Another thing to look at is the Awkward Array package (from Scikit-HEP), which does deal with jagged arrays, but not with labeling, and it doesn't fully conform to the numpy array protocol, so it's not compatible with xarray or any of the other options you're looking at. ¯\_(ツ)_/¯Norman
Does anyone know if (1), (2), (3), (4) would be doable with HDF5 format? Would be a great bounty answer!Collative
(3) is unclear for the labeled axis (HDF5 dataset-keys), probably yes but with some custom programming. Can you elaborate on the requirements in more detail? (1), (2), (4), definitely. It seems like the best option anyway.Propriety
Thanks @MichaelSzczesny! An answer with HDF5 showing 1, 2, 4 and partial 3 would be great! Here are the more precise requirements: afewthingz.com/ndarraydatastore If 3 is not available for the labeled axis, it's not such a big deal: when I need to average over one labeled axis and fix a value for another labeled axis, the remaining dimensions will be fixed and not ragged, so I can probably convert to a normal ndarray in that case.Collative
I do not think there is an efficient mathematical model for jagged arrays. With regular tensors, you only need to keep track of strides. With jagged arrays laid out contiguously in memory, you need to keep track of all sizes for all axes (which grows exponentially with the number of dims), and slicing/resizing/reshaping would be a nightmare. Sparse data might be the way to go if the memory/computation trade-off is favourable. Otherwise, a masked array may also be helpful with jaggedness. This would cover 1, 3, 4. For 2, xarray, but not sure if it supports sparse/masked arrays.Dullard
1 is fundamentally incompatible with 2, 3, and 4. If you're willing to give up 1 there are good solutions to your problem. There are no good solutions if you keep 1.Divulgate
While this is true from a theoretical/CS standpoint, practically you can still have slower or faster libraries in Python. String arrays are jagged, but numpy has invested a lot of time in improving string array performance. They're not as performant as numerical arrays, but they're better than carrying around lists of lists, right? So I think it's a fair question, though I don't think any library checks all these boxes right nowNorman
@MichaelDelgado string arrays in numpy aren't jagged, they're fixed-length and padded with null bytes, for the usual reasonsDivulgate

Another option is to combine sparse with Zarr. The support for persistent sparse arrays in Zarr (AFAIU) is only a by-product of the compression (see GH issue), but it might be suitable for your purpose:

import numpy as np
from numpy import random
from sparse import DOK
from tqdm import tqdm
from zarr import open as zarr_open

s3 = DOK(
    (10, 20, 30, 300000000000), fill_value=np.nan
)  # allowing timestamps up to year 3000!
s3[:5, :15, 0, 202206240936] = random.rand(5, 15)

z = zarr_open("data/example.zarr", mode="w", shape=s3.shape, chunks=(10, 10, 5, 1))

# there is a better way to update the elements in bulk using .set_coordinate_selection (sketched below)
for coords, val in tqdm(s3.data.items()):
    z[coords] = val

Note that the chunking is going to play a critical role: if the chunks are too big, then the stored amount is going to be larger (because Zarr will store the full-size chunk, even if only a subset of the chunk had non-default values), while if the chunks are too small, there might be some performance hit (if your data has substantial dense blocks/areas).
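
For what it's worth, the bulk update mentioned in the code comment above could look roughly like this. This is an untested sketch: it assumes s3.data is the DOK dictionary of {coordinates: value} pairs (as used in the loop above), reuses the z array opened above, and relies on the Zarr v2 properties nchunks / nchunks_initialized:

# bulk write: one vectorized call instead of a Python-level loop
coords = np.array(list(s3.data.keys()))    # shape (n_values, ndim)
vals = np.array(list(s3.data.values()))
z.set_coordinate_selection(tuple(coords.T), vals)

# rough check of the chunking behaviour: total chunks vs. chunks actually stored
print(z.nchunks, z.nchunks_initialized)

Only the chunks actually touched by a write end up in the store, which is what makes the cheap partial updates of requirement (4) possible.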

Edit: Zarr is supported by Xarray, so it will work well for distributed computations.

Bookmark answered 27/6, 2022 at 7:23 Comment(0)

It looks like you're looking for xarray. Indices in numpy are purely positional. You can't have an index in numpy be a timestamp, because the first index is always 0, and the last index is always len(axis) - 1.

xarray uses n-dimensional arrays as a computational engine, but adds the concept of labeled indexing from pandas. It's a NumFOCUS-supported project with a lot of users and growing tie-ins to pandas, numpy, and dask (for distributed processing). You can easily create an ND-Array with e.g. datetime coordinates (dimension labels) and select using these labels. You can also use the sparse package's COO arrays as a backend if desired.

See the quickstart for an introduction.

For example, you can create an array from a numpy NDArray, but add dimension names and coordinate labels:

import xarray as xr, numpy as np, pandas as pd

da = xr.DataArray(
    np.random.random(size=(10, 10, 100)),
    dims=['x', 'y', 'time'],
    coords=[
        range(10),
        range(-100, 0, 10),
        pd.date_range('2022-06-23 18:08', periods=100, freq='s'),
    ],
)

Here's what this looks like displayed:

In [3]: da
Out[3]:
<xarray.DataArray (x: 10, y: 10, time: 100)>
array([[[5.20920842e-01, 4.69121072e-01, 6.40222454e-01, ...,
         2.99971293e-01, 2.62265561e-01, 6.35366406e-01],
        ...,
        [2.67650196e-01, 1.83472873e-01, 9.28958673e-01, ...,
         2.54365478e-01, 5.31364961e-01, 7.64313509e-01]],
...

       [[4.36503680e-01, 6.04280469e-01, 3.74281880e-01, ...,
         9.41795201e-03, 2.45035315e-01, 4.36213072e-01],
        ...,
        [2.70554857e-01, 9.81791362e-01, 3.67033886e-01, ...,
         2.37171168e-01, 3.92829137e-01, 1.18888502e-02]]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
  * time     (time) datetime64[ns] 2022-06-23T18:08:00 ... 2022-06-23T18:09:39

The underlying array is still numpy:

In [4]: type(da.data)
Out[4]: numpy.ndarray

You can select along dimensions positionally, or by label using .sel:

In [5]: da.sel(time='2022-06-23T18:09:01')
Out[5]:
<xarray.DataArray (x: 10, y: 10)>
array([[0.61802968, 0.44798696, 0.53146839, 0.54672015, 0.52251633,
        0.69215547, 0.84386726, 0.72421072, 0.87467204, 0.87845358],
       [0.22257334, 0.32035713, 0.08175992, 0.34816822, 0.84258207,
        0.80708575, 0.02339722, 0.1904887 , 0.77412369, 0.34198665],
       [0.4987155 , 0.05057836, 0.11611118, 0.95652761, 0.88992791,
        0.15960549, 0.31591357, 0.77504342, 0.04418024, 0.02722908],
       [0.76613849, 0.88007545, 0.27904722, 0.56225594, 0.39773015,
        0.23494531, 0.54437166, 0.41985857, 0.92803277, 0.63992328],
       [0.00981116, 0.2688392 , 0.17421749, 0.45761431, 0.74987955,
        0.8115907 , 0.42623655, 0.9660985 , 0.25014544, 0.47767839],
       [0.21176705, 0.17295334, 0.25520267, 0.17743549, 0.10468529,
        0.48232753, 0.55139512, 0.9658701 , 0.52430646, 0.99446656],
       [0.83707974, 0.07546811, 0.70503445, 0.62984982, 0.5956393 ,
        0.93147836, 0.97454177, 0.92595764, 0.4889221 , 0.59362206],
       [0.04210777, 0.56803518, 0.78362288, 0.54106628, 0.09178342,
        0.63581206, 0.03913531, 0.43868853, 0.22767441, 0.86995461],
       [0.88047   , 0.86284775, 0.26553173, 0.06123448, 0.55392798,
        0.44922685, 0.18933487, 0.16720496, 0.40440954, 0.79741338],
       [0.22714674, 0.76756767, 0.08131078, 0.64319224, 0.39983711,
        0.792     , 0.32000998, 0.42772083, 0.19313205, 0.35174807]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
    time     datetime64[ns] 2022-06-23T18:09:01

Alignment in xarray is done by dimension name rather than axis order, so there's no reason to have an array with shape (1, 1, 1, 1, 1000). Instead, just ensure that dimension names are consistent across your arrays, and two arrays with shared dimension names will be broadcast against each other correctly. See the docs on computation: automatic alignment for more info.
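
For instance, here is a minimal sketch of name-based broadcasting, reusing the da defined above (the weights array is purely illustrative; it only needs to share the time dimension):

weights = xr.DataArray(np.linspace(0, 1, 100), dims=['time'], coords=[da.time])
weighted = da * weights   # aligned on 'time', broadcast over 'x' and 'y'
print(weighted.dims)      # ('x', 'y', 'time')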

Norman answered 23/6, 2022 at 16:31 Comment(10)
Thanks! I'll read this in detail, it looks great! Is xarray more general than Pandas DataFrame? Or is it the contrary?Collative
sort of, but there are distinct differences. If you think of the pd.Series as the unit of analysis in pandas, and pd.DataFrame as a dictionary of series, then xr.DataArray is the analogous ND-array equivalent of pd.Series, with an unlimited number of indexes, and xr.Dataset is the equivalent dataarray dictionary container.Norman
The thing that catches some people off guard when coming from pandas is that dimensions in xarray need to be orthogonal/independent, as in numpy. In pandas, you might have a MultiIndex with both timestamp and year. Do not make those perpendicular axes in xarray. Instead, your dimension is timestamp; year could be a non-indexing coordinate - essentially metadata about (and indexed by) timestamp. Note also that you can have a pandas.MultiIndex coordinate along a single xarray dimension.Norman
but pandas did deprecate pd.Panel in large part because xarray obviated the need, and the pandas devs thought that any use cases with ND>2 could (and should) be handled in xarray, not pandas. See the third warning in the v0.25.0 release notesNorman
Thanks @MichaelDelgado! So this fully covers requirement (2) from my question. Do you think it also solves requirement (1)? Example: can A[0, :, :] be of a totally different shape than A[50, :, :]?Collative
Also @MichaelDelgado, is it possible to work with 100 GB xarrays if we have much less RAM? I.e. can it lazy-load only certain parts, or is an xarray always fully loaded in memory? See requirement (4) in my question. Thanks!Collative
here is what I want to achieve: afewthingz.com/ndarraydatastoreCollative
Sorry but at this point I think you’re going to need to just start reading docs and trying things. No package that I’m aware of offers super high performance for huge labeled n-d jagged arrays. You’ll have to tweak your workflow a bit and experiment with the great suggestions you’ve gotten here to see what performs best in your situation.Norman
Yes @MichaelDelgado, that's what I'll continue to do in the next days. Thank you very much for all your insights!Collative
fwiw, xarray and tensorflow have tons of docs and examples of how to work with distributed and out-of-core processing. xarray can work with partitioned sparse dask arrays, so that kind of checks a good number of boxes you're looking for, but they're not full jagged arrays. I think that's a tough set of constraints you're looking for. good luck!Norman

As far as I know, you can build a data structure with non-uniform dimensions like the one you want with tf.ragged.constant() as follows:

import numpy as np
import tensorflow as tf

l1 = tf.ragged.constant([[0, 1, 0], [1, 1]])
print(l1)

The main advantage of using the TensorFlow library here is that you can convert your tensors back to NumPy arrays with a simple call: your_tensor.numpy().
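
For instance, a quick sketch using the l1 defined above (the exact nesting of the returned object may vary with the TensorFlow version):

print(l1.numpy())
# roughly: array([array([0, 1, 0]), array([1, 1])], dtype=object)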

Photocathode answered 23/6, 2022 at 16:25 Comment(3)
Thanks! Can you give an example of how you would create an A for which A[0, :, :], ..., A[49, :, :] are of shape (100, 100) and A[50, :, :] of shape 10 000 x 10 000?Collative
@basj Sure, here is one: tf.ragged.constant([[[1]*10]*10, [[1]*100]*100]). I reduced the shape in order to make it understandable when printed in the console, but you can expand it to fit what you need.Photocathode
Nice indeed, tf.ragged.constant seems to be perfect for my question's requirement (1). Do you think it is also possible to satisfy requirement (2) with tf.ragged.constant, i.e. to access an element by an integer label that is not necessarily in 0...n-1?Collative

The sparse package might be useful as well. Example:

import numpy as np
import sparse
s3 = sparse.DOK((1000, 1000, 10, 300000000000), fill_value=np.nan)  # allowing timestamps
                                                                    # up to year 3000!
s3[:500, :200, 0, 202206240936] = np.random.rand(500, 200)
print(s3[:, :, 0, 202206240936])
print(s3[:, :, 0, 202206240936].todense())  # the 500x200 matrix above, completed with NaN
print(s3[:, :, 0, 197001010000].todense())  # only NaN because not defined
print(s3.nbytes)
sparse.save_npz('test.npz', s3)  # here it takes ~2 MB for 100k float64 values,
                                 # which would normally take 0.8 MB dense

Only drawback: I'm not sure how to save a large amount of data (for example 100 GB) in this structure, then the next day open it, modify only a few values here and there, and save it back to disk without having to rewrite the whole 100 GB file... (comments are welcome).

Collative answered 24/6, 2022 at 8:20 Comment(1)
There was some discussion about sparse support in Zarr. This addresses the issue of re-writing only the modified chunks. The only drawback (AFAIU this is temporary, until sparse is implemented) is that individual chunks will be dense (at least until converted to sparse).Bookmark

Sounds like a potential use case for awkward-array. I'm not sure about points 2 and 4 in your wishlist, but consider the following examples.

Numpy-like syntax:

from awkward import Array

a = Array([[1, 2, 3], [4, 5]])

print(a + 1)
# [[2, 3, 4], [5, 6]]

Some support for labelled indexing (for incomplete records):

python_dicts = [
    {"x": 1, "y": 1.1, "z": "one"},
    {"x": 2, "y": 2.2, "z": "two"},
    {"x": 3, "y": 3.3, "z": "three"},
    {"y": 4.4, "z": "four"},
    {"z": "five"},
]
b = Array(python_dicts)

print(b["x"])
# [1, 2, 3, None, None]
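
And here is a rough sketch of the ragged-shape part of requirement (1), assuming a recent awkward version (reduction helpers such as ak.num and ak.mean may differ slightly between major versions):

import awkward as ak
import numpy as np

A = ak.Array([np.zeros((3, 3)), np.ones((5, 5))])  # sub-arrays of different shapes in one container
print(ak.num(A, axis=1))      # [3, 5]
print(ak.mean(A, axis=None))  # overall mean across both ragged sub-arrays
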
Bookmark answered 27/6, 2022 at 6:38 Comment(0)
