Convert Python sequence to NumPy array, filling missing values

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.

import numpy as np

v = [[1], [1, 2]]
np.array(v)
# array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:

np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.

What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?

From my sample sequence v, with 0 as the placeholder, I would like to get something like this:

array([[1, 0], [1, 2]], dtype=int32)
Nidify answered 27/7, 2016 at 17:1 Comment(0)

You can use itertools.zip_longest:

import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out: 
array([[1, 0],
       [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
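
For clarity on why the transpose is needed: zip_longest(*v) walks the lists in parallel and pads the shorter ones with the fill value, so it effectively produces the columns of the result, and the trailing .T flips them back into rows. A small sketch with the sample v from the question:

import itertools
import numpy as np

v = [[1], [1, 2]]
cols = list(itertools.zip_longest(*v, fillvalue=0))
# cols == [(1, 1), (0, 2)]  -- one tuple per output column
np.array(cols).T
# array([[1, 0],
#        [1, 2]])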

Glasper answered 27/7, 2016 at 17:12 Comment(1)
This seems really good when the size variation within the list elements is huge, based on a quick runtime test for a large dataset.Rebus

Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -

def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape,dtype=int)
    out[mask] = np.concatenate(v)
    return out

Sample run

In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

In [28]: boolean_indexing(v)
Out[28]: 
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])

*Please note that this is called almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. Since that part is not very computationally demanding, it should have minimal effect on the total runtime.
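
To make the masking step concrete, here is a small sketch (not part of the original answer) of what mask looks like for the sample v = [[1], [1, 2]] from the question:

import numpy as np

v = [[1], [1, 2]]
lens = np.array([len(item) for item in v])
# lens -> array([1, 2])
mask = lens[:, None] > np.arange(lens.max())
# mask -> array([[ True, False],
#                [ True,  True]])
out = np.zeros(mask.shape, dtype=int)
out[mask] = np.concatenate(v)  # fills only the True cells, row by row
# out -> array([[1, 0],
#               [1, 2]])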

Runtime test

In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, on a relatively large dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

In [45]: v = v*1000

In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop

In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop

In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation

In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

In [50]: v = v*1000

In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop

In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop

In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

In [139]: # Setup inputs
     ...: N = 10000 # Number of elems in list
     ...: maxn = 100 # Max. size of a list element
     ...: lens = np.random.randint(0,maxn,(N))
     ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
     ...: 

In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop

In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop

In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop

To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be taken on a case-by-case basis.

Rebus answered 27/7, 2016 at 17:13 Comment(3)
@ayhan Hmm can't run that on my Python 2 version. Could it be my NumPy version 1.11.1?Rebus
I guess all methods are iterating over v, but as the lists inside v get larger, your method starts to be faster. I tried it with n=10^3, m=10^4 and it was 5 times faster. I have 1.11.1 on Python 3, but the results are very similar to Python 2.7 with numpy 1.10.4Glasper
@ayhan Appreciate the feedback and honesty! ;) Added another case for that :)Rebus

Pandas and its DataFrames deal beautifully with missing data.

import numpy as np
import pandas as pd

v = [[1], [1, 2]]
pd.DataFrame(v).fillna(0).values.astype(np.int32)

# array([[1, 0],
#        [1, 2]], dtype=int32)
Toor answered 27/7, 2016 at 17:10 Comment(1)
This is great for data with less size variation; good solution really!Rebus

Here is a general way:

>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = len(max(v, key=len))
>>> np.hstack(np.insert(v, range(1, len(v)+1), [[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)
Article answered 27/7, 2016 at 17:17 Comment(0)

max_len = max(len(sub_list) for sub_list in v)

result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])

>>> result
array([[1, 0],
       [1, 2]])

>>> type(result)
numpy.ndarray
Jello answered 27/7, 2016 at 17:13 Comment(0)

If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:

import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)

This creates a tensor padded with 0. For a deeper example:

w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)
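
If you need a NumPy array rather than a TensorFlow tensor, the padded result can be converted with .numpy() (a minimal sketch, assuming eager execution):

padded_v.numpy()
# array([[1, 0],
#        [1, 2]], dtype=int32)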
Typesetting answered 2/7, 2021 at 12:51 Comment(0)

You can convert to a pandas DataFrame first, and then convert that to a NumPy array:

import pandas as pd

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

df = pd.DataFrame(ll)
print(df)
#    0  1    2    3
# 0  1  2  3.0  NaN
# 1  4  5  NaN  NaN
# 2  6  7  8.0  9.0

npl = df.to_numpy()
print(npl)

# [[ 1.  2.  3. nan]
#  [ 4.  5. nan nan]
#  [ 6.  7.  8.  9.]]
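
Note that the missing entries come back as NaN and the dtype is promoted to float. To get the dense int32 array with a 0 placeholder that the question asks for, fill the NaNs before converting; a minimal sketch:

npl = df.fillna(0).to_numpy(dtype="int32")
print(npl)

# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]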
Mairemaise answered 13/6, 2020 at 16:6 Comment(0)

I was getting a NumPy broadcast error with Alexander's answer, so I added a small variation using numpy.pad:

pad = len(max(X, key=len))

result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
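
A quick usage sketch with the question's sample (assuming X is the list of lists to pad):

import numpy as np

X = [[1], [1, 2]]
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad - len(i)), 'constant') for i in X])
# array([[1, 0],
#        [1, 2]])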
Apocope answered 10/7, 2020 at 12:6 Comment(0)
