Python: Create a NumPy structured array from two columns in a DataFrame

3

7

How do you create a structured array from two columns in a DataFrame? I tried this:

df = pd.DataFrame(data=[[1,2],[10,20]], columns=['a','b'])
df

    a   b
0   1   2
1   10  20

x = np.array([([val for val in list(df['a'])],
               [val for val in list(df['b'])])])

But this gives me this:

array([[[ 1, 10],
        [ 2, 20]]])

But I wanted this:

[(1,2),(10,20)]

Thanks!

Lactalbumin answered 11/7, 2018 at 7:48 Comment(2)
Because a package that I am using only takes input as a structured array. Why is this important?Lactalbumin
Because there might be no need to create a list of tuples at all, and knowing the goal also helps in choosing how to create that list of tuples.Carvalho
12

There are a couple of methods. Note that structured and record arrays may be slower and offer less functionality than regular NumPy arrays.

record array

You can use pd.DataFrame.to_records with index=False. Technically, this is a record array, but for many purposes this will be sufficient.

res1 = df.to_records(index=False)

res1

rec.array([(1, 2), (10, 20)], 
          dtype=[('a', '<i8'), ('b', '<i8')])
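If the DataFrame has more columns than the two you need, one option is to select them before converting. This is a minimal sketch, assuming a hypothetical extra column c that is not in the question; np.asarray then returns a plain ndarray rather than a recarray:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one extra column 'c' that should be left out.
df = pd.DataFrame(data=[[1, 2, 3], [10, 20, 30]], columns=['a', 'b', 'c'])

# Select only the two wanted columns, then convert to a record array.
rec = df[['a', 'b']].to_records(index=False)

# np.asarray drops the recarray subclass and returns a base ndarray.
plain = np.asarray(rec)

print(plain.dtype.names)
print(type(plain))
```

The field names come from the column labels, so renaming the columns first is one way to control them.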

structured array

You can also construct a structured array manually: convert each row to a tuple, then pass a list of (name, dtype) tuples as the dtype parameter.

s = df.dtypes
res2 = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))

res2

array([(1, 2), (10, 20)], 
      dtype=[('a', '<i8'), ('b', '<i8')])
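When the fields need types that differ from the DataFrame's dtypes (for example a boolean indicator next to a float), one possible approach is to allocate the array with an explicit dtype and fill it column by column. The column names event and time and the sample values below are assumptions for illustration only; filling per column also avoids df.values upcasting mixed columns to a common type:

```python
import numpy as np
import pandas as pd

# Hypothetical data: an integer 0/1 indicator and a float column.
df = pd.DataFrame({'event': [1, 0, 1], 'time': [5.0, 12.5, 7.25]})

# Assumed target layout: boolean field followed by a float64 field.
dtype = [('event', '?'), ('time', '<f8')]

# Allocate the structured array, then copy each column into its field.
out = np.empty(len(df), dtype=dtype)
for name in out.dtype.names:
    out[name] = df[name].to_numpy()  # values are cast to the field's type

print(out)
print(out.dtype)
```

Check the target package's documentation for the exact field names and types it expects; the layout here is only a sketch.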

What's the difference?

Very little. recarray is a subclass of ndarray, the regular NumPy array type. On the other hand, the structured array in the second example is of type ndarray.

type(res1)                    # numpy.recarray
isinstance(res1, np.ndarray)  # True
type(res2)                    # numpy.ndarray

The main difference is that record arrays support attribute lookup on field names, while structured arrays raise AttributeError:

res1.a
array([ 1, 10], dtype=int64)

print(res2.a)
AttributeError: 'numpy.ndarray' object has no attribute 'a'
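Both kinds of array support field lookup by key, and you can switch between them without copying the data. A short sketch, rebuilding res2 by hand so it runs on its own:

```python
import numpy as np

# Same contents as res2 above, built directly for a self-contained example.
res2 = np.array([(1, 2), (10, 20)], dtype=[('a', '<i8'), ('b', '<i8')])

# Key-based field access works on a plain structured array.
col_a = res2['a']

# Viewing as np.recarray enables attribute access on the same data.
as_rec = res2.view(np.recarray)
col_b = as_rec.b

# np.asarray goes back to a base ndarray.
back = np.asarray(as_rec)

print(col_a, col_b, type(back))
```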

Related: NumPy “record array” or “structured array” or “recarray”

Scrubber answered 11/7, 2018 at 8:23 Comment(0)
1

Use a list comprehension to convert the nested lists to tuples:

print ([tuple(x) for x in df.values.tolist()])
[(1, 2), (10, 20)]

Detail:

print (df.values.tolist())
[[1, 2], [10, 20]]

EDIT: You can convert with to_records and then pass the result to np.asarray (check link):

df = pd.DataFrame(data=[[True, 1,2],[False, 10,20]], columns=['a','b','c'])
print (df)
       a   b   c
0   True   1   2
1  False  10  20

print (np.asarray(df.to_records(index=False)))
[( True,  1,  2) (False, 10, 20)]
Essen answered 11/7, 2018 at 7:51 Comment(3)
Neither are numpy structured arrays. Is it possible to do this?Lactalbumin
@KimO - Can you explain more?Essen
Yes. docs.scipy.org/doc/numpy/user/basics.rec.html The result should be: array([(x,y), (x2,y2)])Lactalbumin
0

Here's a one-liner:

list(df.apply(lambda x: tuple(x), axis=1))

or

df.apply(lambda x: tuple(x), axis=1).values
Balaklava answered 11/7, 2018 at 8:7 Comment(4)
This is not a numpy structured array.. is that possible?Lactalbumin
edited it, is the second version what you are looking for?Balaklava
YES! Is there a way to control the types of the fields? For example, if the DataFrame has two columns and I want the first to turn into a "binary class event indicator"? As explained here: scikit-survival.readthedocs.io/en/latest/generated/… Search for "structured array". So "bool" type.Lactalbumin
I strongly recommend you don't use object dtype for integers, even with structured arrays.Scrubber
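To see why the apply-based result is not a structured array, it may help to compare dtypes. A small sketch using the question's DataFrame; the variable names obj and structured are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])

# Row-wise tuples: a 1-D object array whose elements are Python tuples.
obj = df.apply(lambda x: tuple(x), axis=1).values

# Structured array with named, typed fields.
structured = np.asarray(df.to_records(index=False))

print(obj.dtype, obj.dtype.names)
print(structured.dtype.names)
```

An object array has no named fields (dtype.names is None), so code that looks fields up by name will generally not accept it.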
