NumPy: obtain dtype per column

I need to obtain the type for each column to properly preprocess it.

Currently I do this via the following method:

import pandas as pd

# input is of type List[List[Any]]
# but has one type (int, float, str, bool) per column

df = pd.DataFrame(input, columns=key_labels)
column_types = dict(df.dtypes)
matrix = df.values

Since I only use pandas for obtaining the dtypes (per column) and use numpy for everything else, I want to cut pandas from my project.

In summary: is there a way to obtain (specific) dtypes per column from numpy?

Or: is there a fast way to recompute the dtype of an ndarray (after slicing the matrix)?
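For reference, a pandas-free sketch of the same goal (the `data` values and `key_labels` names below are made-up stand-ins for the question's `input` and `key_labels`):

```python
import numpy as np

# Hypothetical stand-ins for the question's `input` and `key_labels`.
data = [[0.5, True, 'hello'], [1.25, False, 'test']]
key_labels = ['a', 'b', 'c']

# Let numpy deduce one dtype per column by converting each column separately.
column_types = {label: np.array(col).dtype
                for label, col in zip(key_labels, zip(*data))}
print(column_types)
```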

Dessert answered 30/11, 2018 at 10:52 Comment(1)
In NumPy, each array has a single dtype. One of the points of pd.DataFrame is the ability to have mixed data types in a single data structure; NumPy does not provide this. If you do df.values, the dtype you get is one that can hold any of the values, which in this case would be object for all of them.Mindoro

It would help if you gave a concrete example, but I'll demonstrate with @jpp's list:

In [509]: L = [[0.5, True, 'hello'], [1.25, False, 'test']]
In [510]: df = pd.DataFrame(L)
In [511]: df
Out[511]: 
      0      1      2
0  0.50   True  hello
1  1.25  False   test
In [512]: df.dtypes
Out[512]: 
0    float64
1       bool
2     object
dtype: object

pandas doesn't like to use string dtypes, so the last column is object.

In [513]: arr = df.values
In [514]: arr
Out[514]: 
array([[0.5, True, 'hello'],
       [1.25, False, 'test']], dtype=object)

So because of the mixed column dtypes, pandas makes the whole thing object. I don't know pandas well enough to know whether you can control the dtype better.

To make a numpy structured array from L, the obvious thing to do is:

In [515]: np.array([tuple(row) for row in L], dtype='f,bool,U10')
Out[515]: 
array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f4'), ('f1', '?'), ('f2', '<U10')])

That answers the question of how to specify a different dtype per 'column'. But keep in mind that this array is 1d, and has fields not columns.

But whether it's possible to deduce or set the dtype automatically, that's trickier. It might be possible to build a recarray from the columns, or use one of the functions in np.lib.recfunctions.
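For example, `np.lib.recfunctions.merge_arrays` can join per-column arrays into one structured array, letting numpy deduce each field's dtype (a sketch):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# One plain array per column, each with its own deduced dtype.
cols = [np.array(col) for col in zip(*L)]

# merge_arrays joins them into a single structured array, one field per column.
merged = rfn.merge_arrays(cols, flatten=True)
print(merged.dtype)
```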

If I use a list 'transpose' I can format each column as a separate numpy array.

In [537]: [np.array(col) for col in zip(*L)]
Out[537]: 
[array([0.5 , 1.25]),
 array([ True, False]),
 array(['hello', 'test'], dtype='<U5')]

Then join them into one array with rec.fromarrays:

In [538]: np.rec.fromarrays([np.array(col) for col in zip(*L)])
Out[538]: 
rec.array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
          dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])

Or I could use genfromtxt to deduce fields from a csv format.

In [526]: np.savetxt('test.txt', np.array(L,object),delimiter=',',fmt='%s')
In [527]: cat test.txt
0.5,True,hello
1.25,False,test

In [529]: data = np.genfromtxt('test.txt',dtype=None,delimiter=',',encoding=None)
In [530]: data
Out[530]: 
array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])
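The temporary file isn't strictly necessary: genfromtxt also accepts an iterable of lines, so the same dtype deduction can run in memory (a sketch):

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# Feed genfromtxt CSV-formatted lines directly; dtype=None deduces each field.
lines = (','.join(str(v) for v in row) for row in L)
data = np.genfromtxt(lines, dtype=None, delimiter=',', encoding=None)
print(data.dtype)
```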
Lumbye answered 30/11, 2018 at 18:41 Comment(1)
I'll look in to this later today, It has been by far the most insightful answer thus far.Dessert

In numpy, an array has the same dtype for all its entries. So no, it's not possible to have a dedicated/fast float dtype in one column and a different dtype in another.

That's the point of pandas: it allows each column to have its own type.
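A quick demonstration of the single-dtype rule (a minimal sketch): converting mixed-type rows yields one dtype for the whole 2d array.

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# Force the generic fallback; the whole array shares exactly one dtype.
arr = np.array(L, dtype=object)
print(arr.dtype)  # object
```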

Sepaloid answered 30/11, 2018 at 10:56 Comment(0)

In order to obtain each column type and use it in your program, you can use Numpy Structured Arrays.

Structured Arrays are a composition of simpler data types organized as a sequence of named fields.

They have a property called dtype which you can use to answer your question.

Note that Numpy also has a “Record Array” or “recarray” data type, that is quite similar to Structured Arrays. But according to this post, Record Arrays are much slower than Structured Arrays and are probably kept for convenience and backward compatibility.

import numpy as np

# Initialize structured array.
df = np.array([(10, 3.14, 'Hello', True),
                 (20, 2.71, 'World', False)],
                dtype=[
                    ("ci", "i4"),
                    ("cf", "f4"),
                    ("cs", "U16"),
                    ("cb", "?")])

# Basic usage.
print(df)
print(np.size(df))
print(df.shape)
print(df["cs"])
print(df["cs"][0])
print(type(df))
print(df.dtype)
print(df.dtype.names)

# Check exact data type.
print(df.dtype["ci"] == "i4")
print(df.dtype["cf"] == "f4")
print(df.dtype["cs"] == "U16")
print(df.dtype["cb"] == "?")

# Check general data type kind.
print(df.dtype["ci"].kind == "i")
print(df.dtype["cf"].kind == "f")
print(df.dtype["cs"].kind == "U")
print(df.dtype["cb"].kind == "b")
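Building on this, the per-field dtype kinds can drive the per-column preprocessing the question asks about. The specific preprocessing rules below (normalize, lowercase, cast) are illustrative assumptions, not from the original:

```python
import numpy as np

df = np.array([(10, 3.14, 'Hello', True),
               (20, 2.71, 'World', False)],
              dtype=[("ci", "i4"), ("cf", "f4"), ("cs", "U16"), ("cb", "?")])

# Dispatch on each field's dtype kind; the rules here are made up for illustration.
for name in df.dtype.names:
    kind = df.dtype[name].kind
    if kind in 'if':       # numeric columns: e.g. standardize
        col = df[name].astype('f8')
        print(name, (col - col.mean()) / col.std())
    elif kind == 'U':      # string columns: e.g. lowercase
        print(name, np.char.lower(df[name]))
    elif kind == 'b':      # boolean columns: e.g. cast to 0/1
        print(name, df[name].astype('i1'))
```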
Pumpernickel answered 21/6, 2023 at 11:50 Comment(0)

Is there a way to obtain (specific) dtypes per column from numpy

No, there isn't. Since your dataframe has mixed types, your NumPy dtype will be object. Such an array is not stored in a contiguous memory block with each column having a fixed dtype. Instead, each element of the 2d array is a pointer to a Python object.

Your question is no different from asking whether you can get the type of each "column" in this list of lists:

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

Since the data in a collection of pointers has no columnar structure, there's no concept of "column dtype". You can test the type of each value for specific indices in each sublist. But this defeats the point of Pandas / NumPy.
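To illustrate that value-by-value inspection (a sketch): collecting the set of Python types seen in each "column" of the list of lists.

```python
L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# The set of Python types observed in each "column" of the nested list.
col_types = [set(type(v) for v in col) for col in zip(*L)]
print(col_types)  # [{float}, {bool}, {str}]
```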

Abutment answered 30/11, 2018 at 10:57 Comment(0)
