It would help if you gave a concrete example, but I'll demonstrate with @jpp's
list:
In [509]: L = [[0.5, True, 'hello'], [1.25, False, 'test']]
In [510]: df = pd.DataFrame(L)
In [511]: df
Out[511]:
      0      1      2
0  0.50   True  hello
1  1.25  False   test
In [512]: df.dtypes
Out[512]:
0    float64
1       bool
2     object
dtype: object
pandas doesn't like to use string dtypes, so the last column is object.
In [513]: arr = df.values
In [514]: arr
Out[514]:
array([[0.5, True, 'hello'],
       [1.25, False, 'test']], dtype=object)
So because of the mix in column dtypes, pandas is making the whole thing object. I don't know pandas well enough to know if you can control the dtype better.
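One option that may be worth trying on the pandas side is DataFrame.to_records, which returns a 1d record array with one field per column rather than a 2d object array. This is only a sketch: the column names x, flag and label are made up for the example, and the string column should still come out as object.

```python
import numpy as np
import pandas as pd

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# Hypothetical column names, chosen only for this example.
df = pd.DataFrame(L, columns=['x', 'flag', 'label'])

# One field per column, so each field keeps its own column's dtype
# instead of the whole array being upcast to object.
rec = df.to_records(index=False)

print(rec.dtype)
print(rec['x'])
```

The float and bool columns keep their dtypes this way; only the string column stays object.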
To make a numpy structured array from L, the obvious thing to do is:
In [515]: np.array([tuple(row) for row in L], dtype='f,bool,U10')
Out[515]:
array([(0.5 , True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f4'), ('f1', '?'), ('f2', '<U10')])
That answers the question of how to specify a different dtype per 'column'. But keep in mind that this array is 1d, and has fields, not columns.
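To make the fields versus columns distinction concrete, here is a small sketch using the same dtype string as above. The default field names f0, f1, f2 are the ones numpy assigns; 2d-style column indexing is expected to fail because there is only one dimension.

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]
arr = np.array([tuple(row) for row in L], dtype='f,bool,U10')

print(arr.shape)    # a single dimension: one entry per record
print(arr['f0'])    # a whole field, selected by name
print(arr[0])       # a single record

# Indexing as if there were a column axis does not work on a 1d array.
try:
    arr[:, 0]
    column_indexing_failed = False
except IndexError:
    column_indexing_failed = True
print(column_indexing_failed)
```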
But whether it's possible to deduce or set the dtype automatically, that's trickier. It might be possible to build a recarray from the columns, or use one of the functions in np.lib.recfunctions.
If I use a list 'transpose' I can format each column as a separate numpy array.
In [537]: [np.array(col) for col in zip(*L)]
Out[537]:
[array([0.5 , 1.25]),
 array([ True, False]),
 array(['hello', 'test'], dtype='<U5')]
Then join them into one array with rec.fromarrays:
In [538]: np.rec.fromarrays([np.array(col) for col in zip(*L)])
Out[538]:
rec.array([(0.5 , True, 'hello'), (1.25, False, 'test')],
          dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])
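If the default f0, f1, f2 names aren't wanted, fromarrays also takes a names argument while still deducing each field's dtype from its column array. A minimal sketch, with x, flag and label as made-up names:

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# One array per column; numpy picks each dtype on its own.
cols = [np.array(col) for col in zip(*L)]

# The names are hypothetical; the dtypes still come from the column arrays.
rec = np.rec.fromarrays(cols, names='x,flag,label')

print(rec.dtype)
print(rec.x)        # a recarray also exposes fields as attributes
print(rec['label'])
```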
Or I could use genfromtxt to deduce fields from a csv format.
In [526]: np.savetxt('test.txt', np.array(L,object),delimiter=',',fmt='%s')
In [527]: cat test.txt
0.5,True,hello
1.25,False,test
In [529]: data = np.genfromtxt('test.txt',dtype=None,delimiter=',',encoding=None)
In [530]: data
Out[530]:
array([(0.5 , True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])
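The round trip through a file isn't strictly needed, since genfromtxt also accepts a file-like object. A sketch of the same idea done in memory, assuming the values contain no commas or newlines of their own:

```python
import io
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# Build the csv text in memory instead of writing test.txt.
text = '\n'.join(','.join(str(v) for v in row) for row in L)

# dtype=None asks genfromtxt to deduce a dtype for each field.
data = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=',', encoding=None)

print(data.dtype)
print(data['f2'])
```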
pd.DataFrame is the ability to have mixed data types in a single data structure. This functionality is not provided by NumPy. If you do df.values, the dtype you will get is one that can hold any of the values, which in this case would be np.object, for all the values. – Mindoro