Asserting column(s) data type in Pandas

3

30

I'm trying to find a better way to assert the data types of columns in a given pandas DataFrame.

For example:

import pandas as pd
t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer']})

I would like to assert that specific columns in the data frame are numeric. Here's what I have:

numeric_cols = ['a', 'b']  # These will be given
assert all([x in ['int64','float'] for x in [t[y].dtype for y in numeric_cols]])

This last assert line doesn't feel very Pythonic. Maybe it is and I'm just cramming it all into one hard-to-read line. Is there a better way? I would like to write something like:

assert t[numeric_cols].dtype.isnumeric()

I can't seem to find something like that though.

Menjivar answered 19/2, 2015 at 0:15 Comment(0)
52

You could use ptypes.is_numeric_dtype to identify numeric columns, ptypes.is_string_dtype to identify string-like columns, and ptypes.is_datetime64_any_dtype to identify datetime64 columns:

import pandas as pd
import pandas.api.types as ptypes

t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer'],
              'd':pd.date_range('2000-1-1', periods=3)})
cols_to_check = ['a', 'b']

assert all(ptypes.is_numeric_dtype(t[col]) for col in cols_to_check)
# True
assert ptypes.is_string_dtype(t['c'])
# True
assert ptypes.is_datetime64_any_dtype(t['d'])
# True

The pandas.api.types module (which I aliased to ptypes) has both an is_datetime64_any_dtype and an is_datetime64_dtype function. They differ in how they treat timezone-aware array-likes: is_datetime64_any_dtype returns True for them, while is_datetime64_dtype returns False:

In [239]: ptypes.is_datetime64_any_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[239]: True

In [240]: ptypes.is_datetime64_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[240]: False
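
If you run this check in several places, it may help to wrap it so a failure names the offending columns. This is only a minimal sketch: assert_columns_numeric is a hypothetical helper name, not part of pandas, and it relies on nothing beyond ptypes.is_numeric_dtype shown above.

```python
import pandas as pd
import pandas.api.types as ptypes


def assert_columns_numeric(df, cols):
    """Raise AssertionError naming every column in `cols` that is not numeric.

    Hypothetical helper for illustration; not a pandas API.
    """
    bad = [col for col in cols if not ptypes.is_numeric_dtype(df[col])]
    assert not bad, f"non-numeric columns: {bad}"


t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})

# Passes silently because both columns are numeric.
assert_columns_numeric(t, ['a', 'b'])
```

Calling assert_columns_numeric(t, ['a', 'c']) would fail and mention 'c' in the message, which is easier to debug than a bare assert all(...).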
Newel answered 19/2, 2015 at 0:47 Comment(7)
@Mr.F. You're right; thanks. But since I'm going for clarity, not winning code golf, I've changed it to ['a', 'b']. - Newel
The reason I do not want to use for col in 'ab' is because it does not generalize to the case of multi-character column names. Of course, you could say the same about for col in list('ab'). (Sometimes, however, list('ab') is useful where 'ab' is not -- consider pd.DataFrame(..., index=list('ab')) for instance.) In any case, since most people coming to this page are going to have multi-character column names, we might as well write code that generalizes easily to that case. - Newel
Is there anything similar to is_numeric_dtype for strings (or objects, in pandas terminology)? - Coworker
@famargar: You could use ptypes.is_string_dtype. I've edited the post above to show what I mean. - Newel
Thanks, this is great! - Coworker
And what about datetime columns? - Coworker
@famargar: You could use ptypes.is_datetime64_any_dtype. See above. (I found this by perusing dir(ptypes).) - Newel
6

You can do this:

import numpy as np
numeric_dtypes = [np.dtype('int64'), np.dtype('float64')]
# or whatever types you want

assert t[numeric_cols].apply(lambda c: c.dtype).isin(numeric_dtypes).all()
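
A related sketch, assuming any numeric dtype is acceptable rather than only int64 and float64: DataFrame.select_dtypes(include='number') lets pandas decide which columns are numeric, so you don't have to list dtypes by hand. The names actual_numeric and missing are only illustrative.

```python
import pandas as pd

t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})
numeric_cols = ['a', 'b']

# Columns that pandas itself classifies as numeric.
actual_numeric = set(t.select_dtypes(include='number').columns)

# Any requested column that did not survive the selection is not numeric.
missing = set(numeric_cols) - actual_numeric
assert not missing, f"expected numeric dtypes for: {sorted(missing)}"
```

This is a swapped-in technique, not what the answer above does: it trades exact control over the accepted dtypes for brevity.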
Labial answered 19/2, 2015 at 0:24 Comment(0)
0

Here is a simple way to do a Python isinstance check on a column's pandas dtype, where the column holds NumPy datetimes:

isinstance(dfe.dt_column_name.dtype, type(np.dtype('datetime64')))

Note: isinstance also accepts a tuple of types (not a list) as its second argument, so the dtype can be checked against several types at once.

If you're interested in checking a column's data type consistency across rows, then @ely's answer using apply could be a better choice.
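
To illustrate checking against several dtype classes at once, here is a small sketch. It assumes NumPy 1.20 or later, where each kind of dtype has its own class; on older versions type(np.dtype(...)) is just np.dtype and the check would not discriminate. The DataFrame and its column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up example data.
df = pd.DataFrame({
    'when': pd.date_range('2000-01-01', periods=3),
    'value': [0.5, 1.5, 2.5],
    'n': [1, 2, 3],
})

# Assumption: NumPy >= 1.20, so these are two distinct dtype classes.
datetime_cls = type(np.dtype('datetime64'))
float_cls = type(np.dtype('float64'))

# isinstance takes a single class or a tuple of classes; a list raises TypeError.
accepted = (datetime_cls, float_cls)

is_when_ok = isinstance(df['when'].dtype, accepted)    # datetime column
is_value_ok = isinstance(df['value'].dtype, accepted)  # float column
is_n_ok = isinstance(df['n'].dtype, accepted)          # integer column
```

Note that a timezone-aware column has a pandas extension dtype rather than a NumPy one, so it would not match datetime_cls here.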

Tuscan answered 30/1, 2023 at 15:16 Comment(0)
