I put together this function to infer the data type of the values in a list:
from typing import List, Optional

import numpy as np


def infer_dtypes(values: List, sample_size: Optional[int] = 300, stop_after: Optional[int] = 300):
    """
    Infers the data type by randomly sampling from a list. Values are explicitly converted to string before checking.

    Args:
        values (list): A list to infer data types from.
        sample_size (int, optional): The number of values to sample from the list (with replacement). Equal to len(values) if set to None. Defaults to 300.
        stop_after (int, optional): The maximum number of non-empty values needed for the test. Equal to sample_size if set to None. Defaults to 300.

    Returns:
        str: The inferred data type ('int', 'float', 'bool', 'str', 'mixed', 'empty').
    """
    # np.random.choice raises on an empty input, so handle that case up front
    if len(values) == 0:
        return 'empty'

    found = 0  # bit flags: 1 = int, 2 = float, 4 = bool, 8 = str
    non_empty_count = 0
    sample_size = sample_size if sample_size is not None else len(values)
    stop_after = stop_after if stop_after is not None else sample_size

    # np.random.choice samples with replacement by default
    for v in np.random.choice(values, sample_size):
        v = str(v)
        if v != '':
            non_empty_count += 1
            if non_empty_count > stop_after:
                break
            try:
                int(v)
                found |= 1
            except ValueError:
                try:
                    float(v)
                    found |= 2
                except ValueError:
                    if v.lower() in ['true', 'false']:
                        found |= 4
                    else:
                        found |= 8

    # Check if the data is mixed (more than one flag set)
    if bin(found).count('1') > 1:
        return 'mixed'

    if found & 8:
        return 'str'
    elif found & 4:
        return 'bool'
    elif found & 2:
        return 'float'
    elif found & 1:
        return 'int'
    else:
        return 'empty'
Produces:
infer_dtypes(['', '', '1', '2', '3', '4', '5']) # int
infer_dtypes(['', '', '1.0', '2.0', '', '3.0', '4.4', '5.0']) # float
infer_dtypes(['', '', 'True', 'False', '', '', 'False', 'True']) # bool
infer_dtypes(['', '', 'never', 'gonna', '', '', 'give', '']) # str
infer_dtypes(['', '', 'never', '', '5', 'True', '5.2', '']) # mixed
infer_dtypes(['', '', '', '', '', '', '', '']) # empty
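The bit-flag bookkeeping can also be tried on its own, without the random sampling. Below is a minimal deterministic sketch of that idea; `classify` and `combine` are hypothetical helper names used only for illustration, not part of the function above:

```python
# Bit flags, mirroring the values used in infer_dtypes.
INT, FLOAT, BOOL, STR = 1, 2, 4, 8


def classify(value):
    """Return the flag for one value, or 0 if its string form is empty."""
    text = str(value)
    if text == '':
        return 0
    try:
        int(text)
        return INT
    except ValueError:
        pass
    try:
        float(text)
        return FLOAT
    except ValueError:
        pass
    return BOOL if text.lower() in ('true', 'false') else STR


def combine(values):
    """OR the per-value flags together and name the result."""
    found = 0
    for value in values:
        found |= classify(value)
    # More than one bit set means more than one kind of value was seen.
    if bin(found).count('1') > 1:
        return 'mixed'
    names = {INT: 'int', FLOAT: 'float', BOOL: 'bool', STR: 'str'}
    return names.get(found, 'empty')


print(combine(['', '1', '2']))    # int
print(combine(['', '1', '2.5']))  # mixed
```

Note that a list holding both '1' and '2.5' comes out as 'mixed' rather than 'float', because ints and floats set different bits.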
Rationale, feel free to skip this:
I wrote this function because pandas' df.convert_dtypes, df.infer_objects and pd.to_numeric currently don't work nicely on columns that contain empty strings. This can be worked around (source 1, source 2) if all of the DataFrame's columns share the same data type: for example, if we know it only holds floats, we can replace '' with np.nan and then infer. However, for a DataFrame with mixed column types (strings, floats, ints), replacing '' with np.nan wouldn't work. This function helps solve the issue by running:
# Each row of df.T.values is one column of df; nulls become empty strings
values = np.where(pd.isnull(df.T.values), '', df.T.values)
for column_values in values:
    infer_dtypes(column_values)
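To see what that preprocessing step hands to the function, here is a small sketch. The DataFrame and its column names are made up for illustration, and it assumes a frame with mixed column types so that the transposed array has object dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with mixed column types and missing values.
df = pd.DataFrame({
    'name': ['never', None, 'gonna'],
    'score': [1.5, np.nan, 3.0],
    'count': [1, 2, 3],
})

# Transpose so that each row of the array is one column of df,
# then replace nulls with empty strings.
values = np.where(pd.isnull(df.T.values), '', df.T.values)

# Each row can now be passed to a per-column type check.
for column_name, column_values in zip(df.columns, values):
    print(column_name, column_values.tolist())
```

Each row of `values` lines up with one entry of `df.columns`, so the inferred type can be mapped back to its column name.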
See this GitHub Gist for a full example. Hope it helps!