Guessing data type for dataframe
Is there an algorithm to detect the data type of each column of a file or dataframe? The challenge is to suggest a data type even when the data contains wrong, missing, or unnormalized values. I want to detect the data types named in the dict keys below. My first try was messytables, but its results are really bad without normalizing the data first. So is there an algorithm that gives better type suggestions, or a way to normalize the data without knowing it in advance? The result should match the keys of the dataframe.

import io

import pandas as pd
from messytables import CSVTableSet, type_guess

data = {
  "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
  "dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
  "dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
  "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
  "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
  "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
  "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
  "datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
  "boolean": ["N.A.", "", "True", "False", "False", "False"]
}
df = pd.DataFrame(data)

towrite = io.BytesIO()
df.to_csv(towrite)  # write CSV to the in-memory buffer (the index becomes an extra first column)
towrite.seek(0)

rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types)
# [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
# The first Integer is the index column written by to_csv
Wreckful answered 14/7, 2022 at 8:58
Here is my take on your interesting question.

With the dataframe you provided, here is one way to do it:

# For each main type, define a helper that counts how many values in the
# given column can be parsed as that type
helpers = {
    # Digit strings containing a decimal point (commas treated as thousands separators)
    "float": lambda df, col: df[col]
    .apply(lambda x: x.replace(",", "").replace(".", "", 1).isdigit() and "." in x)
    .sum(),
    # Pure digit strings
    "integer": lambda df, col: df[col].apply(lambda x: x.isdigit()).sum(),
    # Values pandas can parse as a datetime (infer_datetime_format is
    # deprecated in pandas 2.x, so it is omitted here)
    "datetime": lambda df, col: pd.to_datetime(df[col], errors="coerce")
    .notna()
    .sum(),
    # Literal "True"/"False" strings
    "bool": lambda df, col: df[col].isin(["True", "False"]).sum(),
}
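As a quick sanity check, each counting rule can be tried on a single column. This standalone sketch inlines the "integers" column from the question so it runs on its own:

```python
import pandas as pd

# The "integers" column from the question, reproduced so the snippet is self-contained
df = pd.DataFrame({"integers": ["N.A.", "", "1234", "123123", "2222", "0"]})

# Count the values that are pure digit strings, as the "integer" helper does
count = df["integers"].apply(lambda x: x.isdigit()).sum()
print(count)  # 4 of the 6 values parse as integers
```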

# For each column, pick the type whose helper matched the most values;
# fall back to "string" when nothing matched
df_dtypes = {}
for col in df.columns:
    results = {key: helper(df, col) for key, helper in helpers.items()}
    best_result = max(results, key=results.get)
    df_dtypes[col] = best_result if max(results.values()) else "string"
print(df_dtypes)
# Output
{
    "decimals": "float",
    "dates1": "datetime",
    "dates2": "datetime",
    "dates3": "datetime",
    "strings": "string",
    "integers": "integer",
    "time": "datetime",
    "datetime": "datetime",
    "boolean": "bool",
}
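A possible follow-up (a sketch, not part of the answer above): once `df_dtypes` is known, each column can be cast to its detected type. The `casters` mapping below is a hypothetical helper, shown here with a small inline dataframe so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "integers": ["N.A.", "", "1234", "0"],
    "decimals": ["N.A", "", "111.00", "11,111.34"],
    "boolean": ["N.A.", "", "True", "False"],
})
# As produced by the detection loop above
df_dtypes = {"integers": "integer", "decimals": "float", "boolean": "bool"}

# Hypothetical casting functions; values that do not parse become NA
casters = {
    "integer": lambda s: pd.to_numeric(s, errors="coerce").astype("Int64"),
    "float": lambda s: pd.to_numeric(s.str.replace(",", "", regex=False), errors="coerce"),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
    "bool": lambda s: s.map({"True": True, "False": False}),
}

for col, kind in df_dtypes.items():
    if kind in casters:  # "string" columns stay as they are
        df[col] = casters[kind](df[col])

print(df.dtypes)  # integers: Int64, decimals: float64, boolean: object
```

Using nullable dtypes (`Int64`) keeps the NA markers produced by the "N.A." and empty entries without forcing the whole column to float.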
Posticous answered 16/7, 2022 at 17:9

© 2022 - 2024 — McMap. All rights reserved.