Guessing data type for dataframe
Is there an algorithm to detect the data type of each column of a file or dataframe? The challenge is to suggest a data type even when the data contains wrong, missing, or unnormalized values. I want to detect the data types named in the dict keys below. My first try was messytables, but its results are really bad without normalizing the data first. So is there an algorithm that gives better type suggestions, or a way to normalize the data without knowing it in advance? The result should match the keys of the dataframe.

import io

import pandas as pd
from messytables import CSVTableSet, type_guess

data = {
  "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
  "dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
  "dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
  "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
  "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
  "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
  "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
  "datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
  "boolean": ["N.A.", "", "True", "False", "False", "False"]
}
df = pd.DataFrame(data)

towrite = io.BytesIO()
df.to_csv(towrite)  # write CSV to the in-memory buffer (the index becomes an extra first column)
towrite.seek(0)

rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types)
# [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
# The first Integer is the index column written by to_csv
Wreckful answered 14/7, 2022 at 8:58
Here is my take on your interesting question.

With the dataframe you provided, here is one way to do it:

# For each main type, define a helper that counts how many values in the
# given column can be parsed as that type
helpers = {
    # Digit strings containing a decimal point (commas treated as thousands separators)
    "float": lambda df, col: df[col]
    .apply(lambda x: x.replace(",", "").replace(".", "", 1).isdigit() and "." in x)
    .sum(),
    # Pure digit strings
    "integer": lambda df, col: df[col].apply(lambda x: x.isdigit()).sum(),
    # Values pandas can parse as a datetime (infer_datetime_format is
    # deprecated in pandas 2.x, so it is omitted here)
    "datetime": lambda df, col: pd.to_datetime(df[col], errors="coerce")
    .notna()
    .sum(),
    # Literal "True"/"False" strings
    "bool": lambda df, col: df[col].isin(["True", "False"]).sum(),
}
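As a quick sanity check, each counting rule can be tried on a single column. This standalone sketch inlines the "integers" column from the question so it runs on its own:

```python
import pandas as pd

# The "integers" column from the question, reproduced so the snippet is self-contained
df = pd.DataFrame({"integers": ["N.A.", "", "1234", "123123", "2222", "0"]})

# Count the values that are pure digit strings, as the "integer" helper does
count = df["integers"].apply(lambda x: x.isdigit()).sum()
print(count)  # 4 of the 6 values parse as integers
```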

# For each column, pick the type whose helper matched the most values;
# fall back to "string" when nothing matched
df_dtypes = {}
for col in df.columns:
    results = {key: helper(df, col) for key, helper in helpers.items()}
    best_result = max(results, key=results.get)
    df_dtypes[col] = best_result if max(results.values()) else "string"
print(df_dtypes)
# Output
{
    "decimals": "float",
    "dates1": "datetime",
    "dates2": "datetime",
    "dates3": "datetime",
    "strings": "string",
    "integers": "integer",
    "time": "datetime",
    "datetime": "datetime",
    "boolean": "bool",
}
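A possible follow-up (a sketch, not part of the answer above): once `df_dtypes` is known, each column can be cast to its detected type. The `casters` mapping below is a hypothetical helper, shown here with a small inline dataframe so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "integers": ["N.A.", "", "1234", "0"],
    "decimals": ["N.A", "", "111.00", "11,111.34"],
    "boolean": ["N.A.", "", "True", "False"],
})
# As produced by the detection loop above
df_dtypes = {"integers": "integer", "decimals": "float", "boolean": "bool"}

# Hypothetical casting functions; values that do not parse become NA
casters = {
    "integer": lambda s: pd.to_numeric(s, errors="coerce").astype("Int64"),
    "float": lambda s: pd.to_numeric(s.str.replace(",", "", regex=False), errors="coerce"),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
    "bool": lambda s: s.map({"True": True, "False": False}),
}

for col, kind in df_dtypes.items():
    if kind in casters:  # "string" columns stay as they are
        df[col] = casters[kind](df[col])

print(df.dtypes)  # integers: Int64, decimals: float64, boolean: object
```

Using nullable dtypes (`Int64`) keeps the NA markers produced by the "N.A." and empty entries without forcing the whole column to float.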
Posticous answered 16/7, 2022 at 17:9

© 2022 - 2024 — McMap. All rights reserved.