Is there an algorithm to detect the data type of each column of a file or DataFrame? The challenge is to suggest a data type even when the data contains wrong, missing, or unnormalized values. I want to detect the data types named in the keys of the dictionary below. My first try was messytables, but the results are really bad unless the data is normalized beforehand. So maybe there is an algorithm that gives better type suggestions, or a way to normalize the data without prior knowledge of it. The result should match the keys of the dataframe.
import io

import pandas as pd
from messytables import CSVTableSet, type_guess
data = {
"decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
"dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
"dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
"dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
"strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
"integers": ["N.A.", "", "1234", "123123", "2222", "0"],
"time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
"datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
"boolean": ["N.A.", "", "True", "False", "False", "False"]
}
df = pd.DataFrame(data)
towrite = io.BytesIO()
df.to_csv(towrite, index=False)  # write to BytesIO buffer
towrite.seek(0)
rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types) # [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
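For comparison, here is a minimal coercion-based sketch using only pandas: drop placeholder values first, then try each type in turn with `errors="coerce"`. The placeholder set (`"N.A"`, `"N.A."`, `""`) and the type-check order are assumptions for this sample data, not a general solution; in particular, mixed date formats in one column (as in `dates1`/`dates2`) will still fall through to `string` because pandas infers one format per column.

```python
import pandas as pd

# Assumed placeholder tokens for missing values in this sample data.
PLACEHOLDERS = {"", "N.A", "N.A."}

def guess_type(series: pd.Series) -> str:
    # Remove placeholders so they don't break the coercion checks.
    s = series[~series.isin(PLACEHOLDERS)].str.strip()
    if s.empty:
        return "string"
    if s.isin({"True", "False"}).all():
        return "boolean"
    # Strip thousands separators, then try numeric coercion.
    numeric = pd.to_numeric(s.str.replace(",", "", regex=False), errors="coerce")
    if numeric.notna().all():
        return "integer" if (numeric % 1 == 0).all() else "decimal"
    if s.str.fullmatch(r"\d{1,2}:\d{2}:\d{2}").all():
        return "time"
    # pandas infers the date format from the first value of the column.
    parsed = pd.to_datetime(s, errors="coerce", dayfirst=False)
    if parsed.notna().all():
        has_time = (parsed.dt.hour != 0).any() or (parsed.dt.minute != 0).any()
        return "datetime" if has_time else "date"
    return "string"

df2 = pd.DataFrame({
    "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
    "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
    "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
    "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
    "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
    "boolean": ["N.A.", "", "True", "False", "False", "False"],
})
print({col: guess_type(df2[col]) for col in df2.columns})
```

This gets the single-format columns right, but it is only a heuristic: it does not solve the mixed-format date columns, which is the part I am mainly asking about.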