Data Type Recognition/Guessing of CSV data in python

Asked 26/7, 2011 at 3:17 Answered 7/8, 2018 at 18:4

Solved python algorithm csv schema heuristics

My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> regognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a python package or an algorithm that can detect either

the schema of a CSV file, or even better
the data type of an individual column as an array

Method for guessing type of data represented currently represented as strings goes in a similar direction. I'm worried about performance, though, since I'm possibly dealing with many large spreadsheets (where the data stems from)

Plash answered 26/7, 2011 at 3:17 Comment(0)

You may be interested in this python library which does exactly this kind of type guessing on CSVs and XLS files for you:

It happily scales to very large files, to streaming data off the internet etc.

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

Dykes answered 3/8, 2013 at 20:15 Comment(0)

After putting some thought into it, this is how I would design the algorithm myself:

For performance reasons: take a sample for each column (say, 1%)
run a regex match for each cell in the sample, checking for the data type
Choose the appropriate data type for the column based on the frequency distribution

The two questions that arise:

What's a sufficient sample size? For small data sets? For large data sets?
What's a high enough threshold for selecting a data type based on the frequency distribution?

Plash answered 26/7, 2011 at 18:16 Comment(0)

Maybe csvsql could be useful here? No idea how efficient it is but definitely gets the job done for generating sql create table statements out of csvs.

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt

Benildas answered 7/8, 2018 at 18:4 Comment(0)

You could try a pre parse using regex. For example:

import re
pattern = re.compile(r'^-?\d+.{1}\d+$')
data = '123.42'
print pattern.match(data) # ----> object
data2 = 'NOT123.42GONNA31.4HAPPEN'
print pattern.match(data2) # ----> None

This way you could do a dictionary of regex and try each of them until you find a match

myregex = {int: r'^-?\d+$', float: r'^\d+.{1}\d+$', ....}

for key, reg in myregex.items():
    to_del = []
    for index, data in enumerate(arr1):
        if re.match(reg,data):
            d = key(data) # You will need to insert data differently depending on function
            ....#---> do something 
            to_del.append(data) # ---> delete this when you can from arr1

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

Bundle answered 26/7, 2011 at 4:50 Comment(3)

Remember that re.match matches at the start of the string, so the ^ is not necessary. (re.search matches anywhere.) – Trilemma 26/7, 2011 at 5:21

I considered regex expressions myself. If I could bank on the fact that the data was clean, this would probably be the way to go. Although I could still see a performance issue. – Plash 26/7, 2011 at 18:9

the idea is to take what is left in arr1 and log it to figure out where that data should fit or if you consider it as garbage – Bundle 26/7, 2011 at 18:26

I solved the same problem in c#. This is how I built the sample set:
For every column in the CSV, I selected the row with the longest value, as well as the row with the shortest value.
I then built an array with the 1st 50 non-empty rows.
So my samples had at least 0 and at most 50 rows which covered the whole range in a column.
After that, I would try to parse from widest definition to narrowest:

if (value is String) then thisType = String;

if (value is DateTime) then thisType is DateTime;

if (value is Decimal) then thisType is Decimal;

if (value is Integer) then thisType is Integer;

if (value is Boolean) then thisType is Boolean;

I use TryParse in C#, but I am sure other languages would have similar methods to use.

Anoint answered 8/2, 2012 at 15:17 Comment(0)

Recommended topics

Hot tags