Get datatype from values passed as string
Asked Answered
A

5

13

I am writing a framework that will connect to many different data source types and return values from these sources. The easy ones are SQL, Access and Oracle. The tougher ones are Sharepoint, CSV.

If I return values from text based sources, I would like to determine the datatype of the data.

Since a CSV is all text, there is no metadata to interrogate, I would need to parse the data somehow to determine the data type.

Example:

List of "true", "true", "false", "false" would be boolean
List of "1", "0", "1", "0" would be boolean
List of "1", "4", "-10", "500" would be integer
List of "15.2", "2015.5896", "1.0245", "500" would be double
List of "2001/01/01", "2010/05/29 12:00", "1989/12/25 10:34:21" would be datetime

It is based on https://mcmap.net/q/903196/-c-doubt-finding-the-datatype/606381#606381

object ParseString(string str)
{

   Int32 intValue;
   Int64 bigintValue;
   double doubleValue;
   bool boolValue;
   DateTime dateValue;        

// Place checks higher in if-else statement to give higher priority to type.

if (Int32.TryParse(str, out intValue))
    return intValue;
else if (Int64.TryParse(str, out bigintValue))
    return bigintValue;
else if (double.TryParse(str, out doubleValue))
    return doubleValue;       
else if (bool.TryParse(str, out boolValue))
    return boolValue;
else if (DateTime.TryParse(str, out dateValue))
    return dateValue;
else return str;

}

Edit: I only need to cater for the following:

BIT  
DATETIME  
INT  
NVARCHAR(255)  
NVARCHAR(MAX)  
BIGINT  
DECIMAL(36, 17)  

Can you see any possible improvement to the priority?

Auspicate answered 15/3, 2011 at 12:28 Comment(7)
For the record, Boolean.TryParse will fail for 1 or 0, it will convert only true or false.Libra
"Since a CSV is all text, there is no metadata to interrogate" - not necessarily true. If you have control of the CSV you could make the first row describe the datatypes of the various filed positions or if it varies by row then you could have a field at the beginning of the row or similar. It may be this isn't possible but I would have thought that for any given CSV you should have something somewhere that tells you what the fields are...Monitory
No control over the CSV. Thanks for your input though!Auspicate
@Dimi what kind of "modern" solution would you expect?Art
@Art Something is better that marked like an answer here. Something is more compact for example.Bodgie
@Dimi I see answer completely duplicates question, plus there is some GetColumnType function. You want alternative to that GetColumnType?Art
@Art I need C# types. Thank you.Bodgie
A
22

I've come up with the following solution which works:

enum dataType
    {
        System_Boolean = 0,
        System_Int32 = 1,
        System_Int64 = 2,
        System_Double = 3,
        System_DateTime = 4,
        System_String = 5
    }

    private dataType ParseString(string str)
    {

        bool boolValue;
        Int32 intValue;
        Int64 bigintValue;
        double doubleValue;
        DateTime dateValue;

        // Place checks higher in if-else statement to give higher priority to type.

        if (bool.TryParse(str, out boolValue))
            return dataType.System_Boolean;
        else if (Int32.TryParse(str, out intValue))
            return dataType.System_Int32;
        else if (Int64.TryParse(str, out bigintValue))
            return dataType.System_Int64;
        else if (double.TryParse(str, out doubleValue))
            return dataType.System_Double;
        else if (DateTime.TryParse(str, out dateValue))
            return dataType.System_DateTime;
        else return dataType.System_String;

    }


    /// <summary>
    /// Gets the datatype for the Datacolumn column
    /// </summary>
    /// <param name="column">Datacolumn to get datatype of</param>
    /// <param name="dt">DataTable to get datatype from</param>
    /// <param name="colSize">ref value to return size for string type</param>
    /// <returns></returns>
    public Type GetColumnType(DataColumn column, DataTable dt, ref int colSize)
    {

        Type T;
        DataView dv = new DataView(dt);
        //get smallest and largest values
        string colName = column.ColumnName;

        dv.RowFilter = "[" + colName + "] = MIN([" + colName + "])";
        DataTable dtRange = dv.ToTable();
        string strMinValue = dtRange.Rows[0][column.ColumnName].ToString();
        int minValueLevel = (int)ParseString(strMinValue);

        dv.RowFilter = "[" + colName + "] = MAX([" + colName + "])";
        dtRange = dv.ToTable();
        string strMaxValue = dtRange.Rows[0][column.ColumnName].ToString();
        int maxValueLevel = (int)ParseString(strMaxValue);
        colSize = strMaxValue.Length;

        //get max typelevel of first n to 50 rows
        int sampleSize = Math.Max(dt.Rows.Count, 50);
        int maxLevel = Math.Max(minValueLevel, maxValueLevel);

        for (int i = 0; i < sampleSize; i++)
        {
            maxLevel = Math.Max((int)ParseString(dt.Rows[i][column].ToString()), maxLevel);
        }

        string enumCheck = ((dataType)maxLevel).ToString();
        T = Type.GetType(enumCheck.Replace('_', '.'));

        //if typelevel = int32 check for bit only data & cast to bool
        if (maxLevel == 1 && Convert.ToInt32(strMinValue) == 0 && Convert.ToInt32(strMaxValue) == 1)
        {
            T = Type.GetType("System.Boolean");
        }

        if (maxLevel != 5) colSize = -1;


        return T;
    }
Auspicate answered 16/3, 2011 at 13:5 Comment(0)
A
14

Since Dimi put a bounty and needs more "modern" solution, I'll try to provide one. First, what do we need from reasonable class which converts strings to different stuff?

Reasonable behavior with basic types.

Respect culture info, especially when converting numbers and dates.

Ability to extend logic with custom converters if necessary.

As a bonus avoid long "if" chains since they are quite error-prone.

public class StringConverter {
    // delegate for TryParse(string, out T)
    public delegate bool TypedConvertDelegate<T>(string value, out T result);
    // delegate for TryParse(string, out object)
    private delegate bool UntypedConvertDelegate(string value, out object result);        
    private readonly List<UntypedConvertDelegate> _converters = new List<UntypedConvertDelegate>();
    // default converter, lazyly initialized
    private static readonly Lazy<StringConverter> _default = new Lazy<StringConverter>(CreateDefault, true);

    public static StringConverter Default => _default.Value;

    private static StringConverter CreateDefault() {
        var d = new StringConverter();
        // add reasonable default converters for common .NET types. Don't forget to take culture into account, that's
        // important when parsing numbers\dates.
        d.AddConverter<bool>(bool.TryParse);
        d.AddConverter((string value, out byte result) => byte.TryParse(value, NumberStyles.Integer, d.Culture, out result));
        d.AddConverter((string value, out short result) => short.TryParse(value, NumberStyles.Integer, d.Culture, out result));
        d.AddConverter((string value, out int result) => int.TryParse(value, NumberStyles.Integer, d.Culture, out result));
        d.AddConverter((string value, out long result) => long.TryParse(value, NumberStyles.Integer, d.Culture, out result));
        d.AddConverter((string value, out float result) => float.TryParse(value, NumberStyles.Number, d.Culture, out result));
        d.AddConverter((string value, out double result) => double.TryParse(value, NumberStyles.Number, d.Culture, out result));
        d.AddConverter((string value, out DateTime result) => DateTime.TryParse(value, d.Culture, DateTimeStyles.None, out result));
        return d;
    }

    //
    public CultureInfo Culture { get; set; } = CultureInfo.CurrentCulture;

    public void AddConverter<T>(Predicate<string> match, Func<string, T> converter) {
        // create converter from match predicate and convert function
        _converters.Add((string value, out object result) => {
            if (match(value)) {
                result = converter(value);
                return true;
            }
            result = null;
            return false;
        });
    }

    public void AddConverter<T>(Regex match, Func<string, T> converter) {
        // create converter from match regex and convert function
        _converters.Add((string value, out object result) => {
            if (match.IsMatch(value)) {
                result = converter(value);
                return true;
            }
            result = null;
            return false;
        });
    }

    public void AddConverter<T>(TypedConvertDelegate<T> constructor) {
        // create converter from typed TryParse(string, out T) function
        _converters.Add(FromTryPattern<T>(constructor));
    }

    public bool TryConvert(string value, out object result) {
        if (this != Default) {
            // if this is not a default converter - first try convert with default
            if (Default.TryConvert(value, out result))
                return true;
        }
        // then use local converters. Any will return after the first match
        object tmp = null;
        bool anyMatch = _converters.Any(c => c(value, out tmp));
        result = tmp;
        return anyMatch;
    }

    private static UntypedConvertDelegate FromTryPattern<T>(TypedConvertDelegate<T> inner) {
        return (string value, out object result) => {
            T tmp;
            if (inner.Invoke(value, out tmp)) {
                result = tmp;
                return true;
            }
            else {
                result = null;
                return false;
            }
        };
    }
}

Use like this:

static void Main(string[] args) {
    // set culture to invariant
    StringConverter.Default.Culture = CultureInfo.InvariantCulture;
    // add custom converter to default, it will match strings starting with CUSTOM: and return MyCustomClass
    StringConverter.Default.AddConverter(c => c.StartsWith("CUSTOM:"), c => new MyCustomClass(c));
    var items = new[] {"1", "4343434343", "3.33", "true", "false", "2014-10-10 22:00:00", "CUSTOM: something"};
    foreach (var item in items) {
        object result;
        if (StringConverter.Default.TryConvert(item, out result)) {
            Console.WriteLine(result);
        }
    }
    // create new non-default converter
    var localConverter = new StringConverter();
    // add custom converter to parse json which matches schema for MySecondCustomClass
    localConverter.AddConverter((string value, out MySecondCustomClass result) => TryParseJson(value, @"{'value': {'type': 'string'}}", out result));
    {
        object result;
        // check if that works
        if (localConverter.TryConvert("{value: \"Some value\"}", out result)) {
            Console.WriteLine(((MySecondCustomClass) result).Value);
        }
    }
    Console.ReadKey();
}

static bool TryParseJson<T>(string json, string rawSchema, out T result) where T : new() {
    // we are using Newtonsoft.Json here
    var parsedSchema = JsonSchema.Parse(rawSchema);
    JObject jObject = JObject.Parse(json);
    if (jObject.IsValid(parsedSchema)) {
        result = JsonConvert.DeserializeObject<T>(json);
        return true;
    }
    else {
        result = default(T);
        return false;
    }
}

class MyCustomClass {
    public MyCustomClass(string value) {
        this.Value = value;
    }

    public string Value { get; private set; }
}

public class MySecondCustomClass {
    public string Value { get; set; }
}
Art answered 5/5, 2016 at 9:7 Comment(0)
A
3
    List<Type> types = new List<Type>(new Type[] {
        typeof(Boolean)
        , typeof(int)
        , typeof(double)
        , typeof(DateTime)
    });
    string t = "true";
    object retu;
    foreach (Type type in types)
    {
        TypeConverter tc = TypeDescriptor.GetConverter(type);
        if (tc != null)
        {
            try
            {
                object obj = tc.ConvertFromString(t); // your return value;
            }
            catch (Exception)
            {
                continue;
            }
        }
    }
Antibody answered 16/3, 2011 at 13:50 Comment(3)
Thanks for your input. However, using a try-catch to control the flow of your code is bad practise. Using exceptions as control flow like this makes it difficult to debug a program. Debuggers are often configured to stop on any exception, whether it is handled or not. Lots of "expected" exceptions make that harder. Exceptions should never be expected, they should be exceptional; that's why they're called "exceptions". The catch catches everything, including stuff that probably indicates a fatal error that should be reported to the user. (Credit to Eric Lippert)Auspicate
I'm agree with this comment for common cases. But in this case no, because TypeDescriptor doesn't allow to "try" convert string, and return any flag of the convert result, just throws exception. This solution works. BTW in huge projects debuggers must be configured to catch not handled exception, not all thrown. Exception - in many cases it is the mark that something wrong in try block, but it never says that it must be fatal error. Example, my code - exception throws only when string not fits for a type convert to.Antibody
It is your decision to use my solution or not. You had need better solution, neither yours, I've gave it.Antibody
T
2

Starting with the narrowest types and working towards the widest may not be the best approach. If I knew anything about the data, I'd start with the most frequently occurring type, and work toward the least. If if did not know that, I'd might or might not do some research to to get an idea of what that might be statistically, if possible. Else I'd just make my best guess. Why test for bit or datetime early if you only expect them to occur once every 10,000 records?

Terraqueous answered 15/3, 2011 at 13:47 Comment(1)
Valid points, but since I'll be testing up to a maximum of 52 rows ( row 1 to 50 + shortest + longest) per column for every import, the performance impact is negligible.Auspicate
P
1

Would it be easier to store it in a generic datatype with .ToInt16(), .ToInt32(), .ToBool(), etc.? If you write an app expecting int and it gets boolean it will fail, so it would be better to let the programmer explicit convert to the expected datatype.

The problem with your approach is that you don't know if a row containing 0 as the first item will contain -100000 as item number 100. This means you can't do a successfull conversion until all rows has been TryParsed by all the different datatypes. Very expensive operation!

If anything I'd use precompiled regular expressions and/or custom logic to process the data. For instance iterating all rows to find highest/lowest number, occurence of string, etc.

Pathfinder answered 15/3, 2011 at 12:40 Comment(1)
I have no control over the storage type: This is data import functionality. During runtime clients will import data from many possible sources, and the types must match between source and target. I have to provide datatype information with the Source data before a target is selected to make selection of target fields based on data types. Your idea of finding highest/lowest seems a step in the right direction: Thanks!Auspicate

© 2022 - 2024 — McMap. All rights reserved.