Interpreting Strings as Other Data Types in Python
C

9

7

I'm reading a file into Python 2.4 that's structured like this:

field1: 7
field2: "Hello, world!"
field3: 6.2

The idea is to parse it into a dictionary that takes fieldfoo as the key and whatever comes after the colon as the value.

I want to convert whatever is after the colon to its "actual" data type, that is, '7' should be converted to an int, "Hello, world!" to a string, etc. The only data types that need to be parsed are ints, floats and strings. Is there a function in the Python standard library that would allow one to make this conversion easily?

The only things this should be used to parse were written by me, so (at least in this case) safety is not an issue.

Coagulase answered 31/1, 2012 at 0:30 Comment(0)
F
1

For older Python versions, like the one in the question, the eval function can be used. To reduce the evilness, pass a dict as the second argument to serve as the global namespace; setting "__builtins__" to None in it prevents calls to built-in functions.

>>> [eval(i, {"__builtins__":None}) for i in ['6.2', '"Hello, world!"', '7']]
[6.2, 'Hello, world!', 7]
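A minimal sketch of how a restricted eval could fall back to the raw text when a value is not a valid expression (for example an unquoted word). The helper name eval_or_str is hypothetical, and an empty dict is used for "__builtins__" here instead of None as an assumption, so that unknown names fail with an ordinary NameError. Restricting builtins narrows what eval can reach but does not make it safe for untrusted input.

```python
def eval_or_str(text):
    """Evaluate text as a Python expression; return it unchanged on failure.

    Hypothetical helper for illustration. The empty builtins dict blocks
    direct calls to built-in functions but is not a security boundary.
    """
    try:
        return eval(text, {"__builtins__": {}})
    except Exception:
        # Unquoted words (NameError) or malformed input (SyntaxError)
        # are kept as plain strings.
        return text


values = [eval_or_str(s) for s in ['6.2', '"Hello, world!"', '7', 'Hello']]
```

With this fallback, an unquoted value such as Hello is returned as the string 'Hello' rather than raising.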
Firearm answered 31/1, 2012 at 0:50 Comment(1)
It raises "SyntaxError: unexpected EOF while parsing" when applied to "alphanumeric" values instead of interpreting them as a string.Dowsabel
S
6

First parse your input into a list of pairs like fieldN: some_string. You can do this easily with the re module, or probably even more simply by slicing left and right of the index line.strip().find(': '). Then use a literal eval on the value some_string:

>>> import ast
>>> ast.literal_eval('6.2')
6.2
>>> type(_)
<type 'float'>
>>> ast.literal_eval('"Hello, world!"')
'Hello, world!'
>>> type(_)
<type 'str'>
>>> ast.literal_eval('7')
7
>>> type(_)
<type 'int'>
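The two steps described above (split each line at ': ', then literal-eval the value) might be combined roughly as follows. This is a sketch, not the answerer's code: the function name parse_fields and the sample lines are assumptions, str.partition is used in place of find and slicing, and it needs a Python version that ships the ast module.

```python
import ast


def parse_fields(lines):
    """Map 'name: literal' lines to a dict of converted values."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, raw = line.partition(': ')
        if not sep:
            continue  # skip lines without a ': ' separator
        # literal_eval only accepts Python literals, so '7' -> int,
        # '6.2' -> float and a quoted value -> str.
        result[key] = ast.literal_eval(raw)
    return result


sample = ['field1: 7', 'field2: "Hello, world!"', 'field3: 6.2']
parsed = parse_fields(sample)
```

Reading from a real file would only change the input, for example by passing the open file object instead of the sample list.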
Sartorius answered 31/1, 2012 at 0:35 Comment(7)
The version of python I'm using doesn't have the ast module.Coagulase
@MikeSamuel obviously the input must be preprocessed into fieldn: string pairs first, but that part is trivial. @julio.alegria _ is a handy shortcut for the last returned value in the interactive interpreter. @Coagulase ..erm.. now you tell me ;) upgrade python? is there a reason why you need to use such an old version?Sartorius
@Mike Samuel: Safety isn't an issue for me. I don't need to parse anything that I haven't written myself with another program. +1 on your comment for pointing it out, though.Coagulase
mail.python.org/pipermail/python-list/2009-September/… here someone backported literal_eval to 2.4, but it all sounds a bit hacky to me. i would prefer to upgrade python than use that, personally.Sartorius
@wim: I figured out I could just use eval(). See answer below, and thanks for pointing me in the right direction.Coagulase
I know it's hacky and evil, but I don't see any other easy way of doing it.Coagulase
@juliomalegria are you seriously that lost?Jute
B
4

You can use yaml to parse the literals. It has one advantage over ast: it does not raise an error when a string is not wrapped in an extra pair of apostrophes or quotation marks.

>>> import yaml
>>> yaml.safe_load('7')
7
>>> yaml.safe_load('Hello')
'Hello'
>>> yaml.safe_load('7.5')
7.5
Bunker answered 18/2, 2022 at 3:11 Comment(0)
L
2

You can attempt to convert it to an int first using the built-in function int(). If the string cannot be interpreted as an int, a ValueError exception is raised. You can then attempt to convert it to a float using float(). If this also fails, just return the initial string.

def interpret(val):
    try:
        return int(val)
    except ValueError:
        try:
            return float(val)
        except ValueError:
            return val
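Applied to the file format from the question, the function above might be combined with a small line splitter like this. The parse_line helper and its quote handling are assumptions for illustration: a value wrapped in double quotes is treated as a string and unwrapped, and anything else goes through interpret.

```python
def interpret(val):
    """Try int, then float, and fall back to the original string."""
    try:
        return int(val)
    except ValueError:
        try:
            return float(val)
        except ValueError:
            return val


def parse_line(line):
    """Split 'name: value' at the first colon and convert the value."""
    key, _, raw = line.partition(':')
    raw = raw.strip()
    # A value in double quotes is always a string, even if it looks numeric.
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return key.strip(), raw[1:-1]
    return key.strip(), interpret(raw)


lines = ['field1: 7', 'field2: "Hello, world!"', 'field3: 6.2']
fields = dict(parse_line(line) for line in lines)
```

Checking for quotes before calling interpret keeps a quoted value such as "7" as the string '7' instead of turning it into an int.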
Lorica answered 31/1, 2012 at 0:56 Comment(0)
A
1

Since the "only data types that need to be parsed are int, float and str", maybe something like this will work for you:

entries = {'field1': '7', 'field2': "Hello, world!", 'field3': '6.2'}

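for k, v in entries.items():
    # isdecimal() is True only for digit-only strings, so a signed value
    # such as '-7' falls through to float() and becomes -7.0.
    if v.isdecimal():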
        conv = int(v)
    else:
        try:
            conv = float(v)
        except ValueError:
            conv = v
    entries[k] = conv

print(entries)
# {'field2': 'Hello, world!', 'field3': 6.2, 'field1': 7}
Absent answered 31/1, 2012 at 0:57 Comment(0)
S
1

There is the strconv library.

In [22]: import strconv
/home/tworec/.local/lib/python2.7/site-packages/strconv.py:200: UserWarning: python-dateutil is not installed. As of version 0.5, this will be a hard dependency of strconv fordatetime parsing. Without it, only a limited set of datetime formats are supported without timezones.
  warnings.warn('python-dateutil is not installed. As of version 0.5, '

In [23]: strconv.convert('1.2')
Out[23]: 1.2

In [24]: type(strconv.convert('1.2'))
Out[24]: float

In [25]: type(strconv.convert('12'))
Out[25]: int

In [26]: type(strconv.convert('true'))
Out[26]: bool

In [27]: type(strconv.convert('tRue'))
Out[27]: bool

In [28]: type(strconv.convert('12 Jan'))
Out[28]: str

In [29]: type(strconv.convert('12 Jan 2018'))
Out[29]: str

In [30]: type(strconv.convert('2018-01-01'))
Out[30]: datetime.date
Sennet answered 9/5, 2018 at 17:56 Comment(1)
Actually, it does not handle unicode strings, see github.com/bruth/strconv/issues/2Unbeaten
B
0

Hope this helps to do what you are trying to do:

#!/usr/bin/python

a = {'field1': 7}
b = {'field2': "Hello, world!"}
c = {'field3': 6.2}

temp1 = type(a['field1'])
temp2 = type(b['field2'])
temp3 = type(c['field3'])

print temp1
print temp2
print temp3
Brack answered 31/1, 2012 at 0:40 Comment(2)
I don't want to get the types of objects in a dictionary, I want to convert strings in a dictionary that are annotated as python types to the types they represent.Coagulase
Can you post example input and output? That would be easier to understand.Brack
C
0

Thanks to wim for helping me figure out what I needed to search for to figure this out.

One can just use eval():

>>> a=eval("7")
>>> b=eval("3")
>>> a+b
10
>>> b=eval("7.2")
>>> a=eval("3.5")
>>> a+b
10.699999999999999
>>> a=eval('"Hello, "')
>>> b=eval('"world!"')
>>> a+b
'Hello, world!'
Coagulase answered 31/1, 2012 at 0:57 Comment(3)
Great! Now make sure you don't import os in your source, to avoid evaluating values like os.system("rm *"). And that's not the only way. So this method works, but it's not recommended.Wald
It's evil and insecure, but this entire script is a quick and dirty fix that should (ideally) be thrown away in a few months.Coagulase
I had a Q&D awk script that I wrote in 1989 implementing a very crude commercial order processor “until the app we're waiting for is ready” that was still being used up to 1996 that I know of, and a Q&D 1995 QBasic army service chores assigner (whatever you might understand of it :) that was still used in 2007 (albeit modified by others to no end, I presume), so I'm certain “quick&dirty” programs are as quick but a lot dirtier than people usually think they are.Wald
C
0

I put together this function to help with the type inference of lists.

from typing import List

import numpy as np


def infer_dtypes(values: List, sample_size: int = 300, stop_after: int = 300):
    """
    Infers the data type by randomly sampling from a list. Values are explicitly converted to string before checking.

    Args:
        values (list): A list to infer data types from.
        sample_size (int, optional): The number of values to sample from the list. Entire list will be sampled if set to None. Defaults to 300.
        stop_after (int, optional): The maximum number of non-empty values needed for the test. Equal to sample_size if set to None. Defaults to 300.

    Returns:
        str: The inferred data type ('int', 'float', 'bool', 'str', 'mixed', 'empty').
    """
    found = 0
    non_empty_count = 0

    sample_size = sample_size if sample_size is not None else len(values)
    stop_after = stop_after if stop_after is not None else sample_size

    for v in np.random.choice(values, sample_size):
        v = str(v)
        if v != '':
            non_empty_count += 1
            if non_empty_count > stop_after:
                break
            try:
                int(v)
                found |= 1
            except ValueError:
                try:
                    float(v)
                    found |= 2
                except ValueError:
                    if v.lower() in ['true', 'false']:
                        found |= 4
                    else:
                        found |= 8


    # Check if the data is mixed
    if bin(found).count('1') > 1:
        return 'mixed'

    if found & 8:
        return 'str'
    elif found & 4:
        return 'bool'
    elif found & 2:
        return 'float'
    elif found & 1:
        return 'int'
    else:
        return 'empty'

Produces:

infer_dtypes(['', '', '1', '2', '3', '4', '5'])  # int
infer_dtypes(['', '', '1.0', '2.0', '', '3.0', '4.4', '5.0'])  # float
infer_dtypes(['', '', 'True', 'False', '', '', 'False', 'True'])  # bool
infer_dtypes(['', '', 'never', 'gonna', '', '', 'give', ''])  # str
infer_dtypes(['', '', 'never', '', '5', 'True', '5.2', ''])  # mixed
infer_dtypes(['', '', '', '', '', '', '', ''])  # empty

Rationale, feel free to skip this:

I wrote this function as currently Pandas' df.convert_dtypes, df.infer_objects and pd.to_numeric don't work nicely if you have columns with empty strings. This could be solved (source 1, source 2) if a DataFrame has columns of uniform datatypes, for example if we know that it only has floats we could replace '' with np.nan and then infer. However for a DataFrame with mixed column types (strings, floats, ints), replacing '' with np.nan wouldn't work. This function helps solve this issue by running:

values = np.where(pd.isnull(df.T.values), '', df.T.values)
for l in values:
    infer_dtypes(l)

See this GitHub Gist for a full example. Hope it helps!

Cherry answered 4/6 at 7:2 Comment(0)
