Python tokenize sentence with optional key/val pairs

Asked 22/7, 2013 at 18:50 Answered 15/6, 2021 at 16:39

Solved python regex tokenize text-parsing

I'm trying to parse a line of text that contains a sentence, optionally followed by some key/val pairs on the same line. Not only are the key/value pairs optional, they are dynamic. I'm looking for a result something like:

Input:

"There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

Output:

Values = {'theSentence' : "There was a cow at home.",
          'home' : "mary",
          'cowname' : "betsy",
          'date' : "10-jan-2013"
         }

Input:

"Mike ordered a large hamburger. lastname=Smith store=burgerville"

Output:

Values = {'theSentence' : "Mike ordered a large hamburger.",
          'lastname' : "Smith",
          'store' : "burgerville"
         }

Input:

"Sam is nice."

Output:

Values = {'theSentence' : "Sam is nice."}

Thanks for any input/direction. I know the sentences make this look like a homework problem, but I'm just a Python newbie. I know it's probably a regex solution, but I'm not the best with regex.

Friedcake answered 22/7, 2013 at 18:50 Comment(4)

Is the sentence guaranteed to end on a .? – Industrialist 22/7, 2013 at 18:51

Can you assume that = will not appear in the sentence itself? – Stakhanovism 22/7, 2013 at 18:53

split(), split(), split(). – Determinative 22/7, 2013 at 18:53

is there a compelling reason the variables follow one form and the sentence does not? ie "thesentence=some sentence you want to see". Ideally you'd have some delimiter here. – Cowry 22/7, 2013 at 18:53

I'd use re.sub:

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

d = {}

def add(m):
    d[m.group(1)] = m.group(2)

s = re.sub(r'(\w+)=(\S+)', add, s)
d['theSentence'] = s.strip()

print(d)

Here's more compact version if you prefer:

d = {}
d['theSentence'] = re.sub(r'(\w+)=(\S+)',
    lambda m: d.setdefault(m.group(1), m.group(2)) and '',
    s).strip()

Or, maybe, findall is a better option:

rx = r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)'
d = {
    a or 'theSentence': (b or c).strip()
    for a, b, c in re.findall(rx, s)
}
print(d)
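
For reuse, the re.sub idea can be wrapped in a function and checked against the inputs from the question. This is only a minimal sketch, not the answerer's code: the names parse_line, PAIR and collect are made up for illustration, and it assumes keys match \w+ and values contain no whitespace.

```python
import re

# Hypothetical helper names, chosen for illustration only.
PAIR = re.compile(r'(\w+)=(\S+)')

def parse_line(line):
    values = {}

    def collect(match):
        # Record the pair, then replace it with nothing in the sentence
        values[match.group(1)] = match.group(2)
        return ''

    values['theSentence'] = PAIR.sub(collect, line).strip()
    return values

print(parse_line("Mike ordered a large hamburger. lastname=Smith store=burgerville"))
print(parse_line("Sam is nice."))
```

A line without any pairs simply comes back with only the 'theSentence' key.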

Faison answered 22/7, 2013 at 19:4 Comment(3)

cmooon, make it a one-liner. You know you want to – Hoover 22/7, 2013 at 19:7

@SlaterTyranus The Zen says: Sparse is better than dense. – Fortis 22/7, 2013 at 19:11

Thanks for the quick response! It's very much appreciated. This solution works both with and without periods, so it's great. – Friedcake 22/7, 2013 at 20:5

The first step is to do

inputStr = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
theSentence, others = inputStr.split('.')

You're going to then want to break up "others". Play around with split() (the argument you pass in tells Python what to split the string on), and see what you can do. :)
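
One way the second step might look, as a sketch under the same assumption (the first period ends the sentence, and every remaining token has the form key=value). The variable names are only illustrative.

```python
input_str = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

# maxsplit=1 splits at the first period only, so later periods stay in the tail
the_sentence, others = input_str.split('.', 1)

values = {'theSentence': the_sentence + '.'}
for token in others.split():             # split the tail on whitespace
    key, _, val = token.partition('=')   # split each token at its first '='
    values[key] = val

print(values)
```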

Silverweed answered 22/7, 2013 at 18:53 Comment(3)

Don't name variables str, native datatype!! – Fortis 22/7, 2013 at 19:1

@ManuelGutierrez thanks! Wow that's a bad habit I accidentally developed, always assumed it was string and so str was safe... – Silverweed 22/7, 2013 at 19:2

This doesn't... answer the question at all. Why does this have more upvotes than answers that are actually answers? – Hoover 22/7, 2013 at 19:8

If your sentence is guaranteed to end with a ".", you could use the following approach.

>>> Values = {}
>>> testList = inputString.split('.')
>>> Values['theSentence'] = testList[0] + '.'

For the rest of the values, just do:

>>> for elem in testList[1].split():
...     key, val = elem.split('=')
...     Values[key] = val

Giving you a Values like so

>>> Values
{'date': '10-jan-2013', 'home': 'mary', 'cowname': 'betsy', 'theSentence': 'There was a cow at home.'}
>>> Values2
{'lastname': 'Smith', 'theSentence': 'Mike ordered a large hamburger.', 'store': 'burgerville'}
>>> Values3
{'theSentence': 'Sam is nice.'}

Industrialist answered 22/7, 2013 at 18:58 Comment(0)

Assuming there is only one dot, which separates the sentence from the assignment pairs:

text = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
sentence, assignments = text.split(". ")

result = {'theSentence': sentence + "."}
for item in assignments.split():
    key, value = item.split("=")
    result[key] = value

print(result)

prints:

{'date': '10-jan-2013', 
 'home': 'mary', 
 'cowname': 'betsy', 
 'theSentence': 'There was a cow at home.'}
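
Unpacking the result of split(". ") into two names raises ValueError when the line has no key/value pairs, as in "Sam is nice.". A possible variant is sketched below with str.partition, which always returns three parts. The function name parse is hypothetical, and the sketch still assumes the first ". " marks the end of the sentence.

```python
def parse(line):
    # partition() returns (head, separator, tail); separator and tail are ''
    # when ". " does not occur, so lines without pairs still work
    head, sep, tail = line.partition(". ")
    sentence = head + "." if sep else head
    result = {'theSentence': sentence}
    for item in tail.split():
        key, value = item.split("=", 1)
        result[key] = value
    return result

print(parse("Sam is nice."))
print(parse("There was a cow at home. home=mary cowname=betsy date=10-jan-2013"))
```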

Bauske answered 22/7, 2013 at 18:58 Comment(2)

+1 We think identical on this one, I'm not even posting mine. BTW why the if item: check? Looks like the for will do. – Fortis 22/7, 2013 at 19:5

Thank you, I've removed the if item check and switched to splitting by ". " instead of just a dot. – Bauske 22/7, 2013 at 19:7

Assuming = doesn't appear in the sentence itself. This seems like a safer assumption than the sentence ending with a ".".

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

eq_loc = s.find('=')
if eq_loc > -1:
    meta_loc = s[:eq_loc].rfind(' ')
    metastr = s[meta_loc + 1:]  # take the pairs before truncating s
    s = s[:meta_loc]

    metadict = dict(m.split('=') for m in metastr.split())
else:
    metadict = {}

metadict["theSentence"] = s

Stakhanovism answered 22/7, 2013 at 19:0 Comment(0)

So as usual, there's a bunch of ways to do this. Here's a regexp based approach that looks for key=value pairs:

import re

sentence = "..."

values = {}
for match in re.finditer(r"(\w+)=(\S+)", sentence):
    if not values:
        # everything to the left of the first key/value pair is the sentence
        values["theSentence"] = sentence[:match.start()].strip()
    key, value = match.groups()
    values[key] = value
if not values:
    # no key/value pairs, keep the entire sentence
    values["theSentence"] = sentence

This assumes that the key is a Python-style identifier, and that the value consists of one or more non-whitespace characters.
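
To make those assumptions concrete, here is a small sketch of what the pattern (\w+)=(\S+) does and does not capture. The sample strings are invented for illustration and are not from the question.

```python
import re

pair = re.compile(r"(\w+)=(\S+)")

# A value stops at the first whitespace character
spaced = pair.findall("name=Mary Ann")

# '-' is not a word character, so only the part after it is taken as the key
hyphenated = pair.findall("cow-name=betsy")

# Punctuation directly after a value becomes part of the value
punctuated = pair.findall("home=mary, date=10-jan-2013")

print(spaced)       # [('name', 'Mary')]
print(hyphenated)   # [('name', 'betsy')]
print(punctuated)   # [('home', 'mary,'), ('date', '10-jan-2013')]
```

If values may contain spaces or keys may contain hyphens, the pattern would need to be adapted to whatever delimiter rules the real data follows.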

Welch answered 22/7, 2013 at 19:1 Comment(0)

Supposing that the first period separates the sentence from the values, you can use something like this:

#! /usr/bin/python3

a = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

values = (lambda s, tail: (lambda d, kv: (d, d.update (kv) ) ) ( {'theSentence': s}, {k: v for k, v in (x.split ('=') for x in tail.strip ().split (' ') ) } ) ) (*a.split ('.', 1) ) [0]

print (values)

Privateer answered 22/7, 2013 at 19:4 Comment(2)

lambdas are slow and a bit overkill for this methinks. – Hoover 22/7, 2013 at 19:12

There have been various discussions on Stack Overflow comparing lambda expressions with named functions. IIRC, once compiled there is no way to tell them apart, but I'm not sure. My point was more to show the multi-paradigm character of Python: use it procedurally (like the other answers here), functionally (like mine), object-oriented, whatever suits you best according to your personal preferences. – Privateer 22/7, 2013 at 20:13
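
On the lambda question, one thing that can be checked directly is that both forms produce ordinary function objects of the same type. A small sketch with made-up names follows; it says nothing about relative speed.

```python
import types

def add_one(x):
    return x + 1

add_one_lambda = lambda x: x + 1

# Both are plain function objects
same_type = type(add_one) is type(add_one_lambda) is types.FunctionType
print(same_type)

# The automatically assigned name is one visible difference
print(add_one.__name__, add_one_lambda.__name__)
```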

Nobody posted a comprehensible one-liner. The question is answered, but gotta do it in one line, it's the Python way!

dict({"theSentence": sentence.split(".")[0]}, **{item.split("=")[0]: item.split("=")[1] for item in sentence.split(".")[1].split()})

Eh, not super elegant, but it's totally in one line. No imports even.

Hoover answered 22/7, 2013 at 19:12 Comment(2)

In my opinion, that's the exact opposite of the Python way. – Dermatome 22/7, 2013 at 19:17

If I wanted to get headaches while coding, I'd use Perl. :P – Dermatome 22/7, 2013 at 19:22

Use the regular expression function findall. The first capture group is the sentence. | is the "or" condition for the second alternative: one or more spaces, a key of one or more word characters (captured), the equals sign, and a value of one or more non-space characters (captured).

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
all_matches = re.findall(r'([\w\s]+\.)|\s+(\w+)=(\S+)', s)
d = {}
for sentence, key, value in all_matches:
    if sentence:
        d["theSentence"] = sentence
    else:
        # the whitespace before the key is matched but not captured
        d[key] = value

print(d)

output:

  {'theSentence': 'There was a cow at home.', 'home': 'mary', 'cowname': 'betsy', 'date': '10-jan-2013'}

Anemography answered 15/6, 2021 at 16:39 Comment(0)
