Retrieve definition for parenthesized abbreviation, based on letter count

Asked 2/6, 2019 at 2:45 Answered 2/6, 2019 at 10:42

Solved python regex text text-parsing abbreviation

I need to retrieve the definition of an acronym based on the number of letters enclosed in parentheses. For the data I'm dealing with, the number of letters in parentheses corresponds to the number of words to retrieve. I know this isn't a reliable method for getting abbreviations, but in my case it will be. For example:

String = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'

Desired output: family health history (FHH), nurse practitioner (NP)

I know how to extract parentheses from a string, but after that I am stuck. Any help is appreciated.

 import re

 a = 'Although family health history (FHH) is commonly accepted as an 
 important risk factor for common, chronic diseases, it is rarely considered 
 by a nurse practitioner (NP).'

 x2 = re.findall('(\(.*?\))', a)

 for x in x2:
    length = len(x)
    print(x, length)

Natal answered 2/6, 2019 at 2:45 Comment(3)

I think you will need to write some parsing logic here, in addition to maybe using regex. – Bunton 2/6, 2019 at 2:55

I know I can run a loop and call len(string) to get the number of letters, but after that point I'm lost. For example, if it's 3 letters, how do I capture the previous 3 words? – Natal 2/6, 2019 at 2:59

You should use """ instead of ' for a multiline string – Cloak 2/6, 2019 at 3:0

Use the regex match to find the position of the start of the match. Then use Python string indexing to get the substring leading up to the start of the match. Split the substring into words and take the last n words, where n is the length of the abbreviation.

import re
s = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'


for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()[-size:]
    definition = " ".join(words)

    print(abbr, definition)

This prints:

FHH family health history
NP nurse practitioner

Cloak answered 2/6, 2019 at 3:7 Comment(3)

Man, what a life saver. That makes sense. Thanks so much. – Natal 2/6, 2019 at 3:9

You can add output = "" to the top of the code, and output += definition + ", (" + abbr + ")" to the end of the loop to get your desired output. – Chesson 2/6, 2019 at 3:12

I would suggest to match only capital letters: re.finditer(r"\(([A-Z]*?)\)", s) – Ovenware 2/6, 2019 at 13:24
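
Putting the two comments above together, a minimal sketch might look like the following. The function name find_definitions is made up for illustration. It restricts the match to capital letters, so parenthesized asides are ignored, and it joins the results in the format the question asks for.

```python
import re


def find_definitions(text):
    """Return 'definition (ABBR)' strings for each all-caps parenthesized abbreviation."""
    results = []
    # [A-Z]+ only matches capital letters, so asides like "(see below)" are skipped
    for match in re.finditer(r"\(([A-Z]+)\)", text):
        abbr = match.group(1)
        # words before the opening parenthesis; keep the last len(abbr) of them
        words = text[:match.start()].split()[-len(abbr):]
        results.append(" ".join(words) + " (" + abbr + ")")
    return results


s = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely '
     'considered by a nurse practitioner (NP).')
print(", ".join(find_definitions(s)))
# family health history (FHH), nurse practitioner (NP)
```

This still assumes one word per abbreviation letter, as stated in the question.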

Does this solve your problem?

a = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
splitstr = a.replace('.', '').split(' ')
results = []
for i, word in enumerate(splitstr):
    if '(' in word:
        w = word.replace('(', '').replace(')', '')
        # take the len(w) words before the abbreviation, plus the abbreviation itself
        results.append(' '.join(splitstr[i - len(w):i + 1]))

print(', '.join(results))

Actually, Keatinge beat me to it.

Irra answered 2/6, 2019 at 3:9 Comment(0)

One idea is to use a recursive pattern with the PyPI regex module.

\b[A-Za-z]+\s+(?R)?\(?[A-Z](?=[A-Z]*\))\)?

See this pcre demo at regex101

\b[A-Za-z]+\s+ matches a word boundary, one or more letters, then one or more whitespace characters
(?R)? recursive part: optionally paste the whole pattern from the start
\(? and \)? the parentheses have to be optional so that the recursion fits in
[A-Z](?=[A-Z]*\)) matches one uppercase letter if it is followed by a closing ) with only A-Z in between

Does not check whether the first letter of each word actually matches the letter at the corresponding position in the abbreviation.
Does not check for an opening parenthesis in front of the abbreviation. To check, add a variable-length lookbehind: change [A-Z](?=[A-Z]*\)) to (?<=\([A-Z]*)[A-Z](?=[A-Z]*\)).
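
The first limitation noted above can also be handled outside the regex. Here is a rough sketch that uses only the standard re module rather than the recursive pattern. The helper names initials_match and checked_definitions are hypothetical, and the sketch assumes each abbreviation letter is the first letter of exactly one preceding word.

```python
import re


def initials_match(words, abbr):
    """True if each word starts with the corresponding abbreviation letter (case-insensitive)."""
    return len(words) == len(abbr) and all(
        w[:1].upper() == c for w, c in zip(words, abbr)
    )


def checked_definitions(text):
    """Return (definition, abbreviation) pairs whose word initials spell the abbreviation."""
    pairs = []
    for m in re.finditer(r"\(([A-Z]+)\)", text):
        abbr = m.group(1)
        # candidate definition: the last len(abbr) words before the parenthesis
        words = text[:m.start()].split()[-len(abbr):]
        if initials_match(words, abbr):
            pairs.append((" ".join(words), abbr))
    return pairs


print(checked_definitions("a nurse practitioner (NP) saw the patient again (USA)."))
# [('nurse practitioner', 'NP')]
```

Here (USA) is rejected because "the patient again" does not spell out U, S, A.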

Pyretotherapy answered 2/6, 2019 at 10:42 Comment(0)

Using re with a list comprehension:

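import re
a = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'

x_lst = [str(len(i[1:-1])) for i in re.findall(r'(\(.*?\))', a)]

[re.search(r'(\S+\s+){' + i + r'}\(.{' + i + r'}\)', a).group(0) for i in x_lst]
#['family health history (FHH)', 'nurse practitioner (NP)']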

Subdivide answered 2/6, 2019 at 3:17 Comment(0)

This solution isn't particularly clever. It simply searches for the acronyms and then builds a pattern to extract the words preceding each one:

import re

string = "Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."

definitions = []

for acronym in re.findall(r'\(([A-Z]+?)\)', string):
    length = len(acronym)

    match = re.search(r'(?:\w+\W+){' + str(length) + r'}\(' + acronym + r'\)', string)

    definitions.append(match.group(0))

print(", ".join(definitions))

OUTPUT

> python3 test.py
family health history (FHH), nurse practitioner (NP)
>
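
One caveat with the code above: re.search returns None when fewer words precede an acronym than it has letters, so match.group(0) would raise an AttributeError. A possible variant that skips such cases is sketched below. The function name extract is illustrative, and re.escape is only a precaution, since acronyms matched by [A-Z]+ contain no special characters.

```python
import re


def extract(string):
    """Collect 'words (ACRONYM)' matches, skipping acronyms without enough preceding words."""
    definitions = []
    for acronym in re.findall(r'\(([A-Z]+)\)', string):
        # one \w+\W+ group per acronym letter, followed by the literal "(ACRONYM)"
        pattern = r'(?:\w+\W+){%d}\(%s\)' % (len(acronym), re.escape(acronym))
        match = re.search(pattern, string)
        if match is not None:
            definitions.append(match.group(0))
    return definitions


print(extract("history (FHH) and a nurse practitioner (NP)."))
# ['nurse practitioner (NP)']
```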

Bergama answered 2/6, 2019 at 3:22 Comment(0)
