Check if a string is a possible abbrevation for a name

Asked 7/9, 2011 at 9:20 Answered 7/9, 2011 at 10:28

Solved python string-matching slug abbreviation text-analysis

I'm trying to develop a python algorithm to check if a string could be an abbrevation for another word. For example

fck is a match for fc kopenhavn because it matches the first characters of the word. fhk would not match.
fco should not match fc kopenhavn because no one irl would abbrevate FC Kopenhavn as FCO.
irl is a match for in real life.
ifk is a match for ifk goteborg.
aik is a match for allmanna idrottskluben.
aid is a match for allmanna idrottsklubben. This is not a real team name abbrevation, but I guess it is hard to exclude it unless you apply domain specific knowledge on how Swedish abbrevations are formed.
manu is a match for manchester united.

It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.

Update I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase so it is not as easy as just checking which letters are uppercased.

Vitrify answered 7/9, 2011 at 9:20 Comment(4)

So you want to see if the string matches just the uppercase letters in those strings? If so, try writing something for it which does that: takes just the uppercase letters of your full things and slaps them in a dictionary (as the keys with the full version as the values), and then it's easy to look it up. As it is, you haven't really asked a question... – Slowly 7/9, 2011 at 9:24

Best thing I could think of is to extract all the uppercase letters, convert the short string to upper case then do equality testing. – Instead 7/9, 2011 at 9:26

Semi-OT: ManU can be seen as an insult for ManUnited fans, although it is widely used as an abbrev in non-english-countries. – Zulmazulu 7/9, 2011 at 9:29

I don't think that is possible, e.g. how should you decide that aikis valid, but aid is not valid? – Zulmazulu 7/9, 2011 at 10:15

This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:

The first letter of the abbreviation must match the first letter of the text
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
- the remaining words, or
- the remaining text starting from any position in the first word.

tests=(
    ('fck','fc kopenhavn',True),
    ('fco','fc kopenhavn',False),
    ('irl','in real life',True),
    ('irnl','in real life',False),    
    ('ifk','ifk gotebork',True),   
    ('ifko','ifk gotebork',False),    
    ('aik','allmanna idrottskluben',True),
    ('aid','allmanna idrottskluben',True),
    ('manu','manchester united',True), 
    ('fz','faz zoo',True), 
    ('fzz','faz zoo',True),
    ('fzzz','faz zoo',False),    
    )

def is_abbrev(abbrev, text):
    abbrev=abbrev.lower()
    text=text.lower()
    words=text.split()
    if not abbrev:
        return True
    if abbrev and not text:
        return False
    if abbrev[0]!=text[0]:
        return False
    else:
        return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
                any(is_abbrev(abbrev[1:],text[i+1:])
                    for i in range(len(words[0]))))

for abbrev,text,answer in tests:
    result=is_abbrev(abbrev,text)
    print(abbrev,text,result,answer)
    assert result==answer

Frentz answered 7/9, 2011 at 9:28 Comment(2)

Darn, beat me by 30 seconds :) +1 – Litt 7/9, 2011 at 9:29

Sorry, all the strings are supposed to be all lowercase. All is lowercase in the original. – Vitrify 7/9, 2011 at 10:5

Here's a way to accomplish what you seem to want to do

import re    
def is_abbrev(abbrev, text):
    pattern = ".*".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

The caret makes sure that the first character of the abbreviation matches the first character of the word, it should be true for most abbreviations.

Edit: Your new update changed the rules a bit. By using "(|.*\s)" instead of ".*" the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.

This will correctly match fck with FC Kopenhavn, but fco will not. However, matching aik with allmanna idrottskluben will not work, as that requires knowledge of the swedish language and is not as trivial to do.

Here's the new code with the minor modification

import re    
def is_abbrev(abbrev, text):
    pattern = "(|.*\s)".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

Plast answered 7/9, 2011 at 10:5 Comment(0)

@Ocaso Protal said in comment how should you decide that aik is valid, but aid is not valid? and he is right.

The algo which came in my mind is to work with word threshold (number of words separated by space).

words = string.strip().split()
if len(words) > 2:
   #take first letter of every word
elif len(words) == 2:
   #take two letters from first word and one letter from other
else:
   #we have single word, take first three letter or as you like

you have to define your logic, you can't find abbreviation blindly.

Galilee answered 7/9, 2011 at 10:28 Comment(0)

Your algorithm seems simple - the abbreviation is the Concatenation of all upper case letters. so:

upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
    if letter in letters:
        abbrevation += letter
for abb in _list_of_abbrevations:
    if abb=abbrevation:
        great_success()

Asinine answered 7/9, 2011 at 9:28 Comment(1)

You could use string.ascii_uppercase – Instead 7/9, 2011 at 9:30

This might be good enough.

def is_abbrevation(abbrevation, word):
    lowword = word.lower()
    lowabbr = abbrevation.lower()

    for c in lowabbr:
        if c not in lowword:
            return False

    return True

print is_abbrevation('fck', 'FC Kopenhavn')

Wetterhorn answered 7/9, 2011 at 9:30 Comment(1)

That is not correct, e.g. try print is_abbrevation('fkc', 'FC Kopenhavn') – Zulmazulu 7/9, 2011 at 9:56

Recommended topics

Hot tags