Could use some help with this soundex coding

Asked 13/10, 2009 at 19:31 Answered 13/10, 2009 at 20:1

The US census bureau uses a special encoding called “soundex” to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code when input with a surname. A user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.

Soundex Coding Guide

Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:

Number Consonants

1 B, F, P, V 2 C, G, J, K, Q, S, X, Z 3 D, T 4 L 5 M, N 6 R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.

There are 3 additional Soundex Coding Rules that are followed. A good program design would implement these each as one or more separate functions.

Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).
Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right is not coded. Example:

*Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.

So far this is my code:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
    nextletter = surname[i]
    if nextletter in ['B','F','P','V']:
        outstring = outstring + '1'

    elif nextletter in ['C','G','J','K','Q','S','X','Z']:
        outstring = outstring + '2'

    elif nextletter in ['D','T']:
        outstring = outstring + '3'

    elif nextletter in ['L']:
        outstring = outstring + '4'

    elif nextletter in ['M','N']:
        outstring = outstring + '5'

    elif nextletter in ['R']:
        outstring = outstring + '6'

print outstring

The code sufficiently does what it is asked to, I am just not sure how to code the three rules. That is where I need help. So, any help is appreciated.

Accretion answered 13/10, 2009 at 19:31 Comment(5)

code.activestate.com/recipes/52213 – Empty 13/10, 2009 at 19:39

Note that the homework assignment said that a good design would use functions. This recipe would not get a good grade on this particular homework... :-) That said, this recipe presumably returns correct answers, and could be useful to check that the homework code works correctly. – Felonious 13/10, 2009 at 20:3

IT does not seem to implement rule 3b. – Erebus 13/10, 2009 at 21:15

Rule 3b is not often implemented. There is no such thing as a universal Soundex algorithm. The rules that the OP is quoting are taken from a US National Archives web page, which totally ignores the fact that the Census Bureau changed the rules between censuses. – Fluidics 14/10, 2009 at 0:27

Possible duplicate of Soundex algorithm in Python (homework help request) – Selfsacrifice 21/11, 2018 at 2:43

A few hints:

By using an array where each Soundex code is stored and indexed by the ASCII value (or a value in a shorter numeric range derived thereof) of the letter it corresponds to, you will both make the code for efficient and more readable. This is a very common technique: understand, use and reuse ;-)
As you parse the input string, you need keep track of (or compare with) the letter previously handled to ignore repeating letters, and handle other the rules. (implement each of these these in a separate function as hinted in the write-up). The idea could be to introduce a function in charge of -maybe- adding the soundex code for the current letter of the input being processed. This function would in turn call each of the "rules" functions, possibly quiting early based on the return values of some rules. In other words, replace the systematic...

    outstring = outstring + c    # btw could be +=
...with
    outstring += AppendCodeIfNeeded(c)

beware that this multi-function structure is overkill for such trivial logic, but it is not a bad idea to do it for practice.

Frederic answered 13/10, 2009 at 19:41 Comment(0)

Here are some small hints on general Python stuff.

0) You can use a for loop to loop over any sequence, and a string counts as a sequence. So you can write:

for nextletter in surname[1:]:
    # do stuff

This is easier to write and easier to understand than computing an index and indexing the surname.

1) You can use the += operator to append strings. Instead of

x = x + 'a'

you can write

x += 'a'

As for help with your specific problem, you will want to keep track of the previous letter. If your assignment had a rule that said "two 'z' characters in a row should be coded as 99" you could add code like this:

def rule_two_z(prevletter, curletter):
    if prevletter.lower() == 'z' and curletter.lower() == 'z':
        return 99
    else:
        return -1


prevletter = surname[0]
for curletter in surname[1:]:
    code = rule_two_z(prevletter, curletter)
    if code < 0:
        # do something else here
    outstring += str(code)
    prevletter = curletter

Hmmm, you were writing your code to return string integers like '3', while I wrote my code to return an actual integer and then call str() on it before adding it to the string. Either way is probably fine.

Good luck!

Felonious answered 13/10, 2009 at 20:1 Comment(0)

Recommended topics

Hot tags