Converting plural to singular in a text file with Python
A

3

14

I have txt files that look like this:

word, 23
Words, 2
test, 1
tests, 4

And I want them to look like this:

word, 23
word, 2
test, 1
test, 4

I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:

import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
       a = a.read()
       a = a.lower()
       return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

I have also tried these 2 definitions instead of the stem definition:

def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
            return word

Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:

word, 25
test, 5

I'm not sure how to do that. A solution would be nice but not necessary.

Azral answered 13/7, 2015 at 15:50 Comment(9)
To try and collapse all the values into one line per word, I recommend looking up dictionaries in the docs. - Calpac
Do you want to do anything with plurals that don't end in 's', e.g. 'Geese'? Because that becomes a lot harder than removing a trailing s. Also, what about a singular word that ends with 's', e.g. 'Class'? Should your script handle any word, or is there a smaller, more specific pool it can draw on? - Calpac
No, I just want to remove the s, at least for now. - Azral
Just to be clear, your original stem definition does not work? Is that code exactly how you have been trying to run it? - Byers
I ask because the nltk stemmers should be more than enough for your de-pluralizing purposes. I see that you do [p.stem(word) for word in a] but don't save the results anywhere. Does a = [p.stem(word) for word in a] work for you? In my experience the stemmers do not work in-place, and so you need to store the results somewhere. - Byers
It did not work. When I tried running a test file similar to the one in my question, all the words were lowercase but they were still plurals. - Azral
When I tried your suggestion, it did change a few things, but it turned: soc, 32 soc, 1 test, 3 soc, 90 socs, 43 socs, 1 tests, 2 into: [u's', u'o', u'c', u',', u' ', u'3', u'2', u'\n', u's', u'o', u'c', u',', u' ', u'1', u'\n', u't', u'e', u's', u't', u',', u' ', u'3', u'\n', u's', u'o', u'c', u',', u' ', u'9', u'0', u'\n', u's', u'o', u'c', u's', u',', u' ', u'4', u'3', u'\n', u's', u'o', u'c', u's', u',', u' ', u'1', u'\n', u't', u'e', u's', u't', u's', u',', u' ', u'2', u'\n'] - Azral
Okay, so any suggestions I give will be more of an answer, so I'll write something below. You'll have to do more than just my suggestion to make it work, but it shouldn't be much more. - Byers
How do I add up the values for word and words? - Naranjo
B
11

It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.

def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a

This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.

def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a

This definitely doesn't work for your purposes, and there are a few different things we can do.

  1. We can change it so that we read the input file as one list of lines.
  2. We can use the big string and break it down into a list ourselves.
  3. We can go through and stem each line in the list of lines one at a time.
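For comparison, option 2 could be sketched roughly as follows. This is only an illustration, not the approach used in the rest of this answer; the helper name split_into_pairs is made up here, and it assumes every non-blank line has the form "word, number".

```python
def split_into_pairs(text):
    # Break the one big string into lines, then each line into (word, count).
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, _, count = line.partition(',')
        pairs.append((word.strip(), count.strip()))
    return pairs

sample = 'soc, 32\nsocs, 1\ndogs, 8\n'
print(split_into_pairs(sample))
# [('soc', '32'), ('socs', '1'), ('dogs', '8')]
```

With the words separated like this, a stemmer can be applied to each word rather than to each character.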

Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:

def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will be a list of lines: ['soc, 32\n', 'soc, 1\n', ...]
        b = [x.lower() for x in a]
        return b

This should give us b as a list of lines, i.e. ['soc, 32\n', 'soc, 1\n', ...] (each line keeps its trailing newline). So the next problem becomes what to do with the list of strings when we pass it to stem(). One way is the following:

def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',')  # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1]  # put it back together
        b.append(new_line)  # add it to the new list of lines
    return b

This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.

def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)


print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

When I have the following input.txt

soc, 32
socs, 1
dogs, 8

I get the following stdout:

Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None

And input.txt looks like this:

soc, 32
soc, 1
dog, 8
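
On the remark above that writing to a new file is usually safer: a minimal sketch of that variant might look like this. The names output_name and write_lines and the '_singular' suffix are arbitrary choices for illustration, not part of the original code.

```python
import os

def output_name(path, suffix='_singular'):
    # Build e.g. 'input_singular.txt' from 'input.txt'.
    root, ext = os.path.splitext(path)
    return root + suffix + ext

def write_lines(path, lines):
    # Each line is expected to already end with '\n', as readlines() returns them.
    with open(path, 'w') as out:
        out.writelines(lines)

print(output_name('input.txt'))
# input_singular.txt
```

Calling write_lines(output_name(f), stem(openfile(f))) would then leave the original file untouched, so a bad run can simply be repeated.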

The second question, about merging the numbers for identical words, changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way is to iterate through each line of your input and stem each word as you process it. I'll write up code about this in a bit, if you're still working to figure it out.
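
In the meantime, here is one possible sketch of the dictionary approach. It is not the answer's own code: merge_counts and strip_plural_s are hypothetical helpers, and the plural rule is the naive "just remove the s" from the comments (with an extra guard for 'ss' endings such as 'class'). Any other singularizer, such as an nltk stemmer, could be passed in instead.

```python
from collections import defaultdict

def strip_plural_s(word):
    # Naive rule: drop one trailing 's', but leave 'ss' endings alone.
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

def merge_counts(lines, singularize=strip_plural_s):
    # Sum the numbers of all lines whose singularized word is the same.
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, _, count = line.partition(',')
        totals[singularize(word.strip().lower())] += int(count)
    return dict(totals)

lines = ['word, 23\n', 'Words, 2\n', 'test, 1\n', 'tests, 4\n']
totals = merge_counts(lines)
for word in sorted(totals):
    print('%s, %d' % (word, totals[word]))
# test, 5
# word, 25
```

The words are printed in sorted order here only to make the output predictable; the lines could just as well be written back to a file.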

Byers answered 13/7, 2015 at 19:51 Comment(1)
This works, but it gives worse results for some words, e.g. "dry" is converted into "dri" and "neckline" into "necklin". - Flinn
C
33

If you have complex words to singularize, I'd advise against stemming; use a proper Python package like pattern instead:

from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']

singles = [singularize(plural) for plural in plurals]
print(singles)

returns:

>>> ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']

It's not perfect, but it's the best I've found. It is 96% accurate according to the docs: http://www.clips.ua.ac.be/pages/pattern-en#pluralization

Carnot answered 30/12, 2016 at 10:21 Comment(10)
It seems that the pattern package is only available for Python 2.*: 'The Python 3 version is currently only available on the development branch' - Embayment
According to their webpage: "The singularize() function returns the singular form of a plural noun. The pos parameter (part-of-speech) can be set to NOUN or ADJECTIVE". The 2nd parameter of the function singularize() is pos. - Lamee
Not so accurate: 'mucous' becomes 'mucou'. - Dowery
I think mucous is a collective noun, so it doesn't have a plural version. - Senlac
Works fine in Python 3 as well. - Senlac
The project is now located here: github.com/clips/pattern - Prescription
The inflect package is a lot easier to install: https://mcmap.net/q/660028/-plural-string-formatting - Dilly
@Dilly there is nothing difficult about installing the pattern package: 'pip install pattern' - Quartan
@Quartan you are generally correct. The procedure for installing both of these packages is the same (pip install). However, I have had issues with dependencies of pattern. I don't have an example at hand, but pip install pattern installs 39 packages; pip install inflect only installs 1. - Dilly
@Dilly Got it, fair point! Yeah, I installed pattern and I fortunately already had a sizeable portion of the dependencies, since I do a fair amount of sentiment analysis for work. - Quartan
C
4

The NodeBox English Linguistics library contains scripts for converting plural forms to singular and vice versa. Check out the tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization

To convert a plural to its singular, import the singular module and call the singular() function. It handles words with different endings, irregular forms, etc.

from en import singular
print(singular('analyses'))    # analysis
print(singular('planetoids'))  # planetoid
print(singular('children'))    # child

Cutthroat answered 24/12, 2017 at 0:30 Comment(2)
Can we pip3 install this? - Enamor
Not sure. When I used it, I just downloaded their open-source libraries (nodebox.net/code/index.php/Library.html). I would recommend asking the NodeBox authors about it. - Cutthroat
