Python 3.5 - Get counter to report zero-frequency items

Asked 29/6, 2017 at 15:14 Answered 15/2, 2021 at 13:24

I am doing textual analysis on texts that, due to PDF-to-txt conversion errors, sometimes lump words together. So instead of matching words, I want to match strings.

For example, I have the string:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'

And I search for

key_words=['loss', 'debt', 'debts', 'elephant']

The output should be of the form:

Filename Debt Debts Loss Elephant
mystring  2    1     1    0

The code I have works well, except for a few glitches: 1) it does not report the frequency of zero-frequency words (so 'Elephant' would not be in the output); 2) the order of the words in key_words seems to matter (i.e. I sometimes get 1 count each for 'debt' and 'debts', and sometimes it reports only 2 counts for 'debt', and 'debts' is not reported). I could live with the second point if I managed to "print" the variable names to the dataset... but I am not sure how.

Below is the relevant code. Thanks! PS. Needless to say, it is not the most elegant piece of code, but I am slowly learning.

bad=set(['debts', 'debt'])

csvfile=open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer=csv.writer(csvfile)
for filename in glob.glob('*.txt'):

    with open(filename, encoding='utf-8', errors='ignore') as f:
      file_name=[]
      file_name.append(filename)

      new_review=[f.read()]
      freq_all=[]
      rev=[]

      from collections import Counter

      for review in new_review:
        review_processed=review.lower()
        for p in list(punctuation):
           review_processed=review_processed.replace(p,'')
           pattern = re.compile("|".join(bad), flags = re.IGNORECASE)
           freq_iter=collections.Counter(pattern.findall(review_processed))           

        frequency=[value for (key,value) in sorted(freq_iter.items())]
        freq_all.append(frequency)
        freq=[v for v in freq_all]

    fulldata = [ [file_name[i]] + freq  for i, freq in enumerate(freq)]  

    writer=csv.writer(open("freq_10k_test.csv",'a',newline='', encoding='cp850', errors='replace'))
    writer.writerows(fulldata)
    csvfile.flush()

Tetrasyllable answered 29/6, 2017 at 15:14 Comment(2)

May I point out that "Python 3.5 - Get counter to report zero-frequency items" is misleading, since Python has a Counter in its collections module and this question is not related to it (see e.g. my answer). A better question title would be e.g. "Python 3 - Count number of occurrences of a set of substrings - with overlapping substrings" – Undertrump 17/2, 2021 at 9:5

To get counter to report zero-frequency items (which brought me here) you need to initialize it with zero-frequency items, e.g. Counter({x:0 for x in list}) – Undertrump 17/2, 2021 at 9:7

You can just pre-initialize the counter, something like this:

freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))

One nice thing about Counter is that you don't actually have to pre-initialize it - you can just do c = Counter(); c['key'] += 1, but nothing prevents you from pre-initializing some values to 0 if you want.
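As a minimal, self-contained sketch of the pre-initialization idea (the sample text and the shortened keyword list are assumptions made for illustration, not the asker's data):

```python
from collections import Counter
import re

key_words = ['loss', 'debt', 'elephant']
text = 'the lossof our income made us go into debt'

pattern = re.compile('|'.join(re.escape(k) for k in key_words))

# Start every keyword at 0 so that absent ones still appear in the result.
freq = Counter({k: 0 for k in key_words})
freq.update(pattern.findall(text))

# Iterate over key_words (not over freq) to get a stable column order.
row = [freq[k] for k in key_words]
print(row)  # [1, 1, 0]
```

Building the row from key_words rather than from sorted(freq.items()) also keeps the columns aligned across files, whichever keywords happen to occur.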

For the debt/debts thing - that is just an insufficiently specified problem. What do you want the code to do in that case? If you want it to match the longest pattern, sort the list longest-first; that will solve it. If you want both reported, you may need to do multiple searches and save all the results.
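A rough sketch of the longest-first option, using a made-up sample string and only the two competing keywords:

```python
import re

key_words = ['debt', 'debts']
text = 'debtor debts my debt'

# Put longer alternatives first: the regex engine takes the first
# alternative that matches at a position, not the longest one.
ordered = sorted(key_words, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in ordered))

matches = pattern.findall(text)
print(matches)  # ['debt', 'debts', 'debt']
```

With this ordering each piece of text is attributed to exactly one keyword, so 'debts' no longer also counts towards 'debt'.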

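Updated to add some information on why it can't find debts: That has more to do with how regex alternation works than anything else. An alternation tries its alternatives from left to right and takes the first one that matches, not the longest, and once re.findall finds a match, it doesn't include that text in subsequent matches. Because bad is a set, the order of the joined alternatives is not fixed, which would explain why the results vary: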

In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']

If you really want to find all instances of every word, you need to do them separately:

In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']

In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']

However, maybe what you are really looking for is words. In this case, use the \b anchor to require a word boundary:

In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']

In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']

I don't know whether this is what you want or not... in this case, it was able to differentiate debt and debts correctly, but it skipped debtor, because there debt is only a substring of a longer word and we asked it not to match those.

Depending on your use case, you may want to look into stemming the text... I believe there is a stemmer in nltk that is pretty simple (I have used it only once, so I won't try to post an example... the question "Combining text stemming and removal of punctuation in NLTK and scikit-learn" may be useful). It should reduce debt, debts, and debtor all to the same root word debt, and do similar things for other words. This may or may not be helpful; I don't know what you are doing with it.

Columbarium answered 29/6, 2017 at 15:37 Comment(4)

Be careful with using zero values in counters, though. If you do some arithmetic operations with the counter, then keys and values can be silently lost. – Roadhouse 29/6, 2017 at 15:44
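A small sketch of that caveat, with made-up counts: binary Counter operators drop non-positive entries, whereas update() keeps them.

```python
from collections import Counter

freq = Counter({'debt': 2, 'elephant': 0})

# Binary operators such as + return a new Counter that keeps only
# positive counts, so the zero entry disappears.
combined = freq + Counter({'debt': 1})
print(combined)  # Counter({'debt': 3})

# update() works in place and keeps the zero entry.
freq.update({'debt': 1})
print(freq['elephant'], 'elephant' in freq)  # 0 True
```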

Thanks. I'll have to go through the full list to see whether I keep the singular/plural. For my own benefit, why does Counter not find occurrences of all the strings in the list, but only keep the shortest (ie 'debt' versus 'debts')? – Tetrasyllable 29/6, 2017 at 17:56

I'll go with this solution, because it works like a charm with minimal editing. I'll note the caveat highlighted by @wim, though. – Tetrasyllable 30/6, 2017 at 2:11

Updated the answer with information on why the regex doesn't work. I didn't know that about Counter either - good to know! (I thought at one time it was just a defaultdict(int), but it has some custom behaviours, like being able to update/initialize from a list, which the defaultdict can't do... also good to know.) – Columbarium 3/7, 2017 at 15:45

To get the output you want:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
for kw in key_words:
  count = mystring.count(kw)
  print('%s %s' % (kw, count))

Or for words:

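from collections import defaultdict
from string import punctuation

# Keep duplicates (a set would cap every count at 1) and strip
# surrounding punctuation so that 'debts.' counts as 'debts'.
words = [w.strip(punctuation) for w in mystring.split()]
key_words=['loss', 'debt', 'debts', 'elephant']
d = defaultdict(int)
for word in words:
  d[word] += 1

for kw in key_words:
  print('%s %s' % (kw, d[kw]))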

Costello answered 29/6, 2017 at 15:26 Comment(4)

in part 2, you could change 'dict()' to 'defaultdict(int)' to get rid of the inner 'if' statements. – Counsel 29/6, 2017 at 15:41

Thanks. I'll test this as soon as I get back to my computer. – Tetrasyllable 29/6, 2017 at 18:0

It does work well, however, would require me to edit my code, more so than with the alternative. Thanks! – Tetrasyllable 30/6, 2017 at 2:12

I thought Counter used to just be basically a defaultdict(int), but it has some other behaviours which aren't supported by defaultdict (such as updating/initializing with a list). So this might be necessary when you don't want that special behaviour. – Columbarium 3/7, 2017 at 15:29

A sleek solution is to use the third-party regex module (not the built-in re), whose findall supports overlapped matches:

import regex
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
print ({k:len(regex.findall(k,mystring,overlapped=True)) for k in key_words})

which results in:

{'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
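If installing a third-party module is not an option, a similar overlapping count can be sketched with the standard re module and a lookahead. The helper name count_overlapping is made up for this illustration:

```python
import re

mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']

def count_overlapping(needle, haystack):
    # A lookahead matches at a position without consuming text, so every
    # start position is tested, including ones inside an earlier match.
    return len(re.findall('(?=' + re.escape(needle) + ')', haystack))

counts = {k: count_overlapping(k, mystring) for k in key_words}
print(counts)  # {'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
```

Unlike str.count, this also counts self-overlapping occurrences: count_overlapping('aa', 'aaaa') gives 3, while 'aaaa'.count('aa') gives 2.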

Undertrump answered 15/2, 2021 at 12:52 Comment(0)

Counting the occurrences can be done in a simple one-liner:

counts = {k: mystring.count(k) for k in key_words}

Putting that together with a csv.DictWriter results in:

import csv

mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']

counts = {k: mystring.count(k) for k in key_words}
print(counts) # {'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}

# write out
with open('out.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=counts, delimiter=' ')
    # key_words
    writer.writeheader()
    # counts
    writer.writerow(counts)

# out.csv:
# loss debt debts elephant
# 1 2 1 0
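To get closer to the layout in the question (one row per file, with a Filename column), the same idea could be extended roughly as follows. The function name write_counts, the in-memory sample texts, and the lower-casing step are assumptions made for illustration:

```python
import csv
import io

def write_counts(texts, key_words, out):
    """Write one row of substring counts per (name, text) pair to `out`."""
    writer = csv.DictWriter(out, fieldnames=['Filename'] + list(key_words))
    writer.writeheader()
    for name, text in texts.items():
        lowered = text.lower()  # assumes the keywords are lower-case
        row = {k: lowered.count(k) for k in key_words}
        row['Filename'] = name
        writer.writerow(row)

# Hypothetical in-memory input; real code would read each file from disk.
texts = {
    'a.txt': 'The lossof our income made us go into debt.',
    'b.txt': 'We like some debts.',
}
buffer = io.StringIO()
write_counts(texts, ['loss', 'debt', 'debts', 'elephant'], buffer)
print(buffer.getvalue())
```

For real files, texts could be filled by looping over glob.glob('*.txt') and reading each file, and out could be a file opened with newline='' as in the example above.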

Severable answered 15/2, 2021 at 13:24 Comment(0)
