Counting word frequency and making a dictionary from it [duplicate]
Asked Answered
H

13

10

I want to take every word from a text file, and count the word frequency in a dictionary.

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...} 

I am not that far, but I just can't see how to complete it. My code so far:

import sys

argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
    #there should be a counter and somehow it must fill the dict.
Harlem answered 18/2, 2014 at 11:15 Comment(1)
Start here: docs.python.org/2/library/collections.html#counter-objects. You will also need to split your input, to get individual words, and remove any punctuation, see: docs.python.org/2/library/stdtypes.html#string-methodsRetool
N
16

If you don't want to use collections.Counter, you can write your own function:

import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0 
    wordfreq[word] += 1

for finer things, look at regular expressions.

Nitriding answered 18/2, 2014 at 11:34 Comment(3)
If the OP wants his word frequency to be case-insensitive, he should convert the case to either upper (or lower)Steamer
if word not in wordfreq: wordfreq[word] = 1Concur
@Pranav, I set it to 0 because it is incremented on next lineNitriding
S
13

Although using Counter from the collections library as suggested by @Michael is a better approach, I am adding this answer just to improve your code. (I believe this will be a good answer for a new Python learner.)

From the comment in your code it seems like you want to improve your code. And I think you are able to read the file content in words (while usually I avoid using read() function and use for line in file_descriptor: kind of code).

As words is a string, in for loop, for i in words: the loop-variable i is not a word but a char. You are iterating over chars in the string instead of iterating over words in the string words. To understand this, notice following code snippet:

>>> for i in "Hi, h r u?":
...  print i
... 
H
i
,
 
h
 
r
 
u
?
>>> 

Because iterating over the given string char by chars instead of word by words is not what you wanted to achieve, to iterate words by words you should use the split method/function from string class in Python.
str.split(str="", num=string.count(str)) method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.

Notice the code examples below:

Split:

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']

loop with split:

>>> for i in "Hi, how are you?".split():
...  print i
... 
Hi,
how
are
you?

And it looks like something you need. Except for word Hi, because split(), by default, splits by whitespaces so Hi, is kept as a single string (and obviously) you don't want that.

To count the frequency of words in the file, one good solution is to use regex. But first, to keep the answer simple I will be using replace() method. The method str.replace(old, new[, max]) returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.

Now check code example below to see what I suggested:

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split

loop:

>>> for word in "Hi, how are you?".replace(',', ' ').split():
...  print word
... 
Hi
how
are
you?

Now, how to count frequency:

One way is use Counter as @Michael suggested, but to use your approach in which you want to start from empty an dict. Do something like this code sample below:

words = f.read()
wordfreq = {}
for word in .replace(', ',' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    #                ^^ add 1 to 0 or old value from dict 

What am I doing? Because initially wordfreq is empty you can't assign it to wordfreq[word] for the first time (it will raise key exception error). So I used setdefault dict method.

dict.setdefault(key, default=None) is similar to get(), but will set dict[key]=default if key is not already in dict. So for the first time when a new word comes, I set it with 0 in dict using setdefault then add 1 and assign to the same dict.

I have written an equivalent code using with open instead of single open.

with open('~/Desktop/file') as f:
    words = f.read()
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq

That runs like this:

$ cat file  # file is 
this is the textfile, and it is used to take words and count
$ python work.py  # indented manually 
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 
 'it': 1, 'to': 1, 'take': 1, 'words': 1, 
 'the': 1, 'textfile': 1}

Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop: for i in re.split(r"[,\s]+", words):, that should produce the correct output.

Edit: better to find all alphanumeric character because you may have more than one punctuation symbols.

>>> re.findall(r'[\w]+', words) # manually indent output  
['this', 'is', 'the', 'textfile', 'and', 
  'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']

use for loop as: for word in re.findall(r'[\w]+', words):

How would I write code without using read():

File is:

$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.

Code is:

$ cat work.py
import re
wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1
  
print wordfreq

Used lower() to convert an upper letter to lower letter.

output:

$python work.py  # manually strip output  
{'and': 3, 'letters': 1, 'text': 1, 'is': 3, 
 'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1, 
 'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1, 
 'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1, 
 'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2, 
 'lines': 1, 'can': 1, 'the': 1}
Sacker answered 18/2, 2014 at 12:25 Comment(1)
your elaboration of how all this fits together is dope, this should be the accepted answer.Minorca
P
11
from collections import Counter
t = 'this is the textfile, and it is used to take words and count'

dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Or better with removing punctuation before counting:

dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
Penumbra answered 18/2, 2014 at 11:29 Comment(0)
M
2

The following takes the string, splits it into a list with split(), for loops the list and counts the frequency of each item in the sentence with Python's count function count (). The words,i, and its frequency are placed as tuples in an empty list, ls, and then converted into key and value pairs with dict().

sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []  
for i in sentence:

    word_count = sentence.count(i)  # Pythons count function, count()
    ls.append((i,word_count))       


dict_ = dict(ls)

print dict_

output; {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Mumble answered 18/2, 2014 at 11:37 Comment(1)
This is quite inefficient, as it iterates over the whole string again for every single word, instead of a single pass. The complexity would be O(n²) instead of O(n).Penumbra
U
1
sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words.
# iterate thorugh every word

counter_dict = {}
for word in sentence.lower().split():
# add the word into the counter_dict initalize with 0
  if word not in counter_dict:
    counter_dict[word] = 0
# increase its count by 1   
  counter_dict[word] =+ 1
Unselfish answered 27/12, 2017 at 11:3 Comment(0)
V
1
#open your text book,Counting word frequency
File_obj=open("Counter.txt",'r')
w_list=File_obj.read()
print(w_list.split())
di=dict()
for word in w_list.split():


    if word in di:
        di[word]=di[word] + 1

    else:
        di[word]=1



max_count=max(di.values())
largest=-1
maxusedword=''
for k,v in di.items():
    print(k,v)
    if v>largest:
        largest=v
        maxusedword=k

print(maxusedword,largest)
Verlinevermeer answered 5/4, 2018 at 10:50 Comment(0)
A
1

you can also use default dictionaries with int type.

 from collections import defaultdict
 wordDict = defaultdict(int)
 text = 'this is the textfile, and it is used to take words and count'.split(" ")
 for word in text:
    wordDict[word]+=1

explanation: we initialize a default dictionary whose values are of the type int. This way the default value for any key will be 0 and we don't need to check if a key is present in the dictionary or not. we then split the text with the spaces into a list of words. then we iterate through the list and increment the count of the word's count.

Auditor answered 29/6, 2019 at 0:51 Comment(0)
P
1
wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: word not in the dict, give it a value of 1. if key already present, +1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)
Planksheer answered 29/5, 2020 at 5:44 Comment(0)
M
0

My approach is to do few things from ground:

  1. Remove punctuations from the text input.
  2. Make list of words.
  3. Remove empty strings.
  4. Iterate through list.
  5. Make each new word a key into Dictionary with value 1.
  6. If a word is already exist as key then increment it's value by one.

text = '''this is the textfile, and it is used to take words and count'''
word = '' #This will hold each word

wordList = [] #This will be collection of words
for ch in text: #traversing through the text character by character
#if character is between a-z or A-Z or 0-9 then it's valid character and add to word string..
    if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'): 
        word += ch
    elif ch == ' ': #if character is equal to single space means it's a separator
        wordList.append(word) # append the word in list
        word = '' #empty the word to collect the next word
wordList.append(word)  #the last word to append in list as loop ended before adding it to list
print(wordList)

wordCountDict = {} #empty dictionary which will hold the word count
for word in wordList: #traverse through the word list
    if wordCountDict.get(word.lower(), 0) == 0: #if word doesn't exist then make an entry into dic with value 1
        wordCountDict[word.lower()] = 1
    else: #if word exist then increment the value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)

Another approach:

text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')
wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)
Molybdate answered 14/7, 2018 at 18:23 Comment(0)
I
0

One more function:

def wcount(filename):
    counts = dict()
    with open(filename) as file:
        a = file.read().split()
        # words = [b.rstrip() for b in a]
    for word in a:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts
Iniquitous answered 7/7, 2020 at 9:21 Comment(0)
R
0

def play_with_words(input):

input_split = input.split(",")
input_split.sort()
count = {}
for i in input_split:
    if i in count:
        count[i] += 1
    else:
        count[i] = 1

return count

input ="i,am,here,where,u,are"

print(play_with_words(input))

Reis answered 9/1, 2021 at 11:28 Comment(0)
T
0
Write a Python program to create a list of strings by taking input from the user and then create  a dictionary containing each string along with their frequencies. (e.g. if the list is [‘apple’,  ‘banana’, ‘fig’, ‘apple’, ‘fig’, ‘banana’, ‘grapes’, ‘fig’, ‘grapes’, ‘apple’] then output should be  {'apple': 3, 'banana': 2, 'fig': 3, 'grapes': 2}.  

lst = []
d = dict()
print("ENTER ZERO NUMBER FOR EXIT !!!!!!!!!!!!")
while True:
    user = input('enter string element :: -- ')
    if user == "0":
        break
    else:
        lst.append(user)
print("LIST ELEMENR ARE :: ",lst)
l = len(lst)
for i in range(l) :
    c = 0
    for j in range(l) :
        if lst[i] == lst[j ]:
            c += 1
    d[lst[i]] = c
print("dictionary is  :: ",d)
Tabbi answered 14/4, 2021 at 7:5 Comment(0)
S
0

You can also go with this approach. But you need to store the text file's content in a variable as a string first after reading the file. In this way, You don't need to use or import any external libraries.

s = "this is the textfile, and it is used to take words and count"

s = s.split(" ")
d = dict()
for i in s:
  c = ""
  if i.isalpha() == True: 
    if i not in d:
      d[i] = 1
    else:
      d[i] += 1
  else:
    for j in i:
      l = len(j)
      if j.isalpha() == True:
        c+=j    
    if c not in d:
      d[c] = 1
    else:
      d[c] += 1


print(d)

Result:

enter image description here

Seattle answered 23/2, 2022 at 13:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.