Elegant ways to count the frequency of words in a file

What are elegant and effective ways to count the frequency of each "English" word in a file?

Lawyer answered 3/2, 2011 at 16:40 Comment(8)
Define "word." Do you mean "English words" or "uninterrupted sequences of alphabetic characters" or "uninterrupted sequences of characters," or something else?Misreport
for what purpose - just for fun?Agrippina
Again, what does ""english"" mean? Actual English words or sequences matching [A-Za-z]+? What about hyphenated words or otherwise punctuated words?Misreport
Repeating? So "he went to the the store" is counted as 1 repetition, or as 5 unique words of which one has a count of 2?Sumerlin
If you are on a *nix platform, try the wc utility. Why reinvent the wheel?Rackrent
Do contractions and possessive words count? For example, can't, and The cat's toy.Rackrent
Do the letter sequences have to be valid English words? For example, a is a valid word, but t is not.Rackrent
@Thomas: But if capitalized, it's a valid English name, e.g. Mr. T ;-)Misreport

First of all, I define a letter_only std::ctype facet and imbue the stream with a locale that uses it, so that the stream ignores punctuation and reads only valid English letters from the input. That way, the stream treats the words "ways", "ways." and "ways!" as the same word "ways", because it classifies punctuation such as "." and "!" as whitespace and skips it.

#include &lt;algorithm&gt;
#include &lt;fstream&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;locale&gt;
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

struct letter_only: std::ctype&lt;char&gt;
{
    letter_only(): std::ctype&lt;char&gt;(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector&lt;std::ctype_base::mask&gt;
            rc(std::ctype&lt;char&gt;::table_size, std::ctype_base::space);

        // Mark only the two letter ranges as alpha. A single fill
        // from 'A' to 'z' would also classify '[', '\', ']', '^',
        // '_' and '`' (which sit between 'Z' and 'a') as letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

Solution 1

int main()
{
    std::map&lt;std::string, int&gt; wordCount;
    std::ifstream input;
    // imbue() before open(): imbuing an already-open file may fail.
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::string word;
    while (input &gt;&gt; word)
    {
        ++wordCount[word];
    }
    for (std::map&lt;std::string, int&gt;::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout &lt;&lt; it-&gt;first &lt;&lt; " : " &lt;&lt; it-&gt;second &lt;&lt; std::endl;
    }
}

Solution 2

struct Counter
{
    std::map&lt;std::string, int&gt; wordCount;
    void operator()(const std::string &amp; item) { ++wordCount[item]; }
    operator std::map&lt;std::string, int&gt;() { return wordCount; }
};

int main()
{
    std::ifstream input;
    // imbue() before open(): imbuing an already-open file may fail.
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::istream_iterator&lt;std::string&gt; start(input);
    std::istream_iterator&lt;std::string&gt; end;
    // std::for_each returns the Counter; its conversion operator
    // yields the accumulated map.
    std::map&lt;std::string, int&gt; wordCount = std::for_each(start, end, Counter());
    for (std::map&lt;std::string, int&gt;::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout &lt;&lt; it-&gt;first &lt;&lt; " : " &lt;&lt; it-&gt;second &lt;&lt; std::endl;
    }
}
Humism answered 3/2, 2011 at 16:42 Comment(18)
However, the answer made it also clear that "non-whitespace character sequences delimited by whitespace" is not the definition of "word" the OP is after.Froghopper
I think this is the right answer since he want the frequency of repeated words.Radferd
The input loop in the first solution is wrong. The eof flag is set after an input operation that fails due to reaching eof.Misreport
Again, this is not the right answer. The OP is not asking for whitespace-delimited words. This would count "end_of_sentence." and "end_of_sentence!" as two distinct words, which is not what the OP wants.Froghopper
@Nawaz for a fully correct answer you probably need to strip punctuation from the tokens.Kolnos
@Nawaz: Why not just use the idiomatic while (input >> word)? It's still wrong as written since the other flags aren't checked.Misreport
That first loop is ugly. Though you have correct usage, doing it like that is error-prone (for beginners); we should be encouraging newbies to put the read into the while condition: while (input >> word)Aubrey
While you are cleaning up, why not wrap it all in main() so that a beginner can just cut and paste the code and get a working program.Aubrey
@James: I did that. I also reworked the locale. Please check it out, and let me know if you find anything wrong with the solution!Humism
@sbi, and @Ashot : I fixed that by defining letter_only locale. Please check it out!Humism
@Martin: Please check out the complete solution now. :-)Humism
@Nawaz: imbue() on a file that is already opened silently fails. You need to imbue the file object before you actually open the file (i.e. before providing the filename).Aubrey
@Martin: It doesn't fail. I checked it. locale is used when reading file, so it makes sense!Humism
@Nawaz: Let me re-phrase. It will not work on all implementations. You need to declare the object, imbue(), then open(filename). This is done because some locales use characters from the stream to define how they work (e.g. the BOM marker in UTF-8/16 files can be extracted by the locale object to define how it works). As a result the usual action of imbue is to fail if the file is already open.Aubrey
@Martin: That is interesting. Can you explain why? Any reference for that? I didn't get anything of that sort!Humism
@Nawaz: Just stepping through the code. In g++ look at bits/fstream.tcc basic_filebuf::imbue() (the rules are a bit more complex, but it always works if the file has NOT been opened; it may fail if the file has been opened, depending on what the old locale did to the file when it was opened).Aubrey
@Martin: thanks a lot for explaining that. I didn't know that. So I made those changes, as you suggested. Let me know if there is any potential problem. :-)Humism
@Nawaz I found a bug in your code. You need to call tolower on each word before adding it to the map (see my solution).Kolnos

Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.

Zarah answered 14/10, 2015 at 1:11 Comment(0)

Here is a working solution. It should work with real text (including punctuation):

#include &lt;iterator&gt;
#include &lt;iostream&gt;
#include &lt;fstream&gt;
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;cctype&gt;

std::string getNextToken(std::istream &amp;in)
{
    std::string ans;
    int c = in.get();                        // int, so EOF is representable
    while (c != EOF &amp;&amp; !std::isalpha(c))     // skip non-letter characters
    {
        c = in.get();
    }
    while (c != EOF &amp;&amp; std::isalpha(c))
    {
        ans.push_back(static_cast&lt;char&gt;(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map&lt;std::string, int&gt; words;
    std::ifstream fin("input.txt");

    std::string s;
    while (!(s = getNextToken(fin)).empty())
        ++words[s];

    for (std::map&lt;std::string, int&gt;::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout &lt;&lt; iter-&gt;first &lt;&lt; ' ' &lt;&lt; iter-&gt;second &lt;&lt; std::endl;
}

Edit: The code now calls tolower for every letter.

Kolnos answered 3/2, 2011 at 17:27 Comment(2)
This undoubtedly works for English (which is what the OP asked, I know), but not for other languages. It also won't work if there is a number in the input text.Valued
@Valued The question asks for "english" words. Also, isalpha doesn't return true for digits.Kolnos

My solution is the following. First, all punctuation symbols are converted to spaces. Then, essentially the same approach provided here before is used to extract the words:

#include &lt;fstream&gt;
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&amp;&lt;&gt;+-*/=#'";
typedef std::map&lt;std::string, unsigned int&gt; WCCollection;

void countWords(const std::string &amp;fileName, WCCollection &amp;wcc)
{
    std::ifstream input( fileName.c_str() );

    if ( input.is_open() ) {
        std::string line;
        std::string word;

        while( std::getline( input, line ) ) {
            // Substitute punctuation symbols with spaces
            // (a mutable iterator is needed to assign through it)
            for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if ( Symbols.find( *it ) != std::string::npos ) {
                    *it = ' ';
                }
            }

            // Let std::operator&gt;&gt; separate by spaces
            std::istringstream filter( line );
            while( filter &gt;&gt; word ) {
                ++wcc[word];
            }
        }
    }
}
Valued answered 3/2, 2011 at 17:34 Comment(1)
I've improved the algorithm and fixed minor bugs.Valued

Pseudocode for an algorithm which I believe to be close to what you want:

from collections import defaultdict

counts = defaultdict(int)
for line in file:                       # 'file' is any open text file
  for word in line.split():
    if any(x.isalpha() for x in word):
      counts[word.upper()] += 1

freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
  print("%d\t%s" % (count, word))

Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).

In short, natural language parsing is hard: you probably can make do with some approximation depending on your actual use case.

Sumerlin answered 3/2, 2011 at 17:21 Comment(1)
A funny way to answer a C++ question: Providing Python code and then declaring it to be pseudocode. Considering, this uses types from the Python stdlib without importing it, and comprehensions, and that any C++ folks reading this have to guess a lot, I'm surprised this got an upvote. Maybe this is a secret experiment to see how many C++ programmers can be silently & unknowingly converted to Python enthusiasts?Hornbeck
  1. Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.

  2. Create a set of test cases so you can be sure you get all the decisions in step 1 correct.

  3. Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.

  4. Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.

  5. Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.

Spoliation answered 28/8, 2013 at 16:35 Comment(0)

One more simple way is to count the spaces in the file, counting a run of consecutive spaces as a single separator, if you assume only single spaces appear between words...

Trovillion answered 3/2, 2011 at 16:45 Comment(0)
string mostCommon( string filename ) {

    ifstream input( filename );
    string mostFreqUsedWord;
    string token;
    map&lt; string, int &gt; wordFreq;

    if ( input.is_open() ) {
        while ( input &gt;&gt; token ) {
            ++wordFreq[ token ];
            // Note: looking up mostFreqUsedWord while it is still ""
            // inserts a dummy entry with count 0, which is harmless here.
            if ( wordFreq[ token ] &gt; wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
    } else {
        cout &lt;&lt; "Unable to open file." &lt;&lt; endl;
    }
    return mostFreqUsedWord;
}
Hetti answered 5/1, 2018 at 14:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.