Elegant ways to count the frequency of words in a file

What are elegant and effective ways to count the frequency of each "English" word in a file?

Lawyer answered 3/2, 2011 at 16:40 Comment(8)
Define "word." Do you mean "English words" or "uninterrupted sequences of alphabetic characters" or "uninterrupted sequences of characters," or something else?Misreport
for what purpose - just for fun?Agrippina
Again, what does ""english"" mean? Actual English words or sequences matching [A-Za-z]+? What about hyphenated words or otherwise punctuated words?Misreport
Repeating? So "he went to the the store" is counted as 1 repetition, or as 5 unique words of which one has a count of 2?Sumerlin
If you are on a *nix platform, try the wc utility. Why reinvent the wheel?Rackrent
Do contractions and possessive words count? For example, can't, and The cat's toy.Rackrent
Do the letter sequences have to be valid English words? For example, a is a valid word, but t is not.Rackrent
@Thomas: But if capitalized, it's a valid English name, e.g. Mr. T ;-)Misreport

First of all, I define a letter_only std::ctype facet and imbue the stream with a locale that uses it, so that the stream ignores punctuation and reads only valid English letters from the input. That way, the stream treats the words "ways", "ways." and "ways!" as the same word "ways", because it classifies punctuation such as "." and "!" as whitespace and skips it.

#include &lt;algorithm&gt;
#include &lt;fstream&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;locale&gt;
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

struct letter_only: std::ctype&lt;char&gt;
{
    letter_only(): std::ctype&lt;char&gt;(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector&lt;std::ctype_base::mask&gt;
            rc(std::ctype&lt;char&gt;::table_size, std::ctype_base::space);

        // Mark only the two letter ranges as alpha. A single fill
        // from 'A' to 'z' would also classify '[', '\', ']', '^',
        // '_' and '`' (which sit between 'Z' and 'a') as letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

Solution 1

int main()
{
    std::map&lt;std::string, int&gt; wordCount;
    std::ifstream input;
    // imbue() before open(): imbuing an already-open file may fail.
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::string word;
    while (input &gt;&gt; word)
    {
        ++wordCount[word];
    }
    for (std::map&lt;std::string, int&gt;::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout &lt;&lt; it-&gt;first &lt;&lt; " : " &lt;&lt; it-&gt;second &lt;&lt; std::endl;
    }
}

Solution 2

struct Counter
{
    std::map&lt;std::string, int&gt; wordCount;
    void operator()(const std::string &amp; item) { ++wordCount[item]; }
    operator std::map&lt;std::string, int&gt;() { return wordCount; }
};

int main()
{
    std::ifstream input;
    // imbue() before open(): imbuing an already-open file may fail.
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::istream_iterator&lt;std::string&gt; start(input);
    std::istream_iterator&lt;std::string&gt; end;
    // std::for_each returns the Counter; its conversion operator
    // yields the accumulated map.
    std::map&lt;std::string, int&gt; wordCount = std::for_each(start, end, Counter());
    for (std::map&lt;std::string, int&gt;::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout &lt;&lt; it-&gt;first &lt;&lt; " : " &lt;&lt; it-&gt;second &lt;&lt; std::endl;
    }
}
Humism answered 3/2, 2011 at 16:42 Comment(18)
However, the answer made it also clear that "non-whitespace character sequences delimited by whitespace" is not the definition of "word" the OP is after.Froghopper
I think this is the right answer since he want the frequency of repeated words.Radferd
The input loop in the first solution is wrong. The eof flag is set after an input operation that fails due to reaching eof.Misreport
Again, this is not the right answer. The OP is not asking for whitespace-delimited words. This would count "end_of_sentence." and "end_of_sentence!" as two distinct words, which is not what the OP wants.Froghopper
@Nawaz for a fully correct answer you probably need to strip punctuation from the tokens.Kolnos
@Nawaz: Why not just use the idiomatic while (input >> word)? It's still wrong as written since the other flags aren't checked.Misreport
That first loop is ugly. Though you have correct usage, doing it like that is error-prone (for beginners); we should be encouraging newbies to put the read into the while condition: while (input >> word)Aubrey
While you are cleaning up, why not wrap it all in main() so that a beginner can just cut and paste the code and get a working program.Aubrey
@James: I did that. I also reworked the locale. Please check it out, and let me know if you find anything wrong with the solution!Humism
@sbi, and @Ashot : I fixed that by defining letter_only locale. Please check it out!Humism
@Martin: Please check out the complete solution now. :-)Humism
@Nawaz: imbue() on a file that is already opened silently fails. You need to imbue the file object before you actually open the file (i.e. before providing the filename).Aubrey
@Martin: It doesn't fail. I checked it. locale is used when reading file, so it makes sense!Humism
@Nawaz: Let me re-phrase. It will not work on all implementations. You need to declare the object, imbue(), then open(filename). This is done because some locales use characters from the stream to define how they work (e.g. the BOM marker in UTF-8/16 files can be extracted by the locale object to define how it works). As a result the usual action of imbue is to fail if the file is already open.Aubrey
@Martin: That is interesting. Can you explain why? Any reference for that? I didn't get anything of that sort!Humism
@Nawaz: Just stepping through the code. In g++ look at bits/fstream.tcc basic_filebuf::imbue() (the rules are a bit more complex, but it always works if the file has NOT been opened; it may fail if the file has been opened, depending on what the old locale did to the file when it was opened).Aubrey
@Martin: thanks a lot for explaining that. I didn't know that. So I made those changes, as you suggested. Let me know if there is any potential problem. :-)Humism
@Nawaz I found a bug in your code. You need to call tolower on each word before adding it to the map (see my solution).Kolnos

Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.

Zarah answered 14/10, 2015 at 1:11 Comment(0)

Here is a working solution. It should work with real text (including punctuation):

#include &lt;iterator&gt;
#include &lt;iostream&gt;
#include &lt;fstream&gt;
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;cctype&gt;

std::string getNextToken(std::istream &amp;in)
{
    std::string ans;
    int c = in.get();                        // int, so EOF is representable
    while (c != EOF &amp;&amp; !std::isalpha(c))     // skip non-letter characters
    {
        c = in.get();
    }
    while (c != EOF &amp;&amp; std::isalpha(c))
    {
        ans.push_back(static_cast&lt;char&gt;(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map&lt;std::string, int&gt; words;
    std::ifstream fin("input.txt");

    std::string s;
    while (!(s = getNextToken(fin)).empty())
        ++words[s];

    for (std::map&lt;std::string, int&gt;::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout &lt;&lt; iter-&gt;first &lt;&lt; ' ' &lt;&lt; iter-&gt;second &lt;&lt; std::endl;
}

Edit: The code now calls tolower for every letter.

Kolnos answered 3/2, 2011 at 17:27 Comment(2)
This undoubtedly works for English (which is what the OP asked, I know), but not for other languages. It also won't work if there is a number in the input text.Valued
@Valued The question asks for "english" words. Also, isalpha doesn't return true for digits.Kolnos

My solution is the following. First, all punctuation symbols are converted to spaces. Then, essentially the same approach provided here before is used to extract the words:

#include &lt;fstream&gt;
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&amp;&lt;&gt;+-*/=#'";
typedef std::map&lt;std::string, unsigned int&gt; WCCollection;

void countWords(const std::string &amp;fileName, WCCollection &amp;wcc)
{
    std::ifstream input( fileName.c_str() );

    if ( input.is_open() ) {
        std::string line;
        std::string word;

        while( std::getline( input, line ) ) {
            // Substitute punctuation symbols with spaces
            // (a mutable iterator is needed to assign through it)
            for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if ( Symbols.find( *it ) != std::string::npos ) {
                    *it = ' ';
                }
            }

            // Let std::operator&gt;&gt; separate by spaces
            std::istringstream filter( line );
            while( filter &gt;&gt; word ) {
                ++wcc[word];
            }
        }
    }
}
Valued answered 3/2, 2011 at 17:34 Comment(1)
I've improved the algorithm and fixed minor bugs.Valued

Pseudocode for an algorithm which I believe to be close to what you want:

from collections import defaultdict

counts = defaultdict(int)
for line in file:                       # 'file' is any open text file
  for word in line.split():
    if any(x.isalpha() for x in word):
      counts[word.upper()] += 1

freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
  print("%d\t%s" % (count, word))

Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).

In short, natural language parsing is hard: you probably can make do with some approximation depending on your actual use case.

Sumerlin answered 3/2, 2011 at 17:21 Comment(1)
A funny way to answer a C++ question: Providing Python code and then declaring it to be pseudocode. Considering, this uses types from the Python stdlib without importing it, and comprehensions, and that any C++ folks reading this have to guess a lot, I'm surprised this got an upvote. Maybe this is a secret experiment to see how many C++ programmers can be silently & unknowingly converted to Python enthusiasts?Hornbeck
  1. Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.

  2. Create a set of test cases so you can be sure you get all the decisions in step 1 correct.

  3. Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.

  4. Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.

  5. Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.

Spoliation answered 28/8, 2013 at 16:35 Comment(0)

One more simple way is to count the spaces in the file, counting a run of consecutive spaces as a single separator, if you assume only single spaces appear between words...

Trovillion answered 3/2, 2011 at 16:45 Comment(0)
string mostCommon( string filename ) {

    ifstream input( filename );
    string mostFreqUsedWord;
    string token;
    map&lt; string, int &gt; wordFreq;

    if ( input.is_open() ) {
        while ( input &gt;&gt; token ) {
            ++wordFreq[ token ];
            // Note: looking up mostFreqUsedWord while it is still ""
            // inserts a dummy entry with count 0, which is harmless here.
            if ( wordFreq[ token ] &gt; wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
    } else {
        cout &lt;&lt; "Unable to open file." &lt;&lt; endl;
    }
    return mostFreqUsedWord;
}
Hetti answered 5/1, 2018 at 14:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.