What are elegant and effective ways to count the frequency of each "English" word in a file?
First of all, I define a letter_only std::locale, so as to ignore punctuation coming from the stream, and to read only valid "English" letters from the input stream. That way, the stream will treat the words "ways", "ways." and "ways!" as just the same word "ways", because the stream will ignore punctuation like "." and "!".
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only : std::ctype<char>
{
    letter_only() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        // Classify everything as whitespace by default, then mark the
        // letters as alpha. Note: filling the single range 'A'..'z' would
        // also mark the punctuation characters [ \ ] ^ _ ` as letters.
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['Z' + 1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z' + 1], std::ctype_base::alpha);
        return &rc[0];
    }
};
Solution 1
int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    // Imbue the stream with the letter_only locale so that operator>>
    // skips everything except letters.
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::string word;
    while (input >> word)
    {
        ++wordCount[word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
Solution 2
struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string &item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // read only letters!
    input.open("filename.txt");
    std::istream_iterator<std::string> start(input);
    std::istream_iterator<std::string> end;
    // std::for_each returns the Counter by value; its conversion operator
    // then yields the accumulated map.
    std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
"end_of_sentence."
and "end_of_sentence!"
as two distinct words, which is not what the OP wants. –
Froghopper while (input >> word)
? It's still wrong as written since the other flags aren't checked. –
Misreport while (input >> word)
–
Aubrey letter_only
locale. Please check it out! –
Humism locale
is used when reading file, so it makes sense! –
Humism Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files
In a nutshell,
1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.
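For comparison, a minimal C++ sketch of the same count-then-sort step (the input file name file follows step 1 above, and the file is assumed to already be stripped of punctuation and lower-cased):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Order by descending count, then ascending word.
bool byFreqThenWord(const std::pair<std::string, int> &a,
                    const std::pair<std::string, int> &b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

int main()
{
    std::ifstream input("file");
    std::map<std::string, int> counts;
    std::string word;
    while (input >> word)
        ++counts[word];

    // std::map cannot be reordered by value, so copy into a vector to sort.
    std::vector<std::pair<std::string, int> > freq(counts.begin(), counts.end());
    std::sort(freq.begin(), freq.end(), byFreqThenWord);

    for (std::vector<std::pair<std::string, int> >::iterator it = freq.begin();
         it != freq.end(); ++it)
        std::cout << it->second << '\t' << it->first << '\n';
}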
Here is a working solution. This should work with real text (including punctuation):
#include <iterator>
#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <cctype>
std::string getNextToken(std::istream &in)
{
    std::string ans;
    int c = in.get(); // int, so end-of-file can be detected via the stream state
    // skip everything that is not a letter
    while (in && !std::isalpha(static_cast<unsigned char>(c)))
    {
        c = in.get();
    }
    // accumulate letters, lower-cased
    while (in && std::isalpha(static_cast<unsigned char>(c)))
    {
        ans.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
        c = in.get();
    }
    return ans;
}
int main()
{
    std::map<std::string, int> words;
    std::ifstream fin("input.txt");
    std::string s;
    // getNextToken returns "" once the input is exhausted
    while (!(s = getNextToken(fin)).empty())
        ++words[s];
    for (std::map<std::string, int>::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout << iter->first << ' ' << iter->second << std::endl;
}
Edit: the code now calls tolower for every letter.
My solution is the following one. First, all punctuation symbols are converted to spaces. Then basically the same approach shown earlier is used to extract words:
#include <fstream>
#include <map>
#include <sstream>
#include <string>

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
typedef std::map<std::string, unsigned int> WCCollection;

void countWords(const std::string &fileName, WCCollection &wcc)
{
    std::ifstream input(fileName.c_str());

    if (input.is_open()) {
        std::string line;
        std::string word;

        while (std::getline(input, line)) {
            // Substitute punctuation symbols with spaces
            // (requires a mutable iterator, not a const_iterator)
            for (std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if (Symbols.find(*it) != std::string::npos) {
                    *it = ' ';
                }
            }
            // Let std::operator>> separate by spaces
            std::istringstream filter(line);
            while (filter >> word) {
                ++wcc[word];
            }
        }
    }
}
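A minimal caller for countWords (the file name input.txt and the printing loop are my additions):

#include <iostream>

int main()
{
    WCCollection wcc;
    countWords("input.txt", wcc);
    for (WCCollection::const_iterator it = wcc.begin(); it != wcc.end(); ++it)
        std::cout << it->first << ' ' << it->second << '\n';
}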
Pseudocode for an algorithm which I believe to be close to what you want (written here as runnable Python; file is an open file object):

from collections import defaultdict

counts = defaultdict(int)
for line in file:
    for word in line.split():
        # keep only tokens containing at least one letter
        if any(x.isalpha() for x in word):
            counts[word.upper()] += 1

# sort by count, descending (ties broken by word)
freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
    print("%d\t%s" % (count, word))
Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).
In short, natural language parsing is hard: you can probably make do with some approximation, depending on your actual use case.
1. Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.
2. Create a set of test cases so you can be sure you get all the decisions in step 1 correct.
3. Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.
4. Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.
5. Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input (a sketch of steps 3-5 follows).
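A minimal sketch of steps 3-5, under one possible answer to step 1 (a word is a maximal run of ASCII letters and apostrophes, compared case-insensitively); the tokenizer here is a hand-rolled state machine rather than the <istream> route:

#include <cctype>
#include <iostream>
#include <map>
#include <string>

// Step 3: tokenizer returning the next word in standard (lower-case)
// form; an empty string signals end of input.
std::string nextWord(std::istream &in)
{
    std::string w;
    int c = in.get();
    // skip anything that cannot start a word
    while (in && !std::isalpha(static_cast<unsigned char>(c)))
        c = in.get();
    // accumulate letters and apostrophes, lower-cased
    while (in && (std::isalpha(static_cast<unsigned char>(c)) || c == '\''))
    {
        w.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
        c = in.get();
    }
    return w;
}

int main()
{
    std::map<std::string, unsigned> histogram; // step 4
    for (std::string w = nextWord(std::cin); !w.empty(); w = nextWord(std::cin))
        ++histogram[w];                        // step 5
    for (std::map<std::string, unsigned>::iterator it = histogram.begin();
         it != histogram.end(); ++it)
        std::cout << it->second << ' ' << it->first << '\n';
}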
One more simple way is to count the number of words by counting the spaces in the file, taking care not to count a run of more than one consecutive space as several separators, if you assume only single spaces between words (a sketch follows)...
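A minimal sketch of that idea, counting word starts instead of raw spaces so that runs of whitespace are not double-counted; note this yields a total word count, not per-word frequencies (the file name filename.txt is an assumption):

#include <cctype>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream input("filename.txt");
    long words = 0;
    bool inWord = false;
    char c;
    while (input.get(c))
    {
        if (!std::isspace(static_cast<unsigned char>(c)))
        {
            if (!inWord) // a new word starts at this character
                ++words;
            inWord = true;
        }
        else
        {
            inWord = false;
        }
    }
    std::cout << words << " words\n";
}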
#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

string mostCommon( string filename ) {
    ifstream input( filename );
    string mostFreqUsedWord;
    string token;
    map< string, int > wordFreq;

    if ( input.is_open() ) {
        while ( input >> token ) {
            wordFreq[ token ]++;
            if ( wordFreq[ token ] > wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
        input.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return mostFreqUsedWord;
}
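A minimal caller (the file name input.txt is an assumption):

int main() {
    cout << "Most frequent word: " << mostCommon( "input.txt" ) << endl;
}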
Comments:
- …[A-Za-z]+? What about hyphenated words or otherwise punctuated words? – Misreport
- …the wc utility. Why reinvent the wheel? – Rackrent
- …can't, and The cat's toy. – Rackrent