Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

Asked 24/6, 2009 at 13:12 Answered 24/6, 2009 at 13:13

Solved code-golf text-parsing language-agnostic rosetta-stone

I've already worked out a solution for myself in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also be interested in seeing how quickly this could be done in any other major language (mostly C#, Java, etc.).

Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array

Extra Credit

Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
Where 'too good to be true' would be the actual statement
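
One possible way to approach the quoted-statement requirement, sketched in Python under stated assumptions: the `tokenize` helper and the `TOKEN` pattern are hypothetical names, not taken from any answer in this thread, and the rule that a quote only opens a phrase when it does not directly follow a word character is a heuristic.

```python
import re

# Hypothetical tokenizer: quoted runs become one token, everything else is split
# into words. The lookarounds keep the apostrophe in "John's" from opening a phrase.
TOKEN = re.compile(
    r"(?<!\w)'([^']+)'(?!\w)"  # 'single-quoted phrase'
    r'|"([^"]+)"'              # "double-quoted phrase"
    r"|(\w+)(?:'\w+)?"         # plain word; a suffix such as 's is dropped
)

def tokenize(text):
    """Return the words of text, keeping each quoted phrase as a single token."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in TOKEN.finditer(text)
    ]

print(tokenize("They were 'too good to be true' apparently"))
```

For the example sentence this yields four tokens, with 'too good to be true' kept as one. Nested or unbalanced quotes are not handled.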

Extra-Extra Credit

Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
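
A rough sketch of one way to attack this, in Python: count adjacent word pairs (bigrams) and keep the pairs that recur. Everything here is an assumption for illustration, including the `frequent_pairs` name, the tiny `STOP` list, and the threshold of two occurrences. Serious collocation detection would use a statistical measure rather than raw counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list. Without it, pairs such as
# "the fruit" would score as high as "fruit fly".
STOP = {"the", "a", "is", "in", "on", "to", "and", "has", "be", "it",
        "of", "when", "will", "but", "our", "may"}

def frequent_pairs(text, min_count=2):
    """Return (phrase, count) for adjacent word pairs seen at least min_count times."""
    pairs = Counter()
    # Split into sentences first so a pair never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOP and second not in STOP:
                pairs[(first, second)] += 1
    return [(" ".join(pair), n) for pair, n in pairs.most_common() if n >= min_count]

sample = ("The fruit fly is a great thing when it comes to medical research. "
          "Much study has been done on the fruit fly in the past, and has led "
          "to many breakthroughs. In the future, the fruit fly will continue "
          "to be studied, but our methods may change.")
print(frequent_pairs(sample))
```

On the sample paragraph the only pair that survives the threshold is "fruit fly", seen three times.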

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

It would be great to see the output of your code, in addition to how long the operation took.

Letty answered 24/6, 2009 at 13:13 Comment(4)

I see you've removed the bit about quoted statements. Should the quotes then be ignored, and what about the words inside them? – Scepter 24/6, 2009 at 13:25

I removed the quotes since I hadn't actually accomplished that for myself. If you want to give it a shot, I encourage it :) – Aboulia 24/6, 2009 at 13:27

Us NLP guys do this constantly. It's called cleaning data, building a language model, and then decoding the language model (perhaps using a perplexity metric). For clean initial data, sure, shell scripts. – Goerke 6/7, 2010 at 22:8

I see, it must be flag all code golf questions as spam or offensive day. – Riess 8/7, 2010 at 11:20

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

With occurrence greater than X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Return only words with a length greater than Y (put Y+1 dots in second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):

sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: it is already like an array for shell: first column is count, second is word.

Cuthburt answered 24/6, 2009 at 13:13 Comment(2)

Absolutely amazing! About as short as the Perl solution yet completely readable. +1 for doing it completely outside the box. – Cosentino 24/6, 2009 at 21:3

Sometimes when I don't care about any rules, I just do os.popen(r'this kind of shell code') instead of doing everything in python. It is much faster to write for one-shot scripts... and the performance is still pretty good. – Cuthburt 24/6, 2009 at 22:36

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

Here is an example of its use:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

If you need to list only the lowercase versions, it requires two more characters.

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

For it to work on the specified text requires 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'

real    0m0.679s
user    0m0.304s
sys     0m0.084s

Here is the last example expanded a bit.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}

Disciplinarian answered 24/6, 2009 at 13:13 Comment(3)

Good lord, man! I'm curious about a break-down explanation if you have time in the future ;) – Aboulia 24/6, 2009 at 19:19

I didn't really read the whole question before writing this example, that's why the first example is so incomplete. – Disciplinarian 24/6, 2009 at 19:33

I used YAML, because it's fairly easy to read, and it helped shorten the code by not having to pretty-print the data myself. – Disciplinarian 2/7, 2009 at 16:3

F#: 304 chars

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)

Anitaanitra answered 24/6, 2009 at 13:13 Comment(5)

Awesome. I don't think I've seen a working block of F# to this date :) Glad to finally get acquainted. – Aboulia 24/6, 2009 at 13:40

Nice solution. I was going to try one in F# myself, as I knew the Seq module functions would help (in particular countBy), but this is great. – Scepter 24/6, 2009 at 13:42

Slight problem is that punctuation does mess up some words (i.e. commas, periods as last chars of words). – Scepter 24/6, 2009 at 13:48

Noldorin: Fixed to handle punctuation :) – Anitaanitra 24/6, 2009 at 14:0

Ah, I missed the Regex.Split method somehow. That would be a good way of getting around quotes too. Anyway, deserves the up-vote. :) – Scepter 24/6, 2009 at 14:36

Ruby

When "minified", this implementation is 165 characters long. It uses Array#inject with a starting value (a Hash object with a default of 0) and loops through the matched words, tallying each one in the hash; the result is then filtered by the minimum frequency.

Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.

Apostrophes and dashes are kept rather than stripped: they modify the word, so they cannot simply be removed without also discarding everything after the symbol.

Implementation

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

Test Rig

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

Test Results

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times

Actinomorphic answered 24/6, 2009 at 13:13 Comment(0)

#! perl
use strict;
use warnings;

my %words;  # word => number of occurrences

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  print "$word occurred $words{$word} times.\n";
}

That's the simple form. If you want sorting, filtering, etc.:

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    print "$word occurred $words{$word} times.\n";
  }
}

You can also sort the output pretty easily:

...
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    push @output, "$word occurred $words{$word} times.\n";
  }
}
my $re = qr/occurred (\d+) /;
print sort {
  # extract the counts into lexicals; assigning to $a/$b would clobber @output
  my ($x) = $a =~ $re;
  my ($y) = $b =~ $re;
  $x <=> $y
} @output;

A true Perl hacker will easily get these on one or two lines each, but I went for readability.

Edit: this is how I would rewrite this last example

...
for my $word (
  sort { $words{$b} <=> $words{$a} } keys %words  # highest count first
){
  next unless length($word) >= $MINLEN;
  last unless $words{$word} >= $MIN_OCCURRENCE;  # all remaining counts are lower

  print "$word occurred $words{$word} times.\n";
}

Or if I needed it to run faster I might even write it like this:

for my $word_data (
  sort {
    $a->[1] <=> $b->[1] # numerical sort on count
  } grep {
    # remove values that are out of bounds
    length($_->[0]) >= $MINLEN &&      # word length
    $_->[1] >= $MIN_OCCURRENCE # count
  } map {
    # [ word, count ]
    [ $_, $words{$_} ]
  } keys %words
){
  my( $word, $count ) = @$word_data;
  print "$word occurred $count times.\n";
}

It uses map for efficiency, grep to remove extra elements, and sort to do the sorting, of course (it does so in that order).

This is a slight variant of the Schwartzian transform.
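
For readers who do not speak Perl, the same decorate, filter, sort chain can be sketched in Python. This is only an illustrative analogue; the `words_by_count` name and its parameters are assumptions, and Python's `sorted(key=...)` already performs the decoration step internally, so the explicit pairs are there purely to mirror the Perl.

```python
def words_by_count(counts, min_len, min_occurrence):
    """Decorate, filter, sort: the same three steps as a map/grep/sort chain."""
    # map: build (word, count) pairs so each count is looked up only once
    pairs = [(word, counts[word]) for word in counts]
    # grep: drop pairs that are out of bounds
    kept = [(w, n) for w, n in pairs if len(w) >= min_len and n >= min_occurrence]
    # sort: numerical sort on the count, lowest first
    return sorted(kept, key=lambda pair: pair[1])

example = {"statement": 7, "code": 4, "new": 3, "a": 9, "function": 5}
for word, count in words_by_count(example, 4, 4):
    print(f"{word} occurred {count} times.")
```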

Providing answered 24/6, 2009 at 13:13 Comment(6)

Makes me want to toy around with Perl a bit! – Aboulia 24/6, 2009 at 14:19

Since this isn't really trying to be as small as possible, I decided to make it a bit more maintainable. Which only increased the length by a few percent. – Disciplinarian 24/6, 2009 at 18:43

@Brad, if only you'd fix the bugs too, like the one in the last line. :-/ – Providing 25/6, 2009 at 13:43

I added my own examples for sorting, and filtering. – Disciplinarian 28/6, 2009 at 18:43

Brad, why didn't you add your own answer instead of editing Alex's? – Alagoas 1/7, 2009 at 10:48

I did post my own answer, and Alex asked me to fix some of the bugs. There were so many changes I would do that I decided to just leave his code alone, and add my own sorting code. – Disciplinarian 2/7, 2009 at 16:11

C# 3.0 (with LINQ)

Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

This is, however, far from the most efficient method: it is O(n^2) in the number of words rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.
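
To make the complexity point concrete, here is a minimal single-pass sketch in Python (not the C# follow-up the author mentions). Counting with a hash map visits each word once, so the work grows linearly with the number of words. The `get_keywords` name mirrors the C# method above, and `COMMON_WORDS` copies its list; both are otherwise assumptions for illustration.

```python
import re
from collections import Counter

COMMON_WORDS = {"and", "is", "the", "as", "of", "to", "or", "in",
                "for", "by", "an", "be", "may", "has", "can", "its"}

def get_keywords(text, min_count, min_length):
    """Count every word in one pass, then filter the (much smaller) table."""
    # Strip the same punctuation as the C# version, then split on whitespace.
    words = re.sub(r"[,.?\\/;:()]", "", text.lower()).split()
    counts = Counter(words)  # one hash lookup per word: O(n) overall
    return {
        word: n
        for word, n in counts.items()
        if n >= min_count and len(word) >= min_length and word not in COMMON_WORDS
    }

print(get_keywords("The code is the code; the function, the code.", 2, 2))
```

The quadratic cost in the LINQ version comes from calling Count over the whole word list once per distinct word; the counter above avoids that rescan.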

Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

And my test program:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}

Scepter answered 24/6, 2009 at 13:13 Comment(3)

Thanks. :) I'll see if I can make a few changes for the sake of efficiency though, and also to get the "extra credit". – Scepter 24/6, 2009 at 13:36

I'll also try to get some statistics from the sample text. – Scepter 24/6, 2009 at 13:37

@Kamarey: Ah yes, I missed out brackets from the punctuation list. Well spotted. – Scepter 24/6, 2009 at 21:38

Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python (134 chars) that computes the whole thing in a single expression.

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

A much longer version with plenty of comments for your reading pleasure:

# Minimum occurrence count (x) and minimum word length (y); both are exclusive bounds.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A function that groups runs of equal strings in a sorted list into
# (string, grouper) pairs. Grouper is an iterator over the occurrences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace
# to create a sorted list of words.
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups equal words together.
# Since grouper is an iterator over the occurrences, we use len(list(grouper))
# to get the word count by first converting the iterator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurrences, length and common words using
# another generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print result

The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.

Results:

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}

Rameriz answered 24/6, 2009 at 13:13 Comment(3)

I'm so impressed with these languages. Thanks for the contribution! Nice work! – Aboulia 24/6, 2009 at 23:46

You cannot safely assume that 'may' should not be indexed. – Leonleona 16/3, 2010 at 1:41

you only refer to gb once(after importing), you could change from itertools import groupby as gb to from itertools import* and then replace the other gb with groupby – Dicotyledon 16/3, 2010 at 1:44

C# code:

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]+\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

Output for ProcessText(text, 3, 2) call:

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators

Ogpu answered 24/6, 2009 at 13:13 Comment(0)

Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

output :

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types

Evacuate answered 24/6, 2009 at 13:13 Comment(1)

I love seeing these tiny solutions - impresses me how amazing programming can be :) – Aboulia 24/6, 2009 at 23:42

REBOL

Verbose, perhaps, so definitely not a winner, but gets the job done.

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, it's good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

The output is:

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing

Syne answered 24/6, 2009 at 13:13 Comment(1)

I think you also need the count in the output... or am I missing something? – Panpipe 24/6, 2009 at 19:50

In C#:

Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.

Soni answered 24/6, 2009 at 13:13 Comment(1)

You want them all together? As seen with the less than optimal solution below, you can easily do that :) – Soni 24/6, 2009 at 13:33

This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).

In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.

You might also want to look at Lingua::CollinsParser.

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

Output:

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1

Drinker answered 24/6, 2009 at 13:13 Comment(0)

Here is my variant, in PHP:

$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");

$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';

foreach ($array as $word) { // words longer than n (=5)
//    if(strlen($word) > 5)echo $word.'<br/>';
}

And output:

statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1

The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.

For the supplied file, this code finishes in 0.047 seconds, though a larger file will consume lots of memory (because of the file function).
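
The memory concern comes from loading the whole file at once. As a hedged illustration in Python rather than PHP, counting can be done one line at a time so that only the current line and the running tallies are held in memory. The `count_words` name is hypothetical, and this sketch ignores the quoted-phrase handling above, which would need extra care if a quote could span lines.

```python
from collections import Counter

def count_words(lines):
    """Tally whitespace-separated words from any iterable of lines."""
    counts = Counter()
    for line in lines:  # a file object yields one line at a time
        counts.update(line.split())
    return counts

# Usage sketch (assumes a local copy of the sample text named c.txt):
#     with open("c.txt") as handle:
#         counts = count_words(handle)
print(count_words(["statement code\n", "code\n"]))
```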

Pilot answered 24/6, 2009 at 13:13 Comment(2)

Looks like you treated "The" and "the" (and other 'ignore-words') as different terms. I see a couple 'ignore-words' in your output. :) – Aboulia 24/6, 2009 at 18:10

Yeah :) Well, we can always "lowercase" all words, but it'll ruin case in names and quotes. Proper stemming would be a good idea, though it's a bit harder to implement. – Pilot 25/6, 2009 at 9:10
