WordCount: how inefficient is McIlroy's solution?

Long story short: in 1986 Jon Bentley asked Donald Knuth to write a program that takes a text and a number N as input and lists the N most frequently used words, sorted by frequency. Knuth produced a 10-page Pascal program, to which Douglas McIlroy replied with the following six-line shell script:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

Read the full story at http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/ .

Of course they had very different goals: Knuth was demonstrating his concept of literate programming and built everything from scratch, while McIlroy composed a few common UNIX utilities to get the shortest possible source code.

My question is: how bad is that?
(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that 6 lines of code is easier to understand/maintain than 10 pages, literate programming or not.)

I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the most common words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. And about sort | uniq -c, is that a terribly slow way to determine the frequencies?

A few considerations:

  • tr should be linear time (?)
  • sort I'm not sure of, but I'm assuming it's not that bad (see the sketch after this list)
  • uniq should be linear time too
  • spawning processes should be linear time (in the number of processes)
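
To make the list above concrete, here is a minimal sketch (my illustration, not part of the original question) of the two counting strategies: one mirrors sort | uniq -c by sorting all T words first, the other counts with a hash table in a single pass. The words list is a made-up sample input.

import itertools
from collections import Counter

words = ["the", "quick", "the", "fox", "the", "quick"]  # hypothetical sample input

# Strategy 1: mirrors `sort | uniq -c` -- sort all T words (O(T log T)),
# then count runs of equal adjacent words in one linear pass.
sorted_counts = [(w, sum(1 for _ in g)) for w, g in itertools.groupby(sorted(words))]

# Strategy 2: hash-based counting -- a single O(T) expected-time pass,
# holding only the Q unique words in memory rather than all T.
hash_counts = Counter(words)

print(sorted_counts)  # [('fox', 1), ('quick', 2), ('the', 3)]
print(hash_counts)    # Counter({'the': 3, 'quick': 2, 'fox': 1})
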
Occasionally asked 21/9, 2014 at 9:36 Comment(2)
A problem with McIlroy's solution is that it requires all the data to fit in memory (the first sort loads everything). A well-written solution, such as the Python program in the answer, reads the data in a streaming fashion and only holds the table of unique words in memory, which is (for texts in human languages) much smaller than the whole input.Monstrous
"A well-written solution, such as the python program in the answer, reads the data in a streaming fashion..." McIlroy's solution is fully "well-written" if one is never processing huge texts! Context, user, context!Evasive

The Unix pipeline consists of a few linear operations and two sorts, so it runs in O(n log n) overall.

For Knuth's approach of keeping only the top N, see selection algorithms: http://en.wikipedia.org/wiki/Selection_algorithm. They offer several trade-offs in time and space complexity, and in theory they can be faster for typical inputs with a large number of distinct words.

So Knuth could be faster, not least because the English dictionary has a limited size: that bound can turn the log(n) factor into a (large) constant, though possibly at the cost of a lot of memory.
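
As a rough sketch of that idea (my illustration, not Knuth's actual Pascal program): count word frequencies in a hash table, then select only the top N with a bounded heap, so the unique words never need a full sort. The top_n_words helper below is hypothetical.

import heapq
import re
from collections import Counter

def top_n_words(text, n):
    # Hash-based counting: roughly linear in the number of words,
    # with the table size bounded by the vocabulary.
    counts = Counter(re.findall('[a-z]+', text.lower()))
    # Selection step: a heap of size n extracts the top n entries
    # in O(Q log n) time instead of sorting all Q unique words.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

print(top_n_words("the quick brown fox jumps over the lazy dog the fox", 2))
# [('the', 3), ('fox', 2)]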

But maybe this question is better suited for https://cstheory.stackexchange.com/

Colorant answered 21/9, 2014 at 10:8 Comment(0)

Doug McIlroy's solution has time complexity O(T log T), where T is the total number of words. This is due to the first sort.

For comparison, here are four faster solutions to the same problem:

Here is a C++ implementation (prefix tree + heap) with an upper-bound time complexity of O((T + N) log N), though in practice it is nearly linear, close to O(T + N log N).
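
The C++ source itself is not reproduced here; the following is only a rough Python sketch of the same "prefix tree + heap" idea (my illustration, not the actual implementation): store counts in a character trie, then select the top N entries with a bounded heap.

import heapq
import re

def top_n_with_trie(lines, n=10):
    # Minimal trie: each node is a dict mapping a character to a child node;
    # the special key '$' stores the count of the word that ends at this node.
    root = {}
    word_re = re.compile('[a-z]+')
    for line in lines:
        for word in word_re.findall(line.lower()):
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node['$'] = node.get('$', 0) + 1

    # Depth-first walk yields a (count, word) pair for every stored word.
    def walk(node, prefix):
        if '$' in node:
            yield node['$'], prefix
        for ch, child in node.items():
            if ch != '$':
                yield from walk(child, prefix + ch)

    # A bounded heap keeps only the top n entries (O(Q log n) selection).
    return heapq.nlargest(n, walk(root, ''))

print(top_n_with_trie(["the quick fox", "the lazy dog", "the fox"], 3))
# [(3, 'the'), (2, 'fox'), (1, 'quick')]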

Below is a fast Python implementation. Internally, it uses a hash dictionary (collections.Counter) and a heap, with time complexity O(T + N log Q), where Q is the number of unique words:

import collections, re, sys

filename = sys.argv[1]
k = int(sys.argv[2]) if len(sys.argv) > 2 else 10
reg = re.compile('[a-z]+')

# Hash-based counting: a single streaming pass over the file.
counts = collections.Counter()
for line in open(filename):
    counts.update(reg.findall(line.lower()))

# most_common(k) selects the top k entries with a heap.
for word, count in counts.most_common(k):
    print(word, count)

And another Unix shell solution using AWK. It has time complexity O(T + Q log Q):

awk -v FS="[^a-zA-Z]+" '
{
    for (i = 1; i <= NF; i++)
        if ($i != "")              # skip the empty field a leading separator produces
            freq[tolower($i)]++;
}
END {
    for (word in freq)
        print(freq[word] " " word)
}
' | sort -rn | head -10

Here is an extremely fast solution in Rust by Anders Kaseorg.

CPU time comparison (in seconds):

Implementation               bible32    bible256   Asymptotic complexity
Rust (prefix tree)           0.632      5.284      O(?)
C++ (prefix tree + heap)     4.838      38.587     O((T + N) log N)
Python (Counter)             14.366     115.855    O(T + N log Q)
AWK + sort                   21.548     176.411    O(T + Q log Q)
McIlroy (tr + sort + uniq)   60.531     690.906    O(T log T)

Notes:

  • T >= Q, typically Q >> N (N is a small constant)
  • bible32 is the Bible concatenated 32 times (135 MB); bible256 is the same text concatenated 256 times (1.1 GB)
  • AWK timing might greatly vary depending on which version of AWK you are using (gawk, nawk or mawk)

As you can see, McIlroy's solution runs roughly 100 times slower than the fastest known program! However, his solution is still very elegant, easy to debug and, after all, it is not that terrible in performance either, unless you start using it for gigabyte files. Bad implementations of more complex algorithms in C/C++ or Haskell could easily run much slower than his pipeline (I've witnessed that).

Adamski answered 9/7, 2019 at 17:59 Comment(0)
