Long story short: in 1986 an interviewer asked Donald Knuth to write a program that takes a text and a number N as input and lists the N most frequently used words, sorted by frequency. Knuth produced a 10-page Pascal program, to which Douglas McIlroy replied with the following six-line shell script:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
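
For concreteness, the script reads the text on standard input and takes N as its first positional argument (the ${1} in the last line), so, assuming it were saved under a hypothetical name such as wordfreq.sh, it could be run like this:

sh wordfreq.sh 10 < book.txt

which would print the 10 most frequent words of book.txt, each preceded by its count.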
Read the full story at http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/ .
Of course they had very different goals: Knuth was demonstrating his concept of literate programming and built everything from scratch, while McIlroy used a few common UNIX utilities to achieve the shortest source code.
My question is: how bad is that?
(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that 6 lines of code is easier to understand/maintain than 10 pages, literate programming or not.)
I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the common words, but what's wrong with tr -sc A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. And about sort | uniq -c, is that a terribly slow way to determine the frequencies?
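
For what it's worth, sort | uniq -c determines the frequencies by sorting every token (O(n log n) comparisons) so that equal words become adjacent. A common alternative is to count in a single pass with an associative array and only sort the unique words at the end; a sketch of that variant (not McIlroy's script, just one possible rewrite of the counting step):

tr -cs A-Za-z '\n' |
tr A-Z a-z |
awk '
  # one hash-table update per token; skip the empty token that
  # tr can emit when the input starts with a non-letter
  $0 != "" { count[$0]++ }
  # print "count word" pairs, one line per distinct word
  END { for (w in count) print count[w], w }
' |
sort -rn |
sed ${1}q

Here the final sort -rn only has to order the distinct words, which for natural-language text is a small fraction of the token stream.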
A few considerations:
- tr should be linear time (?)
- sort I'm not sure of, but I'm assuming it's not that bad
- uniq should be linear time too
- spawning processes should be linear time (in the number of processes)
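
A rough way to sanity-check these assumptions is to time the whole pipeline on inputs of increasing size and see whether the runtime grows much faster than the input does; a sketch, assuming the script is saved as the hypothetical wordfreq.sh and book.txt is some sample text:

# build inputs of 1x, 2x and 4x the size of the sample text
cat book.txt > in1.txt
cat book.txt book.txt > in2.txt
cat book.txt book.txt book.txt book.txt > in4.txt

# if the sort stages dominated with strongly super-linear cost,
# these timings would grow noticeably faster than the input size
for f in in1.txt in2.txt in4.txt; do
  echo "$f"
  time sh wordfreq.sh 10 < "$f" > /dev/null
done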
One comment points out that the real issue is memory rather than speed: sort loads everything. A well-written solution, such as the Python program in the answer, reads the data in a streaming fashion, and only holds the table of unique individual words in memory, which is (for texts in human languages) much smaller than the whole input data. – Monstrous
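
To make that memory point observable, one can compare the peak resident memory of the sort-based counting step with the hash-based one; a rough sketch, assuming GNU time is installed as /usr/bin/time (its -v flag reports "Maximum resident set size") and book.txt is the sample input used above:

# tokenized, lowercased word stream, materialized once and reused
tr -cs A-Za-z '\n' < book.txt | tr A-Z a-z > words.txt

# sort has to hold (or spill to temporary files) every token
# before uniq -c can start counting
/usr/bin/time -v sort words.txt | uniq -c > /dev/null

# the awk table only grows with the number of distinct words
/usr/bin/time -v awk '{ count[$0]++ } END { for (w in count) print count[w], w }' words.txt > /dev/null

The time reports go to stderr, so the > /dev/null only discards the counting output itself.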