Concordance of text

I have been reading the Linux cookbook to get the hang of it. I am fairly new to Linux.

I came across a topic called concordance of text. I understand what it is, but I am not able to come up with a sequence of commands using tr, sort and uniq (that's what the cookbook says) that would generate the concordance.

Can someone tell me how to create a basic concordance, i.e. just sort and display the frequency of each unique word?

The idea presented in the cookbook is to use tr to translate all spaces into newline characters so that each word ends up on its own line; that output is then passed to sort, and then to uniq with the -c flag to count the unique terms.

I am not able to figure out the correct parameters, though. Can someone show the command and explain what each parameter does?

I have googled for this, but I could not find a clear answer to my problem.

Any help is much appreciated!

Griskin answered 29/1, 2012 at 21:10 Comment(0)
tr ' ' '\n' <input | sort | uniq -c
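Here, tr ' ' '\n' replaces every space with a newline so that each word lands on its own line, sort groups identical words together, and uniq -c collapses each group into a single line prefixed with its count. For example, on a small hypothetical sample file:

printf 'the cat sat\nthe cat ran\n' > sample.txt
tr ' ' '\n' < sample.txt | sort | uniq -c
#   2 cat
#   1 ran
#   1 sat
#   2 the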

If I understand your comment correctly, you want the total of all words over all files in a directory. You can do that like this:

find mydir -type f -exec cat {} + | tr ' ' '\n' | sort | uniq -c

find recursively searches mydir for anything matching its arguments: -type f tells it to keep only regular files (as opposed to directories and a few other types you shouldn't have to worry about yet), and -exec cat {} + then runs cat, giving it all the file names as arguments. cat concatenates files, printing all their contents as if they were one big file. That output then goes through the same tr/sort/uniq pipeline to actually calculate the concordance.
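On the follow-up question in the comments: if you wanted a separate list per file rather than one grand total, a minimal sketch (assuming file names contain no newlines) could loop over find's output instead of cat-ing everything together:

find mydir -type f | while IFS= read -r f; do
    echo "== $f =="                       # arbitrary per-file header
    tr ' ' '\n' < "$f" | sort | uniq -c   # concordance of this file only
done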

Tuscan answered 29/1, 2012 at 21:19 Comment(5)
I've got to run now (literally) but I'll explain when I get back if no one else does. Meanwhile, read the man pages.Tuscan
Thanks a lot. That works. I'll try to de-construct and understand it.Griskin
Okay, I got it. One question though: can we extend this to make a concordance of several files in a single directory? One way to go about this is to store the output of "ls" in a file, then, for each line of that file (each line being a filename), run the above command and append that file's concordance to "result", and finally run a concordance over "result" again. This works, but is there a simpler, more elegant way to accomplish it?Griskin
Do you mean you want the sum over all files, or a separate list for each file?Tuscan
Yes. Your post was really helpful. Thanks a lot again! :)Griskin

There are many ways to do this, but here is my solution. It uses different commands than the ones you mention but, through the use of sed and a final sort, it may produce more desirable output.

find . -type f -print0 | xargs -0 cat | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g' | sort | uniq -c | sort -n

find . -type f -print0 recursively searches all the folders and files from your current directory downwards. -type f returns only regular files. -print0 terminates each file name with the special \0 character so that names containing spaces don't confuse the next command in the pipe.

xargs takes input and turns it into arguments for a command, in this case cat. cat will print the contents of all files given to it as arguments. The -0 tells xargs that its input is delimited by the special \0 character, not by spaces.
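To see why the \0 pairing matters, here is a small illustration with hypothetical names (-n1 makes xargs run the command once per argument, so you can see where each name ends):

printf 'one two.txt\0three.txt\0' | xargs -0 -n1 echo
# one two.txt
# three.txt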

sed is a pattern-matching stream editor. The first sed command substitutes (s) every character matching the [[:punct:]] class with nothing, i.e. it deletes all punctuation. The g flag makes it replace every match on each line, not just the first.

The second sed command turns every run of one or more whitespace characters (\s+) into a newline (\n), again replacing every match on the line (g). The -r flag enables extended regular expressions, which is what lets + work unescaped; \s and -r are GNU sed extensions.
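For example, assuming GNU sed (which the -r flag and \s already imply), the two sed stages transform a line like this:

echo 'Hello, world!  Hello...' | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g'
# Hello
# world
# Hello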

sort organizes the words alphabetically.

uniq -c collapses adjacent duplicate lines into one while counting how many there were; this is why the list must be sorted first.

sort -n sorts this output numerically, yielding a list of words ordered by frequency (least frequent first).
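Putting it all together on a tiny hypothetical input, the most frequent words end up at the bottom of the list:

printf 'the cat, the hat\n' | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g' | sort | uniq -c | sort -n
#   1 cat
#   1 hat
#   2 the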

sed and xargs are very powerful commands, especially when used in conjunction. But, as another poster has noted, find has almost unbridled power as well. tr is useful, but it is more specialized than sed.

Dunleavy answered 11/3, 2012 at 17:6 Comment(1)
This is great! Thanks for the improved functionality (over the other answer).Oxidation
