How can I count the number of words in a directory recursively?
I'm trying to calculate the number of words written in a project. There are a few levels of folders and lots of text files within them.

Can anyone help me find out a quick way to do this?

bash or vim would be good!

Thanks

Immune answered 22/2, 2016 at 17:9 Comment(2)
How do you decide whether a file is a text file? Common extension? – Cello
Possible duplicate of How to count all the lines of code in a directory recursively? – Berga
14

Use find to scan the directory tree; wc will do the rest:

$ find path -type f | xargs wc -w | tail -1

The last line gives the total.
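
If filenames may contain spaces, a null-delimited variant of the same pipeline avoids word-splitting (a sketch, assuming your find and xargs support -print0 and -0, as the GNU and BSD versions do):

$ find path -type f -print0 | xargs -0 wc -w | tail -1

The same caveat applies as above: if xargs has to invoke wc more than once, tail -1 only reports the last batch's total (see the ARG_MAX discussion in a later answer).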

Heteropterous answered 22/2, 2016 at 17:20 Comment(1)
Why wouldn't wc just support a -r switch, though? – Tanagra
5

tl;dr:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc

Explanation:

The find . -type f -exec wc -w {} + will run wc -w on all the files (recursively) contained by . (the current working directory). find will execute wc as few times as possible, but as many times as necessary to stay within ARG_MAX, the system's command-line length limit. When the number of files (and/or the combined length of their paths) exceeds ARG_MAX, find invokes wc -w more than once, giving multiple total lines:

$ find . -type f -exec wc -w {} + | awk '/total/{print $0}'
  8264577 total
  654892 total
 1109527 total
 149522 total
 174922 total
 181897 total
 1229726 total
 2305504 total
 1196390 total
 5509702 total
  9886665 total
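
The limit itself varies by system and can be inspected with getconf (a POSIX utility):

$ getconf ARG_MAX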

Isolate these partial sums by printing only the first whitespace-delimited field of each total line:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}'
8264577
654892
1109527
149522
174922
181897
1229726
2305504
1196390
5509702
9886665

Use paste to join the partial sums with a + delimiter, giving an infix summation:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+
8264577+654892+1109527+149522+174922+181897+1229726+2305504+1196390+5509702+9886665

Evaluate the infix summation using bc, which supports both infix expressions and arbitrary precision:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
30663324
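
The paste and bc stages can also be folded into awk itself. A minimal equivalent, as a sketch that assumes no filename itself ends in " total":

$ find . -type f -exec wc -w {} + | awk '/ total$/ { sum += $1 } END { print sum }'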


Saffren answered 30/12, 2017 at 19:24 Comment(0)
4

You could find all the files, print their contents, and pipe the result to wc:

find path -type f -exec cat {} \; -exec echo \; | wc -w

Note: the -exec echo \; is needed in case a file doesn't end with a newline character; without it, the last word of one file and the first word of the next would run together.

Or you could run wc -w on each file that find locates and use awk to aggregate the counts:

find . -type f -exec wc -w {} \; | awk '{ sum += $1 } END { print sum }'
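
A note on the design choice: with {} \; find spawns one wc process per file, which is slow for large trees. The {} + form batches many files per invocation, but then wc also emits intermediate total lines, which must be excluded from the sum. A sketch, assuming no filename ends in " total":

find . -type f -exec wc -w {} + | awk '!/ total$/ { sum += $1 } END { print sum }'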
Skelton answered 22/2, 2016 at 17:15 Comment(0)
3

If there's one thing I've learned from all the questions on SO, it's that a filename with a space will mess you up. This script will work even if you have whitespace in the file names.

#!/usr/bin/env bash

# ** matches recursively (requires bash >= 4); nullglob makes the pattern
# expand to nothing, rather than itself, when no .txt files match
shopt -s globstar nullglob

count=0
for f in **/*.txt
do
    # reading from stdin makes wc print just the number, with no filename
    words=$(wc -w < "$f")
    count=$((count + words))
done
echo "$count"
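
An alternative sketch that also copes with newlines in filenames and doesn't depend on bash 4's globstar (it assumes your find supports -print0, as GNU and BSD find do):

#!/usr/bin/env bash

count=0
# read null-delimited paths from find, so any legal filename is handled
while IFS= read -r -d '' f; do
    words=$(wc -w < "$f")
    count=$((count + words))
done < <(find . -type f -name '*.txt' -print0)
echo "$count"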
Berga answered 22/2, 2016 at 17:47 Comment(0)
0

Assuming you don't need to count the words recursively and that you want to include all the files in the current directory, you can use a simple approach such as:

wc -w *


10  000292_0
500 000297_0
510 total

If you want to count the words only for files with a specific extension in the current directory, you could try:

cat *.txt | wc -w
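
If you do need recursion for a specific extension, a sketch in the spirit of the find-based answers above (the earlier caveat about files lacking a trailing newline applies here too):

find . -type f -name '*.txt' -exec cat {} + | wc -w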
Judd answered 20/10, 2018 at 0:51 Comment(2)
This answer does not handle multiple subdirectories (i.e., no recursion), and it assumes every file in the folder is a text file. – Kolivas
While this code may solve the question, including an explanation of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanation, and give an indication of what limitations and assumptions apply. – Commissionaire
