How can I quickly sum all numbers in a file?
Asked Answered
G

34

257

I have a file which contains several thousand numbers, each on its own line:

34
42
11
6
2
99
...

I'm looking to write a script which will print the sum of all numbers in the file. I've got a solution, but it's not very efficient. (It takes several minutes to run.) I'm looking for a more efficient solution. Any suggestions?

Gig answered 23/4, 2010 at 23:36 Comment(6)
What was your slow solution? Maybe we can help you figure out what was slow about it. :)Blowzed
@brian d foy, I'm too embarrassed to post it. I know why it's slow. It's because I call "cat filename | head -n 1" to get the top number, add it to a running total, and call "cat filename | tail..." to remove the top line for the next iteration... I have a lot to learn about programming!!!Gig
That's...very systematic. Very clear and straight forward, and I love it for all that it is a horrible abomination. Built, I assume, out of the tools that you knew when you started, right?Legault
full duplicate: #451299Johst
@MarkRoberts It must have taken you a long while to work that out. It's a very clever problem-solving technique, and oh so wrong. It looks like a classic case of overthinking. Several of glenn jackman's answers are shell scripting solutions (and two are pure shell that don't use things like awk and bc). These all finished adding a million numbers in less than 10 seconds. Take a look at those and see how it can be done in pure shell.Gotcher
@MarkRoberts 1st place, https://mcmap.net/q/109376/-how-can-i-quickly-sum-all-numbers-in-a-file )))Congius
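The approach described in the comments looks roughly like this (a sketch with a hypothetical numbers.txt, not the asker's actual script). It is slow because every iteration re-reads and rewrites the entire file, so summing n numbers does O(n^2) work:

# Deliberately inefficient reconstruction, for illustration only
sum=0
cp numbers.txt work.txt
while [ -s work.txt ]; do
    top=$(head -n 1 work.txt)       # read the top number
    sum=$((sum + top))              # add it to the running total
    tail -n +2 work.txt > tmp.txt   # drop the top line...
    mv tmp.txt work.txt             # ...by rewriting the whole file
done
echo "$sum"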
B
121

For a Perl one-liner, it's basically the same thing as the awk solution in Ayman Hourieh's answer:

 % perl -nle '$sum += $_ } END { print $sum'

If you're curious what Perl one-liners do, you can deparse them:

 %  perl -MO=Deparse -nle '$sum += $_ } END { print $sum'

The result is a more verbose version of the program, in a form that no one would ever write on their own:

BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    $sum += $_;
}
sub END {
    print $sum;
}
-e syntax OK

Just for giggles, I tried this with a file containing 1,000,000 numbers (in the range 0 - 9,999). On my Mac Pro, it returns virtually instantaneously. That's too bad, because I was hoping mmap would be really fast, but it takes just about the same time:

use 5.010;
use File::Map qw(map_file);

map_file my $map, $ARGV[0];

$sum += $1 while $map =~ m/(\d+)/g;

say $sum;
Blowzed answered 23/4, 2010 at 23:49 Comment(6)
Wow, that shows a deep understanding of what code -nle actually wraps around the string you give it. My initial thought was that you shouldn't post while intoxicated but then I noticed who you were and remembered some of your other Perl answers :-)Peers
-n and -p just put characters around the argument to -e, so you can use those characters for whatever you want. We have a lot of one-liners that do interesting things with that in Effective Perl Programming (which is about to hit the shelves).Blowzed
Nice, what are these non-matching curly braces about?Bitumen
-n adds the while { } loop around your program. If you put } ... { inside, then you have while { } ... { }. Evil? Slightly.Decorative
Big bonus for highlighting the -MO=Deparse option! Even though on a separate topic.Corliss
The cumulative version of your one-liner (a rolling sum, printing the current sum for each line): perl -nle '$sum += $_; print $sum} END {'Viglione
C
464

You can use awk:

awk '{ sum += $1 } END { print sum }' file
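If the numbers sit in a particular column of a delimited file, the same pattern works with an explicit field separator and field number (a sketch, assuming a hypothetical tab-separated file data.tsv with the values in column 2):

awk -F '\t' '{ sum += $2 } END { print sum }' data.tsv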
Cassycast answered 23/4, 2010 at 23:39 Comment(6)
program exceeded: maximum number of field sizes: 32767Blackington
With the -F '\t' option if your fields contain spaces and are separated by tabs.Denouement
Please mark this as the best answer. It also works if you want to sum the first value in each row, inside a TSV (tab-separated value) file.Smetana
If you have big numbers: awk 'BEGIN {OFMT = "%.0f"} { sum += $1 } END { print sum }' filenameHarney
@EthanFurman I actually have a tab delimited file as you explained but not able to make -F '\t' do the magic. Where exactly is the option meant to be inserted? I have it like this awk -F '\t' '{ sum += $0 } END { print sum }' fileHadhramaut
@CYNTHIABlessing: Please ask that as a new question. Thanks!Denouement
C
143

None of the solutions thus far use paste. Here's one:

paste -sd+ filename | bc

If the file contains a trailing blank line, the joined expression ends with a stray + and bc reports a syntax error. Remove the trailing + to fix it:

paste -sd+ filename | sed 's/+$//g' | bc
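Alternatively, you can drop empty lines before joining, which avoids the stray + in the first place (a sketch; grep . matches only non-empty lines):

grep . filename | paste -sd+ | bc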

As an example, calculate Σn where 1<=n<=100000:

$ seq 100000 | paste -sd+ | bc -l
5000050000

(For the curious, seq n would print a sequence of numbers from 1 to n given a positive number n.)

Cooperman answered 7/12, 2013 at 5:27 Comment(2)
seq 100000 | paste -sd+ - | bc -l on Mac OS X Bash shell. And this is by far the sweetest and the unixest solution!Cleocleobulus
@SimoA. I vote that we use the term unixiest in place of unixest because to the sexiest solution is always the unixiest ;)Indonesia
E
100

Just for fun, let's benchmark it:

$ for ((i=0; i<1000000; i++)) ; do echo $RANDOM; done > random_numbers

$ time perl -nle '$sum += $_ } END { print $sum' random_numbers
16379866392

real    0m0.226s
user    0m0.219s
sys     0m0.002s

$ time awk '{ sum += $1 } END { print sum }' random_numbers
16379866392

real    0m0.311s
user    0m0.304s
sys     0m0.005s

$ time { { tr "\n" + < random_numbers ; echo 0; } | bc; }
16379866392

real    0m0.445s
user    0m0.438s
sys     0m0.024s

$ time { s=0;while read l; do s=$((s+$l));done<random_numbers;echo $s; }
16379866392

real    0m9.309s
user    0m8.404s
sys     0m0.887s

$ time { s=0;while read l; do ((s+=l));done<random_numbers;echo $s; }
16379866392

real    0m7.191s
user    0m6.402s
sys     0m0.776s

$ time { sed ':a;N;s/\n/+/;ta' random_numbers|bc; }
^C

real    4m53.413s
user    4m52.584s
sys 0m0.052s

I aborted the sed run after 5 minutes; that script accumulates the entire file in the pattern space and rescans it for every substitution, so its running time grows quadratically with input size.


I've been diving into Lua, and it is speedy:

$ time lua -e 'sum=0; for line in io.lines() do sum=sum+line end; print(sum)' < random_numbers
16388542582.0

real    0m0.362s
user    0m0.313s
sys     0m0.063s

and while I'm updating this, ruby:

$ time ruby -e 'sum = 0; File.foreach(ARGV.shift) {|line| sum+=line.to_i}; puts sum' random_numbers
16388542582

real    0m0.378s
user    0m0.297s
sys     0m0.078s

Heeding Ed Morton's advice (see the comments): using $1

$ time awk '{ sum += $1 } END { print sum }' random_numbers
16388542582

real    0m0.421s
user    0m0.359s
sys     0m0.063s

vs using $0

$ time awk '{ sum += $0 } END { print sum }' random_numbers
16388542582

real    0m0.302s
user    0m0.234s
sys     0m0.063s
Eutrophic answered 22/8, 2013 at 13:46 Comment(4)
+1: For coming up with a bunch of solutions, and benchmarking them.Gotcher
time cat random_numbers|paste -sd+|bc -l real 0m0.317s user 0m0.310s sys 0m0.013sHyperesthesia
that should be just about identical to the tr solution.Eutrophic
Your awk script should execute a bit faster if you use $0 instead of $1 since awk does field splitting (which obviously takes time) if any field is specifically mentioned in the script but doesn't otherwise.Heptane
L
38

Another option is to use jq:

$ seq 10|jq -s add
55

-s (--slurp) reads the input lines into an array.
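Slurping reads the whole input into memory at once; for very large files, the same sum can be computed in a streaming fashion with jq's inputs (a sketch):

$ seq 10 | jq -n 'reduce inputs as $n (0; . + $n)'
55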

Lombard answered 6/12, 2015 at 15:8 Comment(1)
Wonderful solution. I had a tab delimited file where I wanted to sum column 6. Did that with the following command: awk '{ print $6 }' myfile.log | jq -s addTelpherage
T
11

This is straight Bash:

sum=0
while read -r line
do
    (( sum += line ))
done < file
echo $sum
Tamishatamma answered 24/4, 2010 at 1:4 Comment(1)
And it's probably one of the slowest solutions and therefore not so suitable for large amounts of numbers.Prosecute
S
11

I prefer to use GNU datamash for such tasks because it's more succinct and legible than perl or awk. For example

datamash sum 1 < myfile

where 1 denotes the first column of data.
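If the file starts with a header row, GNU datamash can skip it (a sketch, assuming the --header-in option of a reasonably recent version):

datamash --header-in sum 1 < myfile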

Surplusage answered 13/9, 2016 at 10:34 Comment(2)
This does not appear to be a standard component as I do not see it in my Ubuntu installation. Would like to see it benchmarked, though.Isoniazid
It seems by far the fastest of the general-purpose programs to me! For seq 10000000, awk with $0 takes 2.1 sec, python 1.9 sec, perl 1.5 sec, but datamash an amazing 0.9 sec. Only the custom-written C answer was better at 0.8 sec.Fireback
A
9

Raku

say sum lines
~$ raku -e '.say for 0..1000000' > test.in

~$ raku -e 'say sum lines' < test.in
500000500000

The way this works is that lines produces a sequence of strings which are the input lines.
sum takes that sequence, turns each line into a number and adds them together.
All that is left is for say to print out that value followed by a newline. (It could have been print or put, but say is more alliterative.)
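Since lines reads from $*ARGFILES by default, you can also pass the file name as an argument instead of redirecting (a sketch):

~$ raku -e 'say sum lines' test.in
500000500000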

Aspect answered 3/11, 2017 at 18:2 Comment(0)
N
8

I prefer to use R for this:

$ R -e 'sum(scan("filename"))'
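If you'd rather avoid R's interactive startup output, Rscript evaluates and prints the expression directly (a sketch; quiet=TRUE suppresses scan's "Read n items" message):

$ Rscript -e 'sum(scan("filename", quiet=TRUE))'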
Neomineomycin answered 4/1, 2015 at 20:22 Comment(1)
I'm a fan of R for other applications but it's not good for performance in this way. File I/O is a major issue. I've tested passing args to a script which can be sped up using the vroom package. I'll post more details when I've benchmarked some other scripts on the same server.Mohamed
C
7

Here's another one-liner

( echo 0 ; sed 's/$/ +/' foo ; echo p ) | dc

This assumes the numbers are integers. If you need decimals, try

( echo 0 2k ; sed 's/$/ +/' foo ; echo p ) | dc

Adjust 2 to the number of decimals needed.
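To see why this works, here is roughly what the pipeline feeds to dc for a file containing 1, 2 and 3 (a sketch). dc is a stack-based RPN calculator: each "n +" line pushes a number and adds it to the running total, and p prints the result:

0
1 +
2 +
3 +
p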

Colombo answered 26/4, 2010 at 11:34 Comment(0)
F
6
$ perl -MList::Util=sum -le 'print sum <>' nums.txt
Falkirk answered 13/3, 2014 at 12:59 Comment(0)
O
5

C always wins for speed:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    ssize_t read;
    char *line = NULL;
    size_t len = 0;
    double sum = 0.0;

    /* Parenthesize the assignment: without the extra parentheses, != binds
       tighter than =, so `read` would get the comparison result instead of
       the byte count. */
    while ((read = getline(&line, &len, stdin)) != -1) {
        sum += atof(line);
    }
    free(line);  /* getline allocates the buffer; release it */

    printf("%f\n", sum);
    return 0;
}

Timing for 1M numbers (same machine/input as my python answer):

$ gcc sum.c -o sum && time ./sum < numbers 
5003371677.000000
real    0m0.188s
user    0m0.180s
sys     0m0.000s
Ondrea answered 22/8, 2013 at 12:25 Comment(2)
Best answer! Best speed)Congius
Using sum.c and GNU Parallel: seq 1077139031 > 10gb; time parallel --pipepart --block -1 -a 10gb sum | sum = 10 secs or ~100M numbers/sec on a 64 core machine.Longicorn
R
5

More succinct:

# Ruby
ruby -e 'puts open("random_numbers").map(&:to_i).reduce(:+)'

# Python
python -c 'print(sum(int(l) for l in open("random_numbers")))'
Rauch answered 13/9, 2015 at 19:43 Comment(1)
Converting to float seems to be about twice as fast on my system (320 vs 640 ms). time python -c "print(sum([float(s) for s in open('random_numbers','r')]))"Dominicdominica
S
5

I couldn't just pass by... Here's my Haskell one-liner. It's actually quite readable:

sum <$> (read <$>) <$> lines <$> getContents

Unfortunately there's no ghci -e to just run it, so it needs the main function, print and compilation.

main = (sum <$> (read <$>) <$> lines <$> getContents) >>= print

To clarify, we read the entire input (getContents), split it into lines, read each line as a number, and sum. <$> is the fmap operator - we use it instead of the usual function application because all of this happens in IO. read needs an additional fmap because it is applied inside a list.

$ ghc sum.hs
[1 of 1] Compiling Main             ( sum.hs, sum.o )
Linking sum ...
$ ./sum 
1
2
4
^D
7

Here's a strange upgrade to make it work with floats:

main = ((0.0 + ) <$> sum <$> (read <$>) <$> lines <$> getContents) >>= print
$ ./sum 
1.3
2.1
4.2
^D
7.6000000000000005
Sinegold answered 30/10, 2019 at 13:57 Comment(0)
T
4
cat nums | perl -ne '$sum += $_ } { print $sum'

(same as brian d foy's answer, without 'END')

Tetrabrach answered 23/2, 2013 at 16:56 Comment(2)
I like this, but could you explain the curly brackets? It's weird to see } without { and vice versa.Trappist
@Trappist see @brian d foy's answer above with perl -MO=Deparse to see how perl parses the program. or the docs for perlrun: perldoc.perl.org/perlrun.html (search for -n). perl wraps your code with { } if you use -n so it becomes a complete program.Tetrabrach
L
4

Just for fun, let's do it with PDL, Perl's array math engine!

perl -MPDL -E 'say rcols(shift)->sum' datafile

rcols reads columns into a matrix (1D in this case) and sum (surprise) sums all the elements of the matrix.

Lallygag answered 23/2, 2013 at 19:55 Comment(2)
How to fix Can't locate PDL.pm in @INC (you may need to install the PDL module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1)? For fun, of course =)Congius
You have to install PDL first, it isn't a Perl native module.Lallygag
O
3

Here is a solution using python with a generator expression. Tested with a million numbers on my old cruddy laptop.

time python -c "import sys; print sum((float(l) for l in sys.stdin))" < file

real    0m0.619s
user    0m0.512s
sys     0m0.028s
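The snippet above is Python 2 (print statement); the Python 3 equivalent just needs print as a function (a sketch):

time python3 -c "import sys; print(sum(float(l) for l in sys.stdin))" < file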
Ondrea answered 22/8, 2013 at 12:5 Comment(1)
A simple list comprehension with a named function is a nice use-case for map(): map(float, sys.stdin)Passing
S
3

C++ "one-liner":

#include <iostream>
#include <iterator>
#include <numeric>
using namespace std;

int main() {
    cout << accumulate(istream_iterator<int>(cin), istream_iterator<int>(), 0) << endl;
}
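Note that istream_iterator<int> skips whitespace, so this works whether the numbers are separated by newlines or spaces; if the total could overflow int, pass 0LL as the initial value so accumulate sums in long long. To compile and run (a sketch):

$ g++ -O2 sum.cpp -o sum && ./sum < numbers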
Sinegold answered 17/3, 2020 at 1:29 Comment(0)
M
2
sed ':a;N;s/\n/+/;ta' file|bc
Mcclellan answered 24/4, 2010 at 2:32 Comment(0)
M
2

Running R scripts

I've written an R script that takes a file name as an argument and sums the lines.

#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(as.numeric(readLines(file)))

This can be sped up with the "data.table" or "vroom" package as follows:

#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(data.table::fread(file))
#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(vroom::vroom(file))

Benchmarking

Same benchmarking data as @glenn jackman.

for ((i=0; i<1000000; i++)) ; do echo $RANDOM; done > random_numbers

In comparison to the R call above, running R 3.5.0 as a script is comparable to other methods (on the same Linux Debian server).

$ time R -e 'sum(scan("random_numbers"))'  
 0.37s user
 0.04s system
 86% cpu
 0.478 total

R script with readLines

$ time Rscript sum.R random_numbers
  0.53s user
  0.04s system
  84% cpu
  0.679 total

R script with data.table

$ time Rscript sum.R random_numbers     
 0.30s user
 0.05s system
 77% cpu
 0.453 total

R script with vroom

$ time Rscript sum.R random_numbers     
  0.54s user 
  0.11s system
  93% cpu
  0.696 total

Comparison with other languages

For reference, here are some other methods suggested in this thread, run on the same hardware.

Python 2 (2.7.13)

$ time python2 -c "import sys; print sum((float(l) for l in sys.stdin))" < random_numbers 
 0.27s user 0.00s system 89% cpu 0.298 total

Python 3 (3.6.8)

$ time python3 -c "import sys; print(sum((float(l) for l in sys.stdin)))" < random_numbers
0.37s user 0.02s system 98% cpu 0.393 total

Ruby (2.3.3)

$  time ruby -e 'sum = 0; File.foreach(ARGV.shift) {|line| sum+=line.to_i}; puts sum' random_numbers
 0.42s user
 0.03s system
 72% cpu
 0.625 total

Perl (5.24.1)

$ time perl -nle '$sum += $_ } END { print $sum' random_numbers
 0.24s user
 0.01s system
 99% cpu
 0.249 total

Awk (4.1.4)

$ time awk '{ sum += $0 } END { print sum }' random_numbers
 0.26s user
 0.01s system
 99% cpu
 0.265 total
$ time awk '{ sum += $1 } END { print sum }' random_numbers
 0.34s user
 0.01s system
 99% cpu
 0.354 total

C (clang version 3.3; gcc (Debian 6.3.0-18) 6.3.0 )

 $ gcc sum.c -o sum && time ./sum < random_numbers   
 0.10s user
 0.00s system
 96% cpu
 0.108 total

Update with additional languages

Lua (5.3.5)

$ time lua -e 'sum=0; for line in io.lines() do sum=sum+line end; print(sum)' < random_numbers 
 0.30s user 
 0.01s system
 98% cpu
 0.312 total

tr (8.26), timed in bash (the construct is not compatible with zsh)

$time { { tr "\n" + < random_numbers ; echo 0; } | bc; }
real    0m0.494s
user    0m0.488s
sys 0m0.044s

sed (4.4), timed in bash (the construct is not compatible with zsh)

$  time { head -n 10000 random_numbers | sed ':a;N;s/\n/+/;ta' |bc; }
real    0m0.631s
user    0m0.628s
sys     0m0.008s
$  time { head -n 100000 random_numbers | sed ':a;N;s/\n/+/;ta' |bc; }
real    1m2.593s
user    1m2.588s
sys     0m0.012s

note: sed calls seem to work faster on systems with more memory available (note smaller datasets used for benchmarking sed)

Julia (0.5.0)

$ time julia -e 'print(sum(readdlm("random_numbers")))'
 3.00s user 
 1.39s system 
 136% cpu 
 3.204 total
$  time julia -e 'print(sum(readtable("random_numbers")))'
 0.63s user 
 0.96s system 
 248% cpu 
 0.638 total

Notice that as in R, file I/O methods have different performance.

Mohamed answered 24/6, 2019 at 6:25 Comment(0)
W
2

Bash variant

raw=$(cat file)
echo $(( ${raw//$'\n'/+} ))

$ wc -l file
10000 file

$ time ./test
323390

real    0m3,096s
user    0m3,095s
sys     0m0,000s

What is happening here? We read the content of the file into the $raw variable, then build an arithmetic expression from it by replacing every newline with '+'.
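For example, with a hypothetical three-line file: $(cat file) strips the trailing newline, so the substitution yields 1+2+3, which $(( )) then evaluates (a sketch):

$ printf '1\n2\n3\n' > file
$ raw=$(cat file)
$ echo $(( ${raw//$'\n'/+} ))
6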

Walloon answered 3/3, 2020 at 6:31 Comment(0)
P
2

As long as the file contains only integers, I basically translate the file into a bash math expression and evaluate it. It is similar to the bc solution in another answer, but faster. Note the zero at the end of the inner expression: tr leaves a trailing + after the last number, and the appended zero completes the expression. I have tested it with 475,000 lines and it runs in less than a second.

echo $(($(cat filename | tr '\n' '+')0))
Piero answered 23/9, 2023 at 17:13 Comment(1)
There is no "above" or "below"; the answers are sorted according to each visitor's personal preference.Vellicate
M
1

Another for fun

sum=0;for i in $(cat file);do sum=$((sum+$i));done;echo $sum

or another bash only

s=0;while read l; do s=$((s+$l));done<file;echo $s

But the awk solution is probably the best, as it's the most compact.

Milewski answered 25/10, 2012 at 16:54 Comment(0)
R
1

With Ruby (note the puts, without which the sum is computed but never printed):

ruby -e "puts File.read('file.txt').split.inject(0){|mem, obj| mem += obj.to_f}"
Roose answered 11/5, 2014 at 1:12 Comment(1)
Another option (when input is from STDIN) is ruby -e'p readlines.map(&:to_f).reduce(:+)'.Lombard
O
1

In Go:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    sum := int64(0)
    for scanner.Scan() {
        v, err := strconv.ParseInt(scanner.Text(), 10, 64)
        if err != nil {
            fmt.Fprintf(os.Stderr, "Not an integer: '%s'\n", scanner.Text())
            os.Exit(1)
        }
        sum += v
    }
    // Scan returns false on read errors as well as EOF; report which it was.
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(sum)
}
Ondrea answered 3/3, 2020 at 1:12 Comment(2)
What is "64"? "10" I suppose is base?Sinegold
Yes, 10 is the base. 64 is the number of bits, if the resulting int can't be represented with that many bits then an error is returned. See golang.org/pkg/strconv/#ParseIntOndrea
D
0

I don't know if you can get a lot better than this, considering you need to read through the whole file.

$sum = 0;
while(<>){
   $sum += $_;
}
print $sum;
Disaccord answered 23/4, 2010 at 23:38 Comment(5)
Very readable. For perl. But yeah, it's going to have to be something like that...Legault
$_ is the default variable. The line input operator, <>, puts its result in there by default when you use <> in while.Blowzed
@Mark, $_ is the topic variable--it works like the 'it'. In this case <> assigns each line to it. It gets used in a number of places to reduce code clutter and help with writing one-liners. The script says "Set the sum to 0, read each line and add it to the sum, then print the sum."Appellee
@Stefan, with warnings and strictures off, you can skip declaring and initializing $sum. Since this is so simple, you can even use a statement modifier while: $sum += $_ while <>; print $sum;Appellee
for the rest of us who can't easily tell, how about you indicate which language this is in? PHP? Perl?Insincerity
I
0

I have not tested this but it should work:

cat f | tr "\n" "+" | sed 's/+$/\n/' | bc

You might have to add "\n" to the string before bc (like via echo) if bc doesn't treat EOF as EOL...

Inhabited answered 24/4, 2010 at 2:3 Comment(2)
It doesn't work. bc issues a syntax error because of the trailing "+" and lack of newline at the end. This will work and it eliminates a useless use of cat: { tr "\n" "+" | sed 's/+$/\n/'| bc; } < numbers2.txt or <numbers2.txt tr "\n" "+" | sed 's/+$/\n/'| bcTamishatamma
tr "\n" "+" <file | sed 's/+$/\n/' | bcMcclellan
R
0

Here's another:

open(FIL, "a.txt");

my $sum = 0;
foreach( <FIL> ) {chomp; $sum += $_;}

close(FIL);

print "Sum = $sum\n";
Rayon answered 20/2, 2013 at 6:24 Comment(0)
W
0

You can do it with Alacon - command-line utility for Alasql database.

It works with Node.js, so you need to install Node.js and then the Alasql package.

To calculate sum from TXT file you can use the following command:

> node alacon "SELECT VALUE SUM([0]) FROM TXT('mydata.txt')"
Waine answered 21/12, 2014 at 16:45 Comment(0)
M
0

Is it not easier to replace all newlines with +, append a 0, and send it to the Ruby interpreter?

(sed -e "s/$/+/" file; echo 0)|irb

If you do not have irb, you can send it to bc, but then you have to remove all newlines except the last one (the one from echo). It is better to use tr for this, unless you have a PhD in sed.

(sed -e "s/$/+/" file|tr -d "\n"; echo 0)|bc
Montserrat answered 15/4, 2018 at 12:58 Comment(0)
D
0

In shell, using awk, I have used the script below to do so:

#!/bin/bash

total=0

for i in $( awk '{ print $1; }' <myfile> )
do
    total=$(echo $total+$i | bc)
    ((count++))
done
echo "scale=2; $total" | bc
Douglas answered 29/3, 2020 at 7:48 Comment(0)
P
0

One in tcl:

#!/usr/bin/env tclsh
set sum 0
while {[gets stdin num] >= 0} { incr sum $num }
puts $sum
Planogamete answered 24/4, 2020 at 7:41 Comment(0)
D
0

GNU Parallel can presumably be used to improve many of the above answers by spreading the workload across multiple cores.

In the example below we send chunks of 500 numbers (--max-lines=500) to bc processes which are executed in parallel 4 at a time (-j 4). The results are then aggregated by a final bc.

time parallel --max-lines=500 -j 4 --pipe "paste -sd+ - | bc" < random_numbers | paste -sd+ - | bc

The optimal choice of work size and number of parallel processes depends on the machine and problem. Note that this solution only really shines when there's a large number of parallel processes with substantial work each.

Dominicdominica answered 16/5, 2020 at 20:2 Comment(0)
W
0

UPDATE:

GNU Parallel benchmarks over a pre-made file, using --pipe-part:

(parallel --pipe-part --argfile "${DT}/temptestpipepartinput.txt" | gpaste )  

  1. Exactly like the command above: 61.57s user 76.92s system 424% cpu 32.609 total
  2. -j 2: 27.883 total
  3. -j 4: 21.850 total
  4. -j 6: 21.221 total <- minimum (didn't check 5 or 7)
  5. -j 8: 25.133 total
  6. -j 10: 30.734 total
  7. -j 12: 36.279 total

Using the pre-made file

  • mawk1.9.9.6 :: 6.953 secs using its own file I/O, and 7.128 secs piped-in.

  • perl 5.36.1 :: 8.786 secs using its own file I/O, and 8.925 secs piped in.

  • python3.11.5 :: here's the strange beast: apparently summing via int(_) instead of float(_) incurs a 17.98% slowdown penalty:

8.468 secs

  python3 -c 'import sys; print(int(sum((float(_) for _ in sys.stdin))))'

9.991 secs

  python3 -c 'import sys; print(int(sum((  int(_) for _ in sys.stdin))))' 

Side note: this set of integers created a file with perfect digit uniformity when it came to stats from gnu-wc:

99,999,999 888,888,888 888,888,888

A perfect chain of eight 9s for row count, and chain of nine 8s for byte count. The digits-only count after backing out all the \n(ewlines):

788,888,889 


In awk, printing a second column with the cumulative sum takes far less syntax than saving the total for the end:

jot 20 61111111889 - 799973766543 | 

mawk '$2=_+=$1'        # skips rows with zero(0) as its value
gawk '($2=_+=$1)_'     # no rows left behind

61111111889 61111111889
861084878432 922195990321
1661058644975 2583254635296
2461032411518 5044287046814

3261006178061 8305293224875
4060979944604 12366273169479
4860953711147 17227226880626
5660927477690 22888154358316

6460901244233 29349055602549
7260875010776 36609930613325
8060848777319 44670779390644
8860822543862 53531601934506

9660796310405 63192398244911
10460770076948 73653168321859
11260743843491 84913912165350
12060717610034 96974629775384

12860691376577 109835321151961
13660665143120 123495986295081
14460638909663 137956625204744
15260612676206 153217237880950

For all practical purposes, perl5, python3 and mawk2 are tied for speed when summing up 1 to 99,999,999:


echo '99999999' | mawk2 '$++NF = (__=+$++_)*++__/++_'

99999999 4999999950000000

(All input digits were re-generated on the fly and piped in to eliminate any potential cache access advantage):

      in0:  847MiB 0:00:10 [81.1MiB/s] [81.1MiB/s] [ <=> ]
     1  4999999950000000

(python3 -c 'import sys; print(int(sum((float(_) for _ in sys.stdin))))')  
      19.14s user 0.55s system 188% cpu 10.473 total
gcat -b  0.00s user 0.00s system 0% cpu 10.473 total

      in0:  847MiB 0:00:10 [81.0MiB/s] [81.0MiB/s] [ <=> ]
     1  4999999950000000

(perl536 -nle '$sum += $_ } END { print $sum')
          19.37s user 0.55s system 190% cpu 10.472 total
gcat -b      0.00s user 0.00s system 0% cpu 10.472 total

      in0:  847MiB 0:00:10 [81.1MiB/s] [81.1MiB/s] [ <=>]
     1  4999999950000000

(mawk1996 '{ _+=$__ } END { print _ }')

      17.51s user 0.57s system 172% cpu 10.463 total
gcat -b  0.00s user 0.00s system 0% cpu 10.463 total

However, once you eliminate the pipe and hashing speed factors and ask each tool to generate and sum the sequence internally, perl5.36 is some 52% slower:

( time (
 mawk2 'BEGIN { for(___=_-=_=__=((_+=++_)+(_*=_+_))^_; ++_<__;)___+=_
        print ___ }' 
                     ) | gcat -b ) | lgp3 ;

( time ( 
 perl5 -e '$y = $x = 0; $z = 10**8; while(++$x < $z) { $y += $x } print $y' 
         ) | gcat -b ) | lgp3 ;

     1  4999999950000000

( mawk2 ; )  1.97s user 0.01s system 99% cpu 1.981 total
gcat -b  0.00s user 0.00s system 0% cpu 1.979 total

( perl5 -e '$y = $x = 0; $z = 10**8; while(++$x < $z) { $y += $x } print $y';   2.98s user 0.03s system 99% cpu 3.015 total
gcat -b  0.00s user 0.00s system 0% cpu 3.014 total
     1  4999999950000000

As for gnu-parallel, it's more than half an order of magnitude slower:

  • 36 concurrent jobs with 5,000,000 rows per job and a very generous 100 MB size cap, running on an M1 Max with 64 GB of RAM, still took nearly 53 seconds, compared to about 10.5 secs for the other 3.

  ( time ( mawk2 'BEGIN { for(_-=_=__=((_+=++_)+(_*=_+_))^_; ++_ < __; ) print _ }' |
   pvE0 | 
   parallel --block 100M -N 5000000 -j 36 --pipe "gpaste -sd+ - | bc" | gpaste -sd+ - | bc 
   ) |  gcat -b ) | lgp3 | lgp3 -1;

   in0:  847MiB 0:00:47 [17.8MiB/s] [17.8MiB/s] [  <=> ]
   1    4999999950000000

   0.00s user 0.00s system 0% cpu 52.895 total

======================

Reference code for massively loop-unrolled summation (this variant handles 512 numbers per while() loop iteration):

( gawk -p- -be  "${DT}/temptestpipepartinput.txt"; )  
8.50s user 1.46s system 99% cpu 9.970 total

     1  4999999950000000

     2      # gawk profile, created Sat Oct 21 04:25:20 2023
     3      # BEGIN rule(s)
     4      BEGIN {
     5       1 CONVFMT="%.250g"
     6       1 FS=RS
     7       1 RS="^$"
     8      }
     9      # END rule(s)
    10      END {
    11       1 print ______()
    12      }
    13      # Functions, listed alphabetically
    14       1 function ______(_, __, ___)
    15      {

    16       1 ___=(__=_=_<_)+NF

    17 196079 while (_<___)
    18        __ += $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    19  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    20  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    21  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    22  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    23  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    24  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    25  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    26  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    27  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    28  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    29  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    30  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    31  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    32  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    33  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    34  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    35  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    36  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    37  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    38  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    39  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    40  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    41  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    42  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    43  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    44  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    45  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    46  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    47  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    48  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    49  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    50  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    51  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    52  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    53  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    54  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    55  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    56  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    57  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    58  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    59  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    60  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    61  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    62  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    63  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    64  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    65  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    66  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    67  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    68  + $++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_+$++_
    69  + $++_+$++_

    71     return __
    73  }
Worldlywise answered 12/8, 2023 at 15:9 Comment(8)
-N is slow with --pipe because you force GNU Parallel to count lines, so remove that. If data is in a file, use --pipe-part - which can deliver ~1 GB/s per core.Longicorn
@OleTange : You ran exactly my code, stored it first in a file, and benchmarked gnu-parallel to 1GB/s per core, so you're just quoting their theoretical ? It wasn't gnu-parallel that was slow in my benchmarking - bc wasn't happy when i tried piping in a single string of 847 MBWorldlywise
This looks like an impressive effort, but the messy formatting makes it pretty hard to read. Perhaps the exposition would benefit from fewer separator lines and more explanations of what exactly you did and what it means.Vellicate
@Vellicate : it's basically the same set of tests, against a pre-made file, just to see how much gnu-parallel could benefit from. It's just ( time ( parallel --pipe-part -j 6 --argfile "${DT}/temptestpipepartinput.txt" 'gpaste -sd+ - | bc' ) | pvZL9 ) | gpaste -sd+ - | bc | gcat -b. pvZL9 is just a line-counting timer of pv for second-level timer, but ultimately it's time command's values being shared above. Long story short - yes —pipe-part helped gnu-parallel quite a bit, but a very wide gap still persists between it and the awk/perl/pythonmawk2 remains 3.05x fasterWorldlywise
@Vellicate : ….. gnu-parallel needed that gpaste -sd+ - | bc command twice: once for the internal job splits, which still need 1 final aggregation because the output came back in 60 chunks, implying gnu-parallel auto-split roughly 3 jobs every 5 million lines.Worldlywise
oh i finally figured out the right parameters for gnu-parallel to be competitive :::::::: ( time ( parallel --round-robin -j 8 --blocksize=1M --pipe-part -a "${DT}/temptestpipepartinput.txt" 'gpaste -sd+ - | bc' ) | pvE9 ) | gpaste -sd+ - | bc :::: : ( parallel --round-robin -j 8 --blocksize=1M --pipe-part -a ; ) 36.57s user 8.65s system 562% cpu 8.035 total 4999999950000000 :::::: only 8.035 secs now.Worldlywise
So it's theoretically possible to achieve same performance with gnu-parallel, but requires knowing some hand-tuning of parameters. It's also kinda extreme - 8.0 secs hand-tuned, but 21.2 secs letting gnu-parallel figure it out itself via --pipe-part, and a lovely 52 secs via --pipe instead of --arg-file. awk/perl/python were all designed with piping in mind, so their performance gap was much smaller - roughly 7-8 secs pre-made file vs. 10-11 secs pipeWorldlywise
combining parallel --pipe-part -a "$file" -j 5 --blocksize=200M 'mawk2 … with some massively loop-unrolled summation function, i managed to get it down to :::::::::::::::: :::::::::::::::::::::: ::::::::::::::: 6.15s user 2.95s system 283% cpu 3.207 total :::::::::: 1 4999999950000000. For another experiment, loading the entire file at once with 1 instance of mawk2, and pairing it with a recursive adder handling at most 256 columns each, was maybe 5.7-5.9 secs-ish. But loop unrolling has limitations when length of execution pipeline start to become a factorWorldlywise
