Most elegant unix shell one-liner to sum list of numbers of arbitrary precision?
As regards integer-adding one-liners, several shell scripting solutions have been proposed;
however, on closer inspection, each of the chosen solutions has inherent limitations:

  • awk ones choke on arbitrary precision and integer size (awk behaves C-like, after all)
  • bc ones are unhappy with arbitrarily long inputs: (sed 's/$/+\\/g';echo 0)|bc

On top of that, there may be portability issues across platforms (see [1] [2]), which is undesirable. Is there a generic solution that is a winner on both practicality and brevity?

Hint: SunOS & MacOSX are examples where portability would be an issue.
For instance, could the dc command handle arbitrarily large inputs (2^n, integer or otherwise)?

[1] awk: https://mcmap.net/q/53044/-shell-command-to-sum-integers-one-per-line or Printing long integers in awk

[2] bc: Bash command to sum a column of numbers

Dagon answered 27/6, 2017 at 1:30 Comment(6)
At what point does bc complain about long input? – Obtain
If you use later versions of gawk, arbitrary precision is supported. – Kopeck
Please clarify: (1) What is a unix one-liner? Unix is not a programming language. (2) What do you mean by arbitrary precision? The term usually means exact fractional arithmetic, but it is not clear from your posting whether you mean that too. (3) How is the input provided? And: is it two numbers or a list of numbers? – Lina
OP here: question rewritten for clarity, providing references for where awk and bc might fail. I hope it now clarifies the shortcomings of either. – Dagon
How about Ruby, then? For instance, ruby -e "p 777777710000000000000000000000000000000001+3" prints 777777710000000000000000000000000000000004 to stdout. – Lina
@Kopeck beware of gawk arbitrary precision. I've checked, and gawk produces incorrect output for most values compared to bc, perl or python, so I've just filed a bug report. – Uredo

An optimal solution for dc(1) sums the inputs as they are read: sed appends a + to every line except the first, and a p to the last, so dc keeps a running total on its stack and prints it at the end:

$ jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
500000500000
Enid answered 16/9, 2017 at 0:27 Comment(2)
OP here, @kdhp: wow, this is really good, this is what I was after! Very good performance too: time jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc gives 500000500000 in real 0m2.987s, user 0m3.971s, sys 0m0.035s. – Dagon
@Dagon: 3.971 seconds is fast?? awks do it in 0.169 s: time jot 1000000 | mawk2 '{__+=$_}END{print __}' gives 500000500000, with jot 1000000 at 0.16s user 0.00s system 98% cpu 0.170 total, and mawk2 '{__+=$_}END{print __}' at 0.07s user 0.00s system 45% cpu 0.169 total. – Chanellechaney

The one I usually use is paste -sd+|bc:

$ time seq 1 20000000 | paste -sd+|bc
200000010000000

real    0m10.092s
user    0m10.854s
sys     0m0.481s

(For strict Posix compliance, paste needs to be provided with an explicit argument: paste -sd+ -|bc. Apparently that is necessary with the BSD paste implementation installed by default on OS X.)

However, that will fail for larger inputs, because bc buffers an entire expression in memory before evaluating it. On my system, bc ran out of memory trying to add 100 million numbers, although it was able to do 70 million. But other systems may have smaller capacities.

Since bc has variables, you could avoid long lines by repeatedly adding to a variable instead of constructing a single long expression: here sed prefixes every input line with s+= and appends a final line consisting of just s, so bc prints the accumulated sum. This is (as far as I know) 100% Posix compliant, but there is a 3x time penalty:

$ time seq 1 20000000|sed -e's/^/s+=/;$a\' -es|bc
200000010000000

real    0m29.224s
user    0m44.119s
sys     0m0.820s

Another way to handle the case where the input size exceeds bc's buffering capacity would be to use the standard xargs tool to add the numbers in groups:

$ time seq 1 100000000 |
> IFS=+ xargs sh -c 'echo "$*"' _ | bc | paste -sd+ | bc
5000000050000000

real    1m0.289s
user    1m31.297s
sys     0m19.233s

The number of input lines used by each xargs evaluation will vary from system to system, but it will normally be in the hundreds and it might be much more. Obviously, the xargs | bc invocations could be chained arbitrarily to increase capacity.

It might be necessary to limit the size of the xargs expansion using the -s switch on systems where ARG_MAX exceeds the capacity of the bc command. Aside from performing an experiment to establish the bc buffer limit, there is no portable way to know what that limit might be, but it certainly should be no less than LINE_MAX, which is guaranteed to be at least 2048. Even with 100-digit addends, that allows a reduction by a factor of 20, so a chain of 10 xargs|bc pipes would handle over 10^13 addends, assuming you were prepared to wait a couple of months for that to complete.
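
Since LINE_MAX is the one bound Posix does guarantee, a cautious way to pick the -s value is to ask getconf for it (a sketch; the fallback of 2048 is the Posix minimum):

```shell
# Use the system's LINE_MAX (at least 2048 per POSIX) as a
# conservative cap on each xargs command line, then sum the
# resulting partial sums with paste | bc.
MAXLINE=$(getconf LINE_MAX 2>/dev/null) || MAXLINE=2048
seq 1 100000 |
  xargs -s "$MAXLINE" sh -c 'IFS=+; echo "$*"' _ | bc |
  paste -sd+ - | bc
```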

As an alternative to constructing a large fixed-length pipeline, you could use a function to recursively pipe the output from xargs|bc until only one value is produced:

radd () { 
    if read a && read b; then
        { printf '%s\n%s\n' "$a" "$b"; cat; } |
          IFS=+ xargs -s $MAXLINE sh -c 'echo "$*"' _ |
          bc | radd
    else
        echo "$a"
    fi
}

If you use a very conservative value for MAXLINE, the above is quite slow, but with plausible larger values it is not much slower than the simple paste|bc solution:

$ time seq 1 20000000 | MAXLINE=2048 radd
200000010000000

real    1m38.850s
user    0m46.465s
sys     1m34.503s

$ time seq 1 20000000 | MAXLINE=60000 radd 
200000010000000

real    0m12.097s
user    0m17.452s
sys     0m5.090s

$ time seq 1 100000000 | MAXLINE=60000 radd 
5000000050000000

real    1m3.972s
user    1m31.394s
sys     0m27.946s

As well as the bc solutions, I timed some other possibilities. As shown above, with an input of 20 million numbers, paste|bc took 10 seconds. That's almost identical to the time used by adding 20 million numbers with

gawk -M '{s+=$0} END{print s}'

Programming languages such as python and perl proved to be faster:

# 9.2 seconds to sum 20,000,000 integers
python -c $'import sys\nprint(sum(int(x) for x in sys.stdin))'
# 5.1 seconds
perl -Mbignum -lne '$s+=$_; END{print $s}'

I was unable to test dc -f - -e '[+z1<r]srz1<rp' on large inputs, since its performance appears to be quadratic (or worse); it summed 25 thousand numbers in 3 seconds, but it took 19 seconds to sum 50 thousand and 90 seconds to do 100 thousand.

Although bc is not the fastest and memory limitations require awkward workarounds, it has the advantage of working out of the box on Posix-compliant systems without the necessity to install enhanced versions of any standard utility (awk) or programming languages not required by Posix (perl and python).

Obtain answered 27/6, 2017 at 1:38 Comment(7)
That paste -sd+| can be replaced with seq's separator switch -s, i.e.: time seq -s+ 1 20000000 | bc. – Fidelfidela
@agc: I don't believe that the input is likely to be seq, and if it were, you could just compute the sum directly. I just used seq because it shows that bc can handle twenty million values. – Obtain
OP here: I like the elegance of paste|bc; however, bc needs to deal with a finite buffer (at least on certain platforms), so it's not the generic solution we are after. Please do prove me wrong if you know otherwise. – Dagon
@rici: I have also come to think that xargs|bc incantations chained in a tree fashion (you can do more levels than 2; think i-nodes) could permit processing of super-huge number lists; however, the drawback is that the limits we put in there are arbitrary, and bc's input buffer size changes from system to system, so it is very unclear how to break down the inputs in a generic way. – Dagon
@fgeorgatos: I mentioned the chaining possibility in my answer; I have now made it explicit with a version which doesn't need to know how long a chain to construct (although I'm not sure that it still falls in the category of "elegant one-liners"). I also added a cumulating bc solution using a variable, which is not competitive in execution time (approximately a 3x slow-down) but should be completely portable as long as you don't exceed bc's line limit with a single integer. – Obtain
@rici: I find your radd() solution with MAXLINE=60000 the best bet for high portability, with the fewest constraints (MAXLINE may give compatibility jitter across platforms, though). Happy to accept that as the best answer so far, although it is entertaining that it is not as simple as people might have thought at first sight. – Dagon
@fgeorgatos: The cumulating bc solution is probably the most portable, but the slow-down is dramatic. radd() is, if nothing else, an interesting case study in recursive pipelining, which has certain other applications. – Obtain

You can use gawk with the -M flag:

$ seq 1 20000000 | gawk -M '{s+=$0} END{print s}'
200000010000000

Or Perl with bignum enabled:

$ seq 1 20000000 | perl -Mbignum -lne '$s+=$_; END{print $s}'
200000010000000
Kopeck answered 27/6, 2017 at 3:15 Comment(10)
OP here: interesting offer, didn't know about the -M flag of awk; however, this won't work on several MacOSX versions, and I doubt that SunOS would be happy about it either – unless you refresh the GNU tools installation to a modern one, which defeats the original question's description! **** Perl: although I'm Perl-averse, this might be a not-too-bad pick, given that Perl has been available across several platforms for 20+ years. What about compatibility of -Mbignum, though? – Dagon
Defeats the original question's description??? You specifically said "Understanding that there will be issues of portability across platforms [1] [2], is there some generic solution which is a winner on both practicality and brevity?". So you effectively said "don't worry about portability as I understand that's an issue" and then complained that the posted solution isn't portable (unless you install the specific tool). You specifically asked for a practical and brief solution and got one with gawk -M. – Disobey
Well, I ran those two examples on MacOS. gawk is trivial to install with Brew. Apple itself will not be updating to GNU tools because of incompatibility with GPL 3.0, but they are trivial to install. bignum dates back to 2002 and has been part of the Perl distro since maybe 2010. The perl example should run on any platform that has perl. btw: | paste -sd+|bc does not run on MacOS. – Kopeck
@dawg: What happens when you use paste -sd+|bc on Mac OS? paste and bc should both be available... – Obtain
@rici: The command on BSD / OS X would either be seq 1 10 | paste -sd+ - | bc (note the - for the file) or paste -sd+ <(seq 1 10) | bc. In either case, with larger sequences, bc gives (standard_in) 1: parse error. Works fine for smaller sequences. – Kopeck
@dawg: Huh. I've certainly used that command on a Mac, but it would have been five years ago, and it is quite possible that I had gnu coreutils installed on it. Can you quantify "larger"? – Obtain
@dawg: thanks for clarifying the perl situation; so even that is not that much of a common ground, after all – a sigh of relief from this end ;-) ==== Ed Morton: I've made it more clear in the description that portability issues are undesired. After all, we like to write scripts which are write-once run-everywhere, don't we? ==== all: most people have correctly assumed that we are after a posix-compliant solution; please keep making that assumption, because it is the right thing. – Dagon
@rici: bigger than seq 1 50 but smaller than seq 1 20000000. I also tried the xargs approach, but that did not solve it. – Kopeck
@rici: Turns out the issue is seq on OS X. Try $ seq 999999 1000000, which produces 999999\n1e06. Now try seq 999999 1000000 | paste -sd+ - | bc, which produces (standard_in) 1: parse error because of the 1e+06. So seq and bc have their own platform issues – bc since it does not support 'e' scientific notation. – Kopeck
@dawg: good to know, thanks. I tried compiling bsd bc from source and still couldn't repro a problem. Although you clearly hit a memory limit eventually. – Obtain

$ seq 1000|(sum=0;while read num; do sum=`echo $sum+$num|bc -l`;done;echo $sum)
500500

Also, this one will not win a top-speed prize; however, it IS:

  • oneliner, yes.
  • portable
  • adds lists of any length
  • adds numbers of any precision (each number's length limited only by MAXLINE)
  • does not rely on external tools such as python/perl/awk/R etc

With a stretch, you may call it elegant too ;-) Come on, folks: show a better way to do this!

Dagon answered 12/7, 2017 at 16:58 Comment(1)
If we just need to add large integers, Bash is enough. For example, to add the size of all images in a directory: ls -l *.png | cut -d ' ' -f 5 | (sum=0; while read num; do sum=$[$sum+$num]; done; echo $sum) – Upshot

It seems that the following does the trick:

$ seq 1000|dc -f - -e '[+z1<r]srz1<rp'
500500

But is it the optimal solution?

Dagon answered 27/6, 2017 at 1:30 Comment(2)
This cannot be optimal, because it is too slow to be practical on input sizes far smaller than the memory limit of bc. See my answer. – Obtain
@rici: interesting discovery about the quadratic performance. Not entirely surprised, but I thought it would be possible to do smarter things with a stack-based language – perhaps there is another dc incantation... tbd. – Dagon

jot really slows you down:

( time ( jot 100000000 | pvZ -i 0.2 -l -cN in0 |
         mawk2 '{ __+=$_ } END { print __ }' FS='\n' ) )

      in0:  100M 0:00:17 [5.64M/s] [ <=> ]

( jot 100000000 | nice pv -pteba -i 1 -i 0.2 -l -cN in0 |
  mawk2 FS='\n'; )

26.43s user 0.78s system 153% cpu 17.730 total
5000000050000000

Using another awk instance to generate the sequence shaves off 39.7%:

( time (
        mawk2 -v __='100000000' '
        BEGIN { for(_-=_=__=+__;_<__;) {
                        print ++_ } }' |
        pvZ -i 0.2 -l -cN in0 |
        mawk2 '{ __+=$_ } END{ print __ }' FS='\n' ))

      in0:  100M 0:00:10 [9.37M/s] [ <=> ]

( mawk2 -v __='100000000' 'BEGIN {…}' | )

19.44s user 0.68s system 188% cpu 10.687 total
5000000050000000

For the bc option, gnu-paste is quite a bit faster than bsd-paste in this regard, but both absolutely pale compared to awk, while perl is only slightly behind:

time jot 15000000 | pvE9 | mawk2 '{ _+=$__ } END { print _ }'
     out9:  118MiB 0:00:02 [45.0MiB/s] [45.0MiB/s] [ <=> ]

112500007500000

jot 15000000   2.60s user 0.03s system 99% cpu 2.640 total
pvE 0.1 out9   0.01s user 0.05s system  2% cpu 2.640 total
mawk2 '{...}'  1.09s user 0.03s system 42% cpu 2.639 total
perl -Mbignum -lne '$s+=$_; END{print $s}'             # perl 5.36

               1.36s user 0.03s system 52% cpu 2.662 total
time jot 15000000 | pvE9 | gpaste -sd+ -|bc
     out9:  118MiB 0:00:02 [45.3MiB/s] [45.3MiB/s] [ <=> ]

112500007500000

jot 15000000   2.59s user 0.03s system 99% cpu 2.627 total
pvE 0.1 out9   0.01s user 0.05s system  2% cpu 2.626 total
gpaste -sd+ -  0.27s user 0.03s system 11% cpu 2.625 total # gnu-paste
bc             4.55s user 0.46s system 66% cpu 7.544 total
time jot 15000000 | pvE9 | paste -sd+ -|bc
     out9:  118MiB 0:00:05 [22.7MiB/s] [22.7MiB/s] [ <=>  ]

112500007500000

jot 15000000  2.63s user 0.03s system 51% cpu  5.207 total
pvE 0.1 out9  0.01s user 0.06s system  1% cpu  5.209 total
paste -sd+ -  5.14s user 0.05s system 99% cpu  5.211 total  # bsd-paste
bc            4.53s user 0.40s system 49% cpu 10.029 total
Chanellechaney answered 28/12, 2022 at 21:41 Comment(0)
