How to improve the speed of reading a large file in Common Lisp line by line
Asked Answered
S

2

6

Recently I have a task to process a large file, the file size is 460MB, and contains 5777672 lines. When I use the linux built-in command 'wc' to calculate the file line numbers, it is blazing fast:

time wc -l large_ess_test.log
5777672 large_ess_test.log

real    0m0.144s
user    0m0.052s
sys     0m0.084s

Then I use following codes to calculate the line numbers in Common Lisp (SBCL 1.3.7 64bits)

#!/usr/local/bin/sbcl --script
(defparameter filename (second *posix-argv*))
(format t "nline: ~D~%"
        (with-open-file (in filename)
          (loop for l = (read-line in nil nil)
             while l
             count l)))

The result make me disappointment, since it is really slow comparing to the 'wc' command. We just calculate the line number, even without any other operations yet:

time ./test.lisp large_ess_test.log
nline: 5777672

real    0m3.994s
user    0m3.808s
sys     0m0.152s

I know SBCL provide the C function interface, with which we can call the C procedures directly. I believe if I call the C functions directly, the performance will improve, so I write following codes:

#!/usr/local/bin/sbcl --script
(define-alien-type pointer (* char))
(define-alien-type size_t  unsigned-long)
(define-alien-type ssize_t long)
(define-alien-type FILE*   pointer)

(define-alien-routine fopen FILE*
  (filename c-string)
  (modes    c-string))

(define-alien-routine fclose int
  (stream FILE*))

(define-alien-routine getline ssize_t
  (lineptr (* (* char)))
  (n       (* size_t))
  (stream  FILE*))

;; The key to improve the performance:
(declaim (inline getline))
(declaim (inline read-a-line))

(defparameter filename (second *posix-argv*))

(defun read-a-line (fp)
  (with-alien ((lineptr (* char))
               (size    size_t))
    (setf size 0)
    (prog1
        (getline (addr lineptr) (addr size) fp)
      (free-alien lineptr))))

(format t "nline: ~D~%"
        (let ((fp (fopen filename "r"))
              (nline 0))
          (unwind-protect
               (loop
                  (if (= -1 (read-a-line fp))
                      (return)
                      (incf nline)))
            (unless (null-alien fp)
              (fclose fp)))
          nline))

Beware there are two 'declaim' lines. If we do not write that two lines, the performance is nearly the same as previous version:

;; Before declaim inline:

;; time ./test2.lisp large_ess_test.log
;; nline: 5777672

;; real 0m3.774s
;; user 0m3.604s
;; sys  0m0.148s

But if we write that two lines, the performance increased dramatically:

;; After delaim inline:

;; time ./test2.lisp large_ess_test.log
;; nline: 5777672

;; real 0m0.767s
;; user 0m0.616s
;; sys  0m0.136s

I think the performance issue of the first version is that 'read-line' do many other things than just read a line from the stream. Also if we can get a inline version of the 'read-line' the speed will increase. The question is can we do that? Is there any other (standard) way to improve the read performance without rely on the FFI (not standard)?

Soloist answered 29/7, 2016 at 21:37 Comment(0)
D
5

The wc utility is specialized towards this task (e.g. it uses fadvise). If I had to perform the task quickly, I would probably consider using it from Lisp:

CL-USER> (time (parse-integer
                 (trivial-shell:shell-command "wc -l /tmp/large") 
                 :junk-allowed t))
Evaluation took:
  0.160 seconds of real time
  0.007343 seconds of total run time (0.000000 user, 0.007343 system)
  4.38% CPU
  381,646,599 processor cycles
  2,176 bytes consed

5777672
7

Here below is a 2.8 times slower Common Lisp version (SBCL 1.3.7) which:

  1. uses a buffer of (UNSIGNED-BYTE 8) elements and searches for 10 (LF)
  2. relies on READ-SEQUENCE
  3. counts elements explicitly (no COUNT)
  4. adds optimization declarations

As explained in comments, this assumes a particular encoding of newlines which won't work in all cases (that's bad, but here we replicate how wc works).

Use case

I made a file with the required number of lines and random large numbers on each line.

$ head /tmp/large
40721464513295045164409764141337171283743839234004114007016385954846624941161940739262754532145351336011544635983803337802
302688650332823972161024925841738216684275519674144853512935484321121382058207767892999110099
12127138342525644979456951336948881438967488255401497749747122531372644240417582283720034330082860221222236934955
28004461699214617943893203751119815181262623130442209320081054856344182547684
2368224648283244549917005208294446715375229403128245954161044012485784650329544448732041119652238003906938784265044644012743487917338526
10187414801460188523874389448625131601828345073853512891
18139254731161634077170374183629006496541918416200333307681019211073598374443624027089513206284736438073440343464515605950135369987
264133633737591502517649433121708413001893239265224973146093724444415999323412026140148811107315275274514969546676171233513940820
266634202314513982469064052528307445611038540754445234380948245264834237744595384991230031062233083375534272384684213524515821
17743431383885515663346469524228524653280663312275122927140858199583669032542409846791571021743570930576483101689249445164712663940464

$ time wc -l /tmp/large
5777672 /tmp/large

real    0m0.180s
user    0m0.119s
sys 0m0.061s

$ du -h /tmp/large
388M    /tmp/large

Counting lines

(defun count-lines (file &optional (buffer-size 32768))
  (declare (optimize (speed 3) (debug 0) (safety 0))
           (type fixnum buffer-size))
  (let ((buffer
         (make-array buffer-size
                     :element-type #1='(unsigned-byte 8)))
        (sum 0)
        (end 0))
    (declare (type fixnum sum end))
    (with-open-file (in file :element-type #1#)
      (loop
         (setf end (read-sequence buffer in))
         (when (= end 0)
           (return sum))
         (dotimes (i end)
           (declare (type fixnum i)
                    (dynamic-extent i))
           (when (= 10
                    (aref buffer i))
             (incf sum)))))))

Test

CL-USER> (time(count-lines #P"/tmp/large"))

Evaluation took:
  0.493 seconds of real time
  0.493113 seconds of total run time (0.409636 user, 0.083477 system)
  100.00% CPU
  1,179,393,504 processor cycles
  1,248 bytes consed

5777672

If you need to do something else with the line, use a string buffer instead and reuse it directly without doing copies. You probably need to copy the last chunk of characters (after the last newline in the buffer) to the beginning in order to fill the buffer again, though.

Debauchery answered 30/7, 2016 at 20:49 Comment(2)
it would be slightly more complicated for crlf eol (WIndows, HTTP, ...)Vocalise
Thanks your code sample. I have the same idea before, but I did not use the '(unsigned-byte 8)' as the element-type, it leads to a poor performance. When I switch to the '(unsigned-byte 8)' element type, I'm very satisfied the performance. Thanks again.Soloist
V
4

One of the main problems of READ-LINE is that it allocates a new string for each call. This can cost time, depending of the implementation.

The Common Lisp standard lacks a function, which reads a line into a string buffer.

Some implementations provide a solution for a function which reads a line into a buffer. For example the function READ-LINE-INTO in Allegro CL.

Usually implementations provide streams which buffer the input. Searching for newlines can be implemented on top of that, but the code for that might be implementation specific (or use some of the stream abstractions) and/or complex.

I don't know if there is an official implementation of such functionality, but something like that can be found here - looks complex for SBCL:

read-line-into-buffer in https://github.com/ExaScience/cl-elprep/blob/master/buffer.lisp

Vocalise answered 29/7, 2016 at 22:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.