Recently I had a task to process a large file: it is 460MB and contains 5777672 lines. When I use the Linux built-in command 'wc' to count the lines, it is blazing fast:
time wc -l large_ess_test.log
5777672 large_ess_test.log
real 0m0.144s
user 0m0.052s
sys 0m0.084s
Then I used the following code to count the lines in Common Lisp (SBCL 1.3.7, 64-bit):
#!/usr/local/bin/sbcl --script
(defparameter filename (second *posix-argv*))
(format t "nline: ~D~%"
        (with-open-file (in filename)
          (loop for l = (read-line in nil nil)
                while l
                count l)))
The result is disappointing: it is really slow compared to the 'wc' command, even though we only count the lines and do nothing else yet:
time ./test.lisp large_ess_test.log
nline: 5777672
real 0m3.994s
user 0m3.808s
sys 0m0.152s
I know SBCL provides a foreign function interface with which we can call C functions directly. I believed that calling the C functions directly would improve performance, so I wrote the following code:
#!/usr/local/bin/sbcl --script
(define-alien-type pointer (* char))
(define-alien-type size_t unsigned-long)
(define-alien-type ssize_t long)
(define-alien-type FILE* pointer)
(define-alien-routine fopen FILE*
  (filename c-string)
  (modes c-string))
(define-alien-routine fclose int
  (stream FILE*))
(define-alien-routine getline ssize_t
  (lineptr (* (* char)))
  (n (* size_t))
  (stream FILE*))
;; The key to improve the performance:
(declaim (inline getline))
(declaim (inline read-a-line))
(defparameter filename (second *posix-argv*))
(defun read-a-line (fp)
  (with-alien ((lineptr (* char))
               (size size_t))
    (setf size 0)
    (prog1
        (getline (addr lineptr) (addr size) fp)
      (free-alien lineptr))))
(format t "nline: ~D~%"
        (let ((fp (fopen filename "r"))
              (nline 0))
          (unwind-protect
               (loop
                 (if (= -1 (read-a-line fp))
                     (return)
                     (incf nline)))
            (unless (null-alien fp)
              (fclose fp)))
          nline))
Note the two 'declaim' lines. Without those two lines, the performance is nearly the same as the previous version:
;; Before declaim inline:
;; time ./test2.lisp large_ess_test.log
;; nline: 5777672
;; real 0m3.774s
;; user 0m3.604s
;; sys 0m0.148s
But with those two lines, the performance improves dramatically:
;; After declaim inline:
;; time ./test2.lisp large_ess_test.log
;; nline: 5777672
;; real 0m0.767s
;; user 0m0.616s
;; sys 0m0.136s
I think the performance problem in the first version is that 'read-line' does much more than just read a line from the stream. Also, if we could get an inlined version of 'read-line', the speed should increase. The question is: can we do that? Is there any other (standard) way to improve read performance without relying on the FFI (which is not standard)?
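For reference, one standard-only direction I could try is to skip 'read-line' entirely: open the file as a binary stream, fill a fixed-size buffer with 'read-sequence', and count the newline bytes. The following is only a rough sketch under some assumptions: the function name count-lines and the 64KB buffer size are my own choices, a newline is assumed to be the single byte 10 (true for ASCII and UTF-8), and I have not measured how it compares with the versions above.

```lisp
;; Hypothetical helper: count newline bytes (code 10) in the file at PATH
;; by reading it in fixed-size blocks of octets with READ-SEQUENCE.
(defun count-lines (path)
  (let ((buffer (make-array 65536 :element-type '(unsigned-byte 8)))
        (nlines 0))
    (declare (type (simple-array (unsigned-byte 8) (65536)) buffer)
             (type fixnum nlines))
    (with-open-file (in path :element-type '(unsigned-byte 8))
      ;; READ-SEQUENCE returns the index of the first element it did not
      ;; fill, so a return value of 0 means end of file.
      (loop for end fixnum = (read-sequence buffer in)
            until (zerop end)
            do (loop for i fixnum from 0 below end
                     when (= 10 (aref buffer i))
                       do (incf nlines))))
    nlines))
```

It would be called as (format t "nline: ~D~%" (count-lines (second *posix-argv*))). Like 'wc -l', this counts newline characters, so a final line without a trailing newline is not counted, whereas the 'read-line' loop does count it.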