Improve speed of string manipulations

Asked 13/7, 2013 at 11:47 Answered 18/7, 2013 at 19:11

This is a follow-up question, sort of, to this one: Write an efficient string replacement function? .

In (albeit distant) future I hope to get to do natural language processing. Of course speed of strings manipulation is important because of that. Accidentally, I've stumbled over this test: http://raid6.com.au/~onlyjob/posts/arena/ - all tests are biased, this is no exception. However, it raised important question for me. And so I wrote a few tests to see how am I doing:

This was my first attempt (I'll call it #A):

(defun test ()
  (declare (optimize (debug 0) (safety 0) (speed 3)))
  (loop with addidtion = (concatenate 'string "abcdefgh" "efghefgh")
     and initial = (get-internal-real-time)
     for i from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000)
     for ln = (* (length addidtion) i)
     for accumulated = addidtion
     then (loop with concatenated = (concatenate 'string accumulated addidtion)
             for start = (search "efgh" concatenated)
             while start do (replace concatenated "____" :start1 start)
             finally (return concatenated))
     when (zerop (mod ln (* 1024 256))) do
       (format t "~&~f s | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024)))
  (values))

(test)

Baffled with the results, I tried to use cl-ppcre - I don't know what I was hoping for, but the results came out as really bad... Here's the code I used for testing:

(ql:quickload "cl-ppcre")

(defun test ()
  (declare (optimize (debug 0) (safety 0) (speed 3)))
  (loop with addidtion = (concatenate 'string "abcdefgh" "efghefgh")
     and initial = (get-internal-real-time)
     for i from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000)
     for ln = (* (length addidtion) i)
     for accumulated = addidtion
     then (cl-ppcre:regex-replace-all "efgh" (concatenate 'string accumulated addidtion) "____")
     when (zerop (mod ln (* 1024 256))) do
       (format t "~&~f s | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024)))
  (values))

(test)

Well, then, in hopes to maybe side-step some generalizations, I decided to write my own, albeit somewhat naive version:

(defun replace-all (input match replacement)
  (declare (type string input match replacement)
           (optimize (debug 0) (safety 0) (speed 3)))
  (loop with pattern fixnum = (1- (length match))
     with i fixnum = pattern
     with j fixnum = i
     with len fixnum = (length input) do
       (cond
         ((>= i len) (return input))
         ((zerop j)
          (loop do
               (setf (aref input i) (aref replacement j) i (1+ i))
               (if (= j pattern)
                   (progn (incf i pattern) (return))
                   (incf j))))
         ((char= (aref input i) (aref match j))
          (decf i) (decf j))
         (t (setf i (+ i 1 (- pattern j)) j pattern)))))

(defun test ()
  (declare (optimize (debug 0) (safety 0) (speed 3)))
  (loop with addidtion string = (concatenate 'string "abcdefgh" "efghefgh")
     and initial = (get-internal-real-time)
     for i fixnum from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000)
     for ln fixnum = (* (length addidtion) i)
     for accumulated string = addidtion
     then (replace-all (concatenate 'string accumulated addidtion) "efgh" "____")
     when (zerop (mod ln (* 1024 256))) do
       (format t "~&~f s | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024)))
  (values))

(test)

Almost as slow as cl-ppcre! Now, that's incredible! There isn't anything I can spot here such that would result in such poor performance... And still it does suck :(

Realizing that the standard functions performed the best so far, I looked into SBCL source and after some reading I came up with this:

(defun replace-all (input match replacement &key (start 0))
  (declare (type simple-string input match replacement)
           (type fixnum start)
           (optimize (debug 0) (safety 0) (speed 3)))
  (loop with input-length fixnum = (length input)
     and match-length fixnum = (length match)
     for i fixnum from 0 below (ceiling (the fixnum (- input-length start)) match-length) do
       (loop with prefix fixnum = (+ start (the fixnum (* i match-length)))
          for j fixnum from 0 below match-length do
            (when (<= (the fixnum (+ prefix j match-length)) input-length)
              (loop for k fixnum from (+ prefix j) below (the fixnum (+ prefix j match-length))
                 for n fixnum from 0 do
                   (unless (char= (aref input k) (aref match n)) (return))
                 finally
                   (loop for m fixnum from (- k match-length) below k
                      for o fixnum from 0 do
                        (setf (aref input m) (aref replacement o))
                      finally
                        (return-from replace-all
                          (replace-all input match replacement :start k))))))
       finally (return input)))

(defun test ()
  (declare (optimize (debug 0) (safety 0) (speed 3)))
  (loop with addidtion string = (concatenate 'string "abcdefgh" "efghefgh")
     and initial = (get-internal-real-time)
     for i fixnum from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000)
     for ln fixnum = (* (length addidtion) i)
     for accumulated string = addidtion
     then (replace-all (concatenate 'string accumulated addidtion) "efgh" "____")
     when (zerop (mod ln (* 1024 256))) do
       (format t "~&~f s | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024)))
  (values))

(test)

Finally, I can win, although a tiny fraction of performance against the standard library - yet it is still very-very bad compared to almost everything else...

Here's the table with the results:

| SBCL #A   | SBCL #B   | SBCL #C    | SBCL #D   | C gcc 4 -O3 | String size |
|-----------+-----------+------------+-----------+-------------+-------------|
| 17.463 s  | 166.254 s | 28.924 s   | 16.46 s   | 1 s         | 256 Kb      |
| 68.484 s  | 674.629 s | 116.55 s   | 63.318 s  | 4 s         | 512 Kb      |
| 153.99 s  | gave up   | 264.927 s  | 141.04 s  | 10 s        | 768 Kb      |
| 275.204 s | . . . . . | 474.151 s  | 251.315 s | 17 s        | 1024 Kb     |
| 431.768 s | . . . . . | 745.737 s  | 391.534 s | 27 s        | 1280 Kb     |
| 624.559 s | . . . . . | 1079.903 s | 567.827 s | 38 s        | 1536 Kb     |

Now, the question: What did I do wrong? Is this something inherent to Lisp strings? Can this probably be mitigated through... what?

In the long shot, I'd even consider writing a specialized library for string processing. If the problem isn't my bad code, but rather the implementation. Would it make sense to do so? If yes, what language would you suggest for doing it?

EDIT: Just for the record, I'm now trying to use this library: https://github.com/Ramarren/ropes to deal with strings concatenation. Unfortunately, it doesn't have a replace function in it and doing multiple replaces isn't very trivial. But I'll keep this post updated when I have something.

I've tried to slightly change huaiyuan's variant to use array's fill-pointers instead of string concatenation (to achieve something similar to StringBuilder suggested by Paulo Madeira. It probably can be optimized further, but I'm not sure about the types / which will method be faster / will it be worth to redefine types for * and + to get them to only operate on fixnum or signed-byte. Anyway, here's the code and the benchmark:

(defun test/e ()
  (declare (optimize speed))
  (labels ((min-power-of-two (num)
             (declare (type fixnum num))
             (decf num)
             (1+
              (progn
                (loop for i fixnum = 1 then (the (unsigned-byte 32) (ash i 1))
                   while (< i 17) do
                     (setf num
                           (logior
                            (the fixnum
                                 (ash num (the (signed-byte 32)
                                               (+ 1 (the (signed-byte 32)
                                                         (lognot i)))))) num))) num)))
           (join (x y)
             (let ((capacity (array-dimension x 0))
                   (desired-length (+ (length x) (length y)))
                   (x-copy x))
               (declare (type fixnum capacity desired-length)
                        (type (vector character) x y x-copy))
               (when (< capacity desired-length)
                 (setf x (make-array
                          (min-power-of-two desired-length)
                          :element-type 'character
                          :fill-pointer desired-length))
                 (replace x x-copy))
               (replace x y :start1 (length x))
               (setf (fill-pointer x) desired-length) x))
           (seek (old str pos)
             (let ((q (position (aref old 0) str :start pos)))
               (and q (search old str :start2 q))))
           (subs (str old new)
             (loop for p = (seek old str 0) then (seek old str p)
                while p do (replace str new :start1 p))
             str))
    (declare (inline min-power-of-two join seek subs)
             (ftype (function (fixnum) fixnum) min-power-of-two))
    (let* ((builder
            (make-array 16 :element-type 'character
                        :initial-contents "abcdefghefghefgh"
                        :fill-pointer 16))
           (ini (get-internal-real-time)))
      (declare (type (vector character) builder))
      (loop for i fixnum below (+ 1000 (* 4 1024 1024 (/ (length builder))))
         for j = builder then
           (subs (join j builder) "efgh" "____")
         for k fixnum = (* (length builder) i)
         when (= 0 (mod k (* 1024 256)))
         do (format t "~&~8,2F sec ~8D kB"
                    (/ (- (get-internal-real-time) ini) 1000)
                    (/ k 1024))))))

    1.68 sec      256 kB
    6.63 sec      512 kB
   14.84 sec      768 kB
   26.35 sec     1024 kB
   41.01 sec     1280 kB
   59.55 sec     1536 kB
   82.85 sec     1792 kB
  110.03 sec     2048 kB

Burtburta answered 13/7, 2013 at 11:47 Comment(7)

How are you measuring the function timing? – Rashid 13/7, 2013 at 13:54

@Rashid that's already in the code examples (the calls to get-interlal-real-time - in SBCL it is in milliseconds), but other than that I normally use time macro. I was just trying to keep as close to the original examples as possible. – Burtburta 13/7, 2013 at 14:12

In test #A wouldn't the search and replace loop complete faster if you started SEARCH after the last position found? – Cuellar 14/7, 2013 at 9:10

@Cuellar oh, true, but not much faster - at each iteration there are only two matches, so there shouldn't be a significant difference (I hope). Will check soon. – Burtburta 14/7, 2013 at 9:57

@Cuellar I've made the search function to restart from the saved position, but it didn't influence the outcome at all... – Burtburta 14/7, 2013 at 19:10

Hum strange it does improve on my side, I get a factor of 2.6 which is non neglectable I guess. But yeah, as @huaiyuan said the bottleneck is the search function anyways. The LispWorks profiler reports that 98% of time is spent in that function. It seems SBCL has a more efficient search function. – Cuellar 14/7, 2013 at 21:0

@wvxvw, by the way I would be really curious to know how it compares when using ropes, it sounds like a fun approach. – Cuellar 14/7, 2013 at 21:3

The bottle-neck is the search function, which is perhaps not optimized in SBCL. The following version uses position to help it skip over impossible region and is about 10 times as fast as your version #A on my machine:

(defun test/e ()
  (declare (optimize speed))
  (labels ((join (x y)
             (concatenate 'simple-base-string x y))
           (seek (old str pos)
             (let ((q (position (char old 0) str :start pos)))
               (and q (search old str :start2 q))))
           (subs (str old new)
             (loop for p = (seek old str 0) then (seek old str p)
                   while p do (replace str new :start1 p))
             str))
    (declare (inline join seek subs))
    (let* ((str (join "abcdefgh" "efghefgh"))
           (ini (get-internal-real-time)))
      (loop for i below (+ 1000 (* 4 1024 1024 (/ (length str))))
            for j = str then (subs (join j str) "efgh" "____")
            for k = (* (length str) i)
            when (= 0 (mod k (* 1024 256)))
              do (format t "~&~8,2F sec ~8D kB"
                         (/ (- (get-internal-real-time) ini) 1000)
                         (/ k 1024))))))

Heavierthanair answered 14/7, 2013 at 17:10 Comment(1)

Yup, this is really much better, for me it's about 5-6 times better then the best one so far. I'm now trying to figure out what could be so wrong about the original search. – Burtburta 14/7, 2013 at 19:12

The tests in that page are indeed biased, so let's see by how much. The author claims to test string manipulation, but here's what the programs in that page test:

String concatenation
Memory management, either explicit (C) or implicit
In some languages, regular expressions
In others, string search algorithms and substring replacement
- Memory access, which has bounds checks on several languages

There are way too many aspects just here. Here's how it's being measured:

Real time in seconds

This is unfortunate, since the computer had to be completely dedicated to running just this test for reasonable values, without any other processes whatsoever, such as services, antiviruses, browsers, even a waiting *nix shell. CPU time would be much more useful, you could even run the tests in a virtual machine.

Another aspect is that characters in C, C++, Perl, Python, PHP and Ruby are 8-bit, but they're 16-bit in many of the other tested languages. This means that memory usage is stressed in very different amounts, by at least a factor of 2. Here, cache misses are much more noticeable.

I suspect the reason Perl is so fast is that it checks its arguments once before invoking a C function, instead of constantly checking bounds. Other languages with 8-bit strings are not so fast, but are still reasonably fast.

JavaScript V8 has strings that are ASCII if possible, so if the appended and replaced token was "ëfgh", you'd pay a lot more in that implementation.

Python 3 is almost three times slower than Python 2, and my guess is it's due to the wchar_t * vs char * internal representation of strings.

JavaScript SpiderMonkey uses 16-bit strings. I didn't dig the sourced much, but the file jsstr.h mentions ropes.

Java is so slow because Strings are immutable, and so for this benchmark, it's definitely not the appropriate data type. You're paying the price of generating a huge string after each .replace(). I haven't tested, but probably a StringBuffer would be much faster.

So, this benchmark is to be taken not only with a grain of salt, but with a handful of it.

In Common Lisp, bounds checking and type dispatching in aref and its setf are probably the bottlenecks.

For good performance, you would have to ditch generic string sequences and use simple-strings or simple-vectors, whichever your implementation optimizes best. Then, you should have a way of making calls to schar or svref and their setfable forms that bypass bounds checking. From here, you can implement your own simple-string-search or simple-character-vector-search (and replace-simple-string or replace-simple-vector, although they play a much smaller role in this particular example) with full speed optimization and type declarations, with bounds checking at the head of each call instead of at each array access.

A sufficiently smart compiler™ would do all of this for you given "proper" declarations. The problem is, you'd have to use (concatenate 'simple-string/simple-vector ...), because neither simple strings nor simple vectors are adjustable.

With a compacting/moving GC, there's always a penalty in these cases (e.g. array/object copying), and choosing between array adjustment and concatenation must really depend on profiling tests. Otherwise, adjustment can be way faster than concatenation, while there's enough free memory to grow the array.

You could use adjustable arrays, if the implementation would access the actual elements directly after a brief bounds checking at the head of optimized calls to/expansions of search and replace with adjustable arrays (e.g. by having internal definitions that take the actual displaced vector/array and start and end offsets).

But I'm speculating a lot here, you have to compile, inspect the compilation and profile in each implementation for real-world facts.

As a side note, the C example code is full of bugs, such as off-by-one (-1, actually) allocations (the strcat calls write an extra byte, the zero-terminated string terminator), an uninitialized zero-terminated string gstr (the first strcat works by luck, because the memory might not be initialized to 0), conversions from size_t and time_t to int and assumption of these types in a printf format string, an unused variable pos_c that is initialized with the first allocation for gstr which is incremented without taking into account that realloc may move the buffer, and no error handling whatsoever.

Declamation answered 18/7, 2013 at 19:11 Comment(8)

To be fair, most other tests (not only C) are horrible there. In JavaScript they use something like parseInt(date1 - date2) even though - is overloaded for Date to return a number, etc. But the thing is... this is more like that mmm... I don't watch the TV, what's it called? The funny kind of restling when big guys in run to the sides of the ring pull the rubber bands and then jump at each other? :) The idea was to produce the fastest result possible by any means possible. Eventually it was educational - discovering how search isn't super fast... – Burtburta 18/7, 2013 at 20:39

Also, I think it was specifically crafted against languages with immutable strings, but StringBuilder wouldn't have helped you in Java - there's no way to search on it, not as far as I know. BTW. I'm not marking this question as answered because I still want to try ropes, but haven't get the time to do it. – Burtburta 18/7, 2013 at 20:42

@wvxvw, no problem, I added this answer for posterity. I might add a sample I used to test several implementations. I didn't get to the point of no-bounds-checking aref (or svref/schar), as I suspect that no implementation provides that with type declarations alone. About StringBuilder, it has indexOf() and replace() methods, so I guess they could be faster. – Declamation 18/7, 2013 at 21:46

Oh, true, for some reason I thought it stored the chunks as added and concatenated them all when it needs to write, but now I see that's not true (its indexOf is technically just a call to String.indexOf on the value of the builder). – Burtburta 19/7, 2013 at 6:10

Actually, in that same page, the author states that a StringBuilder approach, recommended by a reader, is significantly faster, along the lines of C/C++ and Perl. Search for "June 2011 Update". The conclusions is again very sarcasting against Java, at least based only in this benchmark. – Declamation 19/7, 2013 at 15:43

I've made yet another test with fill-pointer working similar to StringBuilder, to see how much it is possible to save on concatenation. – Burtburta 19/7, 2013 at 22:37

@wvxvw, the sample I have (I'll try to clean it up and post tomorrow) shows that Allegro CL, LispWorks and SBCL do direct access without bounds checking given proper declarations, such as (declare (type (mod #.array-dimension-limit) i) (type (simple-array character (#.array-dimension-limit)) a)) (aref a i)). However, your version strikes what I found out to be the bottleneck in Allegro CL and SBCL: growing the vector. The least you do it (e.g. number of times), the better. LispWorks doesn't seem to suffer from this problem. – Declamation 23/7, 2013 at 21:33

I was getting a notice from SBCL about having to do runtime bounds checking (for arrays). But... I don't know... is it really all that costly? I'd imagine that Lisp arrays store length with the array, so it's just comparing two integers (probably already known to be non-negative fixnums). – Burtburta 23/7, 2013 at 22:18

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags