I became curious about how string comparison works internally in Python while solving the following example algorithm problem:
Given two strings, return the length of the longest common prefix
Solution 1: charByChar
My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like
def charByChar(smaller, bigger):
    assert len(smaller) <= len(bigger)
    for p in range(len(smaller)):
        if smaller[p] != bigger[p]:
            return p
    return len(smaller)
To simplify the code, the function assumes that the length of the first string, smaller, is always smaller than or equal to the length of the second string, bigger.
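For example (inputs of my own, just to illustrate the return value):

# Hypothetical inputs, just to show what the function returns
print(charByChar("abcdef", "abcxyz"))  # 3: the common prefix is "abc"
print(charByChar("abc", "abcdef"))     # 3: smaller is itself a prefix of bigger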
Solution 2: binarySearch
Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix is at least as long as the midpoint. Otherwise the common prefix is shorter than the midpoint. We can then recurse to find the prefix length.
Aka binary search.
def binarySearch(smaller, bigger):
    assert len(smaller) <= len(bigger)
    lo = 0
    hi = len(smaller)
    # binary search for prefix
    while lo < hi:
        # +1 for even lengths
        mid = ((hi - lo + 1) // 2) + lo
        if smaller[:mid] == bigger[:mid]:
            # prefixes equal
            lo = mid
        else:
            # prefixes not equal
            hi = mid - 1
    return lo
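A quick sanity check along these lines confirms the two implementations agree (random inputs, illustrative only):

import random
import string

# Property check: both implementations must report the same prefix length
for _ in range(1000):
    prefix = "".join(random.choices(string.ascii_lowercase, k=random.randint(0, 20)))
    a = prefix + "".join(random.choices(string.ascii_lowercase, k=random.randint(0, 20)))
    b = prefix + "".join(random.choices(string.ascii_lowercase, k=random.randint(0, 20)))
    smaller, bigger = sorted((a, b), key=len)
    assert charByChar(smaller, bigger) == binarySearch(smaller, bigger)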
At first I assumed that binarySearch would be slower because string comparison would compare all characters several times rather than just the prefix characters as in charByChar.
Surprisingly, binarySearch turned out to be much faster after some preliminary benchmarking.
Figure A
The figure above shows how performance is affected as the prefix length is increased; the suffix length remains constant at 50 characters.
This graph shows two things:
- As expected, both algorithms perform linearly worse as prefix length increases.
- Performance of charByChar degrades at a much faster rate.
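For reference, a timeit harness roughly like the following reproduces this kind of measurement (a sketch: the sizes and repetition counts are placeholders, not the exact setup behind Figure A):

import timeit

SUFFIX_LEN = 50  # suffix held constant, as in Figure A

def make_pair(prefix_len):
    # identical prefix, differing constant-length suffixes
    prefix = "a" * prefix_len
    return prefix + "b" * SUFFIX_LEN, prefix + "c" * SUFFIX_LEN

for prefix_len in (10_000, 100_000, 1_000_000):
    smaller, bigger = make_pair(prefix_len)
    for fn in (charByChar, binarySearch):
        t = timeit.timeit(lambda: fn(smaller, bigger), number=10)
        print(f"{fn.__name__:12}  prefixLen={prefix_len:>9,}  {t:.4f}s")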
Why is binarySearch so much better? I think it is because:
- The string comparison in binarySearch is presumably optimized by the interpreter / CPU behind the scenes.
- charByChar actually creates new strings for each character accessed, and this produces significant overhead.
To validate this I benchmarked the performance of comparing and slicing a string, labelled cmp and slice respectively below.
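A micro-benchmark along these lines shows the trend (illustrative sizes, not the exact setup behind Figure B):

import timeit

for n in (10**5, 10**6, 10**7):
    a = "x" * n
    b = "x" * n
    cmp_t = timeit.timeit(lambda: a == b, number=100)         # cmp: compare two length-n strings
    slice_t = timeit.timeit(lambda: a[:n // 2], number=100)   # slice: take half of a length-n string
    print(f"n={n:>10,}  cmp={cmp_t:.5f}s  slice={slice_t:.5f}s")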
Figure B
This graph shows two important things:
- As expected, the cost of comparing and slicing increases linearly with length.
- Relative to the algorithm performance in Figure A, the cost of comparing and slicing increases very slowly with length. Note that both figures go up to strings of 1 billion characters. Therefore, the cost of comparing 1 character 1 billion times is much, much greater than the cost of comparing 1 billion characters once. But this still doesn't answer why ...
CPython
In an effort to find out how the CPython interpreter optimizes string comparison, I generated the bytecode for the following function.
In [9]: def slice_cmp(a, b): return a[0] == b[0]
In [10]: dis.dis(slice_cmp)
0 LOAD_FAST 0 (a)
2 LOAD_CONST 1 (0)
4 BINARY_SUBSCR
6 LOAD_FAST 1 (b)
8 LOAD_CONST 1 (0)
10 BINARY_SUBSCR
12 COMPARE_OP 2 (==)
14 RETURN_VALUE
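For comparison, the slice-based check that binarySearch performs on every iteration can be disassembled the same way; on the Python 3 versions current at the time of writing it compiles to BUILD_SLICE / BINARY_SUBSCR followed by the same COMPARE_OP, so the interesting work must happen below the bytecode level (prefix_cmp is just an illustrative name; output omitted):

import dis

def prefix_cmp(a, b):
    # the kind of comparison binarySearch does on each iteration
    return a[:5] == b[:5]

dis.dis(prefix_cmp)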
I poked around the CPython source code and found the following two pieces of code, but I'm not sure this is where string comparison occurs.
The question
- Where in the CPython source does string comparison occur?
- Is there a CPU optimization? Is there a special x86 instruction which does string comparison? How can I see what assembly instructions are generated by CPython? You may assume I am using the latest Python 3, an Intel Core i5, and OS X 10.11.6.
- Why is comparing a long string so much faster than comparing each of its characters?
Bonus question: When is charByChar more performant?
If the prefix is sufficiently small in comparison to the length of the rest of the string, at some point the cost of creating substrings in charByChar becomes less than the cost of comparing the substrings in binarySearch.
To describe this relationship I delved into runtime analysis.
Runtime analysis
To simplify the equations below, let's assume that smaller and bigger are the same size, and I will refer to them as s1 and s2.
charByChar
charByChar(s1, s2) = costOfOneChar * prefixLen
where
costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)
Here cmp(1) is the cost of comparing two strings of length 1, and slice is the cost of accessing a char, the equivalent of charAt(i). Python has immutable strings, and accessing a char actually creates a new string of length 1. slice(string_len, slice_len) is the cost of slicing a string of length string_len to a slice of size slice_len.
So
charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)
binarySearch
binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)
where log_2(s1Len) is the number of times to divide the strings in half until reaching a string of length 1, and
costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)
So the big-O of binarySearch will grow according to
binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))
Based on our previous analysis of the cost of comparing and slicing (Figure B), if we assume that costOfHalfOfEachString is approximately the same as costOfOneChar, then we can refer to them both as x.
charByChar(s1, s2) = O(x * prefixLen)
binarySearch(s1, s2) = O(x * log_2(s1Len))
If we equate them
O(charByChar(s1, s2)) = O(binarySearch(s1, s2))
x * prefixLen = x * log_2(s1Len)
prefixLen = log_2(s1Len)
2 ** prefixLen = s1Len
So O(charByChar(s1, s2)) > O(binarySearch(s1, s2))
when
2 ** prefixLen > s1Len
with the break-even point at 2 ** prefixLen = s1Len.
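Under this model the predicted break-even point is easy to tabulate (illustrative values only):

# Model prediction: the two algorithms should cost roughly the same
# when prefixLen == log_2(s1Len), i.e. s1Len == 2 ** prefixLen
for prefixLen in (10, 20, 30):
    print(f"prefixLen = {prefixLen}  predicted break-even s1Len = {2 ** prefixLen:,}")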
So plugging in the above formula, I regenerated tests for Figure A but with strings of total length 2 ** prefixLen, expecting the performance of the two algorithms to be roughly equal.
However, charByChar clearly performs much better. With a bit of trial and error, the performance of the two algorithms is roughly equal when s1Len = 200 * prefixLen.
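That kind of trial and error can be scripted as a scan like the following (hypothetical sizes, not the exact script used): fix prefixLen, grow s1Len, and look for the point where the timing ratio crosses 1.

import timeit

def time_fn(fn, smaller, bigger, number=5):
    return timeit.timeit(lambda: fn(smaller, bigger), number=number)

prefixLen = 10_000
for multiplier in (50, 100, 200, 400):
    s1Len = multiplier * prefixLen
    prefix = "a" * prefixLen
    smaller = prefix + "b" * (s1Len - prefixLen)
    bigger = prefix + "c" * (s1Len - prefixLen)
    ratio = time_fn(charByChar, smaller, bigger) / time_fn(binarySearch, smaller, bigger)
    print(f"s1Len = {multiplier} * prefixLen  ->  charByChar / binarySearch time ratio = {ratio:.2f}")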
Why is the relationship 200x?