Real file objects slower than StringIO and cStringIO?

StringIO has the following notes in its code:

Notes:
- Using a real file is often faster (but less convenient).
- There's also a much faster implementation in C, called cStringIO, but
  it's not subclassable.

The "real file is often faster" line seemed really odd to me: how could writing to disk beat writing to memory? I tried profiling these different cases and got results that contradict these docs, as well as the answer to this question. This other question does explain why cStringIO is slower under some circumstances, though I'm not doing any concatenating here. The test writes a given amount of data to a file, then seeks to the beginning and reads it back out. On the "new" tests, I created a new object each time, and on the "same" ones I truncate and reuse the same object for each repetition to rule out that source of overhead. That overhead mattered for using tempfiles with small data sizes but not large ones.

Code is here.
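To make the structure concrete, each pass boils down to roughly the following (a minimal sketch, not the exact linked harness; factory, flo and test_data are illustrative names):

import time

def one_pass(flo, test_data):
    # Write the whole payload, rewind, and read it back.
    flo.write(test_data)
    flo.seek(0)
    flo.read()

def bench_new(factory, test_data, passes=1000):
    # "New" case: create a fresh file-like object for every pass.
    start = time.time()
    for _ in range(passes):
        one_pass(factory(), test_data)
    return time.time() - start

def bench_same(flo, test_data, passes=1000):
    # "Same" case: reuse one object, truncating it between passes.
    start = time.time()
    for _ in range(passes):
        flo.seek(0)
        flo.truncate()
        one_pass(flo, test_data)
    return time.time() - start

So bench_new(cStringIO.StringIO, data) corresponds to the "New cStringIO" rows below, and bench_same(tempfile.TemporaryFile(), data) to "Same tempfile".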

Using 1000 passes with size 1.0KiB
New StringIO:   0.0026 0.0025 0.0034
Same StringIO:  0.0026 0.0023 0.0030
New cStringIO:  0.0009 0.0010 0.0008
Same cStringIO: 0.0009 0.0009 0.0009
New tempfile:   0.0679 0.0554 0.0542
Same tempfile:  0.0069 0.0064 0.0070
==============================================================
Using 1000 passes with size 100.0KiB
New StringIO:   0.0093 0.0099 0.0108
Same StringIO:  0.0109 0.0090 0.0086
New cStringIO:  0.0130 0.0139 0.0120
Same cStringIO: 0.0118 0.0115 0.0124
New tempfile:   0.1006 0.0905 0.0899
Same tempfile:  0.0573 0.0526 0.0523
==============================================================
Using 1000 passes with size 1.0MiB
New StringIO:   0.0727 0.0700 0.0717
Same StringIO:  0.0740 0.0735 0.0712
New cStringIO:  0.1484 0.1399 0.1470
Same cStringIO: 0.1493 0.1393 0.1465
New tempfile:   0.6576 0.6750 0.6821
Same tempfile:  0.5951 0.5870 0.5678
==============================================================
Using 1000 passes with size 10.0MiB
New StringIO:   1.0965 1.1129 1.1079
Same StringIO:  1.1206 1.2979 1.1932
New cStringIO:  2.2532 2.2162 2.2482
Same cStringIO: 2.2624 2.2225 2.2377
New tempfile:   6.8350 6.7924 6.8481
Same tempfile:  6.8424 7.8114 7.8404
==============================================================

The two StringIO implementations were fairly comparable, though cStringIO slowed down significantly at the larger data sizes. But tempfile.TemporaryFile always took roughly three times as long as the slower of the two StringIO implementations.

Bagasse answered 3/2, 2016 at 18:37 Comment(3)
I don't believe you're comparing the right things here. These answers are talking about your basic Python file wrapper -- file -- not tempfile.TemporaryFile.Arleen
Probably because file I/O is both highly optimized by the OS and goes through an in-memory cache, all of which is likely to be faster in some common cases than whatever the Python module is doing in the background.Papillon
@Two-BitAlchemist I wouldn't expect TemporaryFile to be much different than file for reading and writing. And looking at tempfile, TemporaryFile is actually a function that returns files, at least on POSIX (which is what I used). Any overhead from creation differences should be taken care of by the "Same tempfile" cases.Bagasse

It all depends on what "often" means. StringIO is implemented by keeping your writes in a list and then joining the list into a string on read. Your test case - a series of writes followed by a read - is its best-case scenario. If I tweak the test case to do 50 random writes/reads in the file, then cStringIO tends to win, with the file system in second place.
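Roughly, the pure-Python buffering looks like this (a simplified model to show the idea, not the actual StringIO source):

class ListBackedIO(object):
    def __init__(self):
        self.buf = ''       # everything consolidated so far
        self.buflist = []   # pending sequential writes

    def write(self, s):
        # Appending is cheap as long as you only ever write at the end.
        self.buflist.append(s)

    def getvalue(self):
        # A read forces the join; with append-only writes it happens once.
        if self.buflist:
            self.buf += ''.join(self.buflist)
            self.buflist = []
        return self.buf

Once you seek into the middle and write, the real implementation has to join the pending list and rebuild the string around the insertion point, so each of the 50 random writes below can end up copying the whole buffer.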

The comment seems to reflect a systems programmer's bias toward letting the C libraries plus the operating system handle file-system work, because it's hard to guess in a general sense what performs best under all conditions.

import random  # for the random seek offsets

def write_and_read_test_data(flo):
    # `closure` is presumably the shared dict from the benchmark code
    # linked in the question.
    fsize = len(closure['test_data'])
    flo.write(closure['test_data'])
    # 50 random-access writes, each followed by a 1-byte read.
    for _ in range(50):
        flo.seek(random.randint(0, fsize - 1))
        flo.write('x')
        flo.read(1)
    flo.seek(0)
    closure['output'] = flo.read()

The 10 MiB test case took longer than my attention span...

Using 1000 passes with size 1.0KiB
New StringIO:   0.9551 0.9467 0.9366
Same StringIO:  0.9252 0.9228 0.9207
New cStringIO:  0.3274 0.3280 0.3251
Same cStringIO: 0.3182 0.3231 0.3280
New tempfile:   1.1833 1.1853 1.1650
Same tempfile:  0.9563 0.9414 0.9504
==============================================================
Using 1000 passes with size 100.0KiB
New StringIO:   5.6253 5.6589 5.6025
Same StringIO:  5.5799 5.5608 5.5589
New cStringIO:  0.4157 0.4133 0.4140
Same cStringIO: 0.4078 0.4076 0.4088
New tempfile:   2.0420 2.0391 2.0408
Same tempfile:  1.5722 1.5749 1.5693
==============================================================
Using 1000 passes with size 1.0MiB
New StringIO:   105.2350 106.3904 107.5411
Same StringIO:  108.3744 109.4510 105.6012
New cStringIO:  2.4698 2.4781 2.4165
Same cStringIO: 2.4699 2.4600 2.4451
New tempfile:   6.6086 6.5783 6.5916
Same tempfile:  6.1420 6.1614 6.1366
Josettejosey answered 3/2, 2016 at 19:27 Comment(1)
D'oh! Random access must cause StringIO to copy huge amounts of data, since it's backed by immutable strings. Interesting: the other answer claims it's due to Python's interpreted nature, but it seems that's not actually the culprit. I just added _pyio.BytesIO to the test: it's implemented in Python, like StringIO, but uses mutable bytearrays. It only took twice as long as cStringIO and still beat TemporaryFile despite being interpreted. I think I'll add an answer about that to the other question. Thanks!Bagasse
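(For anyone wanting to reproduce that last comparison, adding the pure-Python case is just one more factory in the benchmark; the factories table below is illustrative, not from the linked code.)

import StringIO
import cStringIO
import tempfile
import _pyio

factories = {
    'StringIO':      StringIO.StringIO,
    'cStringIO':     cStringIO.StringIO,
    '_pyio.BytesIO': _pyio.BytesIO,          # pure Python, bytearray-backed
    'tempfile':      tempfile.TemporaryFile,
}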
