Confused about StringIO, cStringIO and BytesIO

I have googled and also searched SO for the differences between these buffer modules. However, I still don't understand them very well, and I think some of the posts I read are out of date.

In Python 2.7.11, I downloaded a binary file of a specific format using r = requests.get(url). Then I passed StringIO.StringIO(r.content), cStringIO.StringIO(r.content) and io.BytesIO(r.content) to a function designed for parsing the content.

All three approaches work. That is, even though the file is binary, StringIO still handles it. Why?

My other question concerns their efficiency.

In [1]: import StringIO, cStringIO, io

In [2]: from numpy import random

In [3]: x = random.random(1000000)

In [4]: %timeit y = cStringIO.StringIO(x)
1000000 loops, best of 3: 736 ns per loop

In [5]: %timeit y = StringIO.StringIO(x)
1000 loops, best of 3: 283 µs per loop

In [6]: %timeit y = io.BytesIO(x)
1000 loops, best of 3: 1.26 ms per loop

As illustrated above, in construction speed cStringIO is fastest, followed by StringIO, then BytesIO.

I found a mention that io.BytesIO always makes a new copy, which costs more time. But other posts say this was fixed in later Python versions.

So, can anyone make a thorough comparison of these IO classes in both the latest Python 2.x and 3.x?


Some of the references I found:

  • https://trac.edgewall.org/ticket/12046

    io.StringIO requires a unicode string. io.BytesIO requires a bytes string. StringIO.StringIO allows either unicode or bytes string. cStringIO.StringIO requires a string that is encoded as a bytes string.

But cStringIO.StringIO('abc') doesn't raise any error.

A patch fixing this was posted there in 2014.

  • Lots of SO posts not listed here.

Here are the Python 2.7 results for Eric's example:

%timeit cStringIO.StringIO(u_data)
1000000 loops, best of 3: 488 ns per loop
%timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
%timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
%timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
%timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
# %timeit io.StringIO(b_data)
# error
# %timeit io.BytesIO(u_data)
# error
%timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop

In 2.7, cStringIO.StringIO and StringIO.StringIO are far faster to construct than the io classes.

Aliquant answered 26/5, 2016 at 13:20 Comment(3)
Can you label each of your snippets as Python 2 or Python 3? - Pedanticism
@Eric, I did all my tests in Python 2.7.11. It seems (c)StringIO is replaced by io in 3. I mainly use 2.7. But I think it would be meaningful for other readers to discuss both versions. - Aliquant
io is in Python 2 as well. - Pedanticism
P
26

You should use io.StringIO for handling unicode objects and io.BytesIO for handling bytes objects in both Python 2 and 3, for forward compatibility (these two are all that Python 3 offers).
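As a minimal sketch of that rule (the helper name `accepts` is made up for illustration), the following checks which input types each io class takes. It also hints at why binary content works with StringIO.StringIO in Python 2: there, a plain str is already a byte string.

```python
import io

def accepts(factory, value):
    """Return True if factory(value) succeeds, False if it raises TypeError."""
    try:
        factory(value)
    except TypeError:
        return False
    return True

# Text goes in io.StringIO, bytes go in io.BytesIO; the reverse is rejected.
text_ok = accepts(io.StringIO, u"caf\u00e9")
text_rejects_bytes = not accepts(io.StringIO, b"\x89PNG")
bytes_ok = accepts(io.BytesIO, b"\x89PNG")
bytes_rejects_text = not accepts(io.BytesIO, u"caf\u00e9")

print(text_ok, text_rejects_bytes, bytes_ok, bytes_rejects_text)
```

For a binary download such as r.content, this points to io.BytesIO.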


Here's a better test (for Python 2 and 3) that doesn't include the cost of converting from numpy to str/bytes:

import numpy as np
import string
b_data = np.random.choice(list(string.printable), size=1000000).tobytes()
u_data = b_data.decode('ascii')
u_data = u'\u2603' + u_data[1:]  # add a non-ascii character

And then:

import io
%timeit io.StringIO(u_data)
%timeit io.StringIO(b_data)
%timeit io.BytesIO(u_data)
%timeit io.BytesIO(b_data)

In Python 2, you can also test:

import StringIO, cStringIO
%timeit cStringIO.StringIO(u_data)
%timeit cStringIO.StringIO(b_data)
%timeit StringIO.StringIO(u_data)
%timeit StringIO.StringIO(b_data)

Some of these will raise an error, complaining about the wrong input type or about non-ASCII characters.
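Outside IPython, roughly the same comparison can be sketched with the standard timeit module. This is only an approximation of the test above: it builds the data with random instead of numpy, uses a smaller size, and the helper `time_constructor` is a made-up name. Failing combinations are reported by exception type rather than stopping the run.

```python
import io
import random
import string
import timeit

# Comparable test data built with the standard library only (an assumption;
# the original test uses numpy and one million characters).
random.seed(0)
u_data = u"\u2603" + u"".join(random.choice(string.printable) for _ in range(100000))
b_data = u_data.encode("utf-8")

def time_constructor(factory, data, number=100):
    """Return seconds per call, or the exception type name if the call fails."""
    try:
        factory(data)  # probe once so a mismatch is reported instead of timed
    except (TypeError, UnicodeError) as exc:
        return type(exc).__name__
    return timeit.timeit(lambda: factory(data), number=number) / number

results = {
    "io.StringIO(u_data)": time_constructor(io.StringIO, u_data),
    "io.StringIO(b_data)": time_constructor(io.StringIO, b_data),
    "io.BytesIO(u_data)": time_constructor(io.BytesIO, u_data),
    "io.BytesIO(b_data)": time_constructor(io.BytesIO, b_data),
}

for label, outcome in sorted(results.items()):
    print(label, outcome)
```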


Python 3.5 results:

>>> %timeit io.StringIO(u_data)
100 loops, best of 3: 8.61 ms per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be str or None, not bytes
>>> %timeit io.BytesIO(u_data)
TypeError: a bytes-like object is required, not 'str'
>>> %timeit io.BytesIO(b_data)
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 344 ns per loop

Python 2.7 results (run on a different machine):

>>> %timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be unicode or None, not str
>>> %timeit io.BytesIO(u_data)
TypeError: 'unicode' does not have the buffer interface
>>> %timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop
>>> %timeit cStringIO.StringIO(u_data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
>>> %timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
>>> %timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
>>> %timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
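All of the figures above time construction only. A rough sketch that also times reading the buffer back, here with readlines(), might look like the following. The data and function names are invented for illustration, and the absolute numbers will vary by machine and Python version.

```python
import io
import timeit

# Hypothetical sample data: 10,000 short lines containing a non-ASCII character.
text = u"line of text \u2603\n" * 10000
raw = text.encode("utf-8")

def build_and_read_text():
    """Construct an io.StringIO and read every line back."""
    return io.StringIO(text).readlines()

def build_and_read_bytes():
    """Construct an io.BytesIO and read every line back."""
    return io.BytesIO(raw).readlines()

text_lines = build_and_read_text()
byte_lines = build_and_read_bytes()

runs = 20
text_seconds = timeit.timeit(build_and_read_text, number=runs) / runs
byte_seconds = timeit.timeit(build_and_read_bytes, number=runs) / runs

print("io.StringIO build + readlines: %.6f s" % text_seconds)
print("io.BytesIO  build + readlines: %.6f s" % byte_seconds)
```

The two timings still are not a like-for-like comparison, since one buffer yields text lines and the other yields byte lines.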
Pedanticism answered 26/5, 2016 at 14:1 Comment(14)
So in 3.x, BytesIO is distinct from and much faster than StringIO, in contrast to 2.x. - Aliquant
io.BytesIO and io.StringIO are not comparable, as one only works on binary input and the other only works on unicode strings. - Pedanticism
I added the 2.7 tests. Maybe you can put them in your post? - Aliquant
cStringIO.StringIO(u_data) errors on my machine. Did you run the line that inserts a unicode snowman? - Pedanticism
u_data[0] is \t - Aliquant
@Lee: No way. u_data[0] is clearly u'\u2603', because that's what I set it to. Did you forget to run my last line? - Pedanticism
I ran your old code. I remember u'\u2603' wasn't in your earlier version. Now an exception is raised: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128) - Aliquant
@Lee: Right, that's what I'd expect. cStringIO is unsafe to use on unicode data, as it can fail at runtime depending on the value. - Pedanticism
So BytesIO is both safer and faster to use in Python 3.x compared to StringIO! - Bari
No @stormfield, it is not safer. One is for str, the other is for bytes. Choose based on what type of data you're writing, not on speed. - Pedanticism
These timing tests are not very helpful, because they only measure the time to create the StringIO object, not the time to use it, which is what really matters. A better test would be to perform multiple reads or call a function such as readlines(). - Tint
What I don't understand: if I want to process the file byte by byte in Python 2.7, regardless of the encoding or anything else, and so I use io.BytesIO, why on earth does it complain about Unicode? Why does it matter whether the binary data is Unicode or not? I just want to access the bytes... - Scarlatti
@CsabaToth: I have no idea what you're trying to say, but I think 1) it stems from a confusion between unicode code points and bytes, and 2) it would be far easier to understand if you asked it as a new question. - Pedanticism
@CsabaToth If you simply want to store a bunch of text data and read it all at once, and you're okay with encoding your string as a binary object with the .encode() method, then BytesIO works just fine (remember to .decode() the resulting binary back to a string). But if you plan to work on it like a Unicode text file, and want .seek(1) to seek by exactly one character, then you have to use StringIO. - Parlour
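The difference described in the last comment can be sketched as follows, assuming UTF-8 as the encoding (the comment does not name one). Seeking in a BytesIO moves by bytes and can land inside a multi-byte character, while seeking in a StringIO moves by characters.

```python
import io

text = u"\u2603abc"              # a snowman followed by three ASCII letters
encoded = text.encode("utf-8")   # the snowman takes three bytes in UTF-8

# Round trip through BytesIO: encode on the way in, decode on the way out.
byte_buffer = io.BytesIO(encoded)
round_tripped = byte_buffer.getvalue().decode("utf-8")

# seek(1) on the byte buffer lands in the middle of the snowman.
byte_buffer.seek(1)
second_byte = byte_buffer.read(1)

# seek(1) on the text buffer skips exactly one character.
text_buffer = io.StringIO(text)
text_buffer.seek(1)
second_char = text_buffer.read(1)

print(repr(round_tripped), repr(second_byte), repr(second_char))
```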
