Confusion about StringIO, cStringIO and BytesIO

Asked 26/5, 2016 at 13:20 Answered 26/5, 2016 at 14:1

Solved python stringio bytesio cstringio

I have googled and also searched on SO for the difference between these buffer modules. However, I still don't understand it very well, and I think some of the posts I read are out of date.

In Python 2.7.11, I downloaded a binary file of a specific format using r = requests.get(url). Then I passed StringIO.StringIO(r.content), cStringIO.StringIO(r.content) and io.BytesIO(r.content) to a function designed for parsing the content.

All three of these work. I mean, even though the file is binary, it's still feasible to use StringIO. Why?

My other question concerns their efficiency.

In [1]: import StringIO, cStringIO, io

In [2]: from numpy import random

In [3]: x = random.random(1000000)

In [4]: %timeit y = cStringIO.StringIO(x)
1000000 loops, best of 3: 736 ns per loop

In [5]: %timeit y = StringIO.StringIO(x)
1000 loops, best of 3: 283 µs per loop

In [6]: %timeit y = io.BytesIO(x)
1000 loops, best of 3: 1.26 ms per loop

As the timings above show, the ranking by construction speed is cStringIO > StringIO > BytesIO.

I found someone mentioning that io.BytesIO always makes a new copy, which costs more time. But some other posts mention that this was fixed in later Python versions.

So, can anyone make a thorough comparison between these IOs, in both the latest Python 2.x and 3.x?

Some of the references I found:

https://trac.edgewall.org/ticket/12046

io.StringIO requires a unicode string. io.BytesIO requires a bytes string. StringIO.StringIO allows either unicode or bytes string. cStringIO.StringIO requires a string that is encoded as a bytes string.

But cStringIO.StringIO('abc') doesn't raise any error.

https://review.openstack.org/#/c/286926/1

The StringIO class is the wrong class to use for this, especially considering that subunit v2 is binary and not a string.

http://comments.gmane.org/gmane.comp.python.devel/148717

cStringIO.StringIO(b'data') didn't copy the data while io.BytesIO(b'data') makes a copy (even if the data is not modified later).

That thread, from 2014, includes a patch that fixes this.

Lots of SO posts not listed here.

Here are the Python 2.7 results for Eric's example

%timeit cStringIO.StringIO(u_data)
1000000 loops, best of 3: 488 ns per loop
%timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
%timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
%timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
%timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
# %timeit io.StringIO(b_data)
# error
# %timeit io.BytesIO(u_data)
# error
%timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop

As for 2.7, cStringIO.StringIO and StringIO.StringIO are far more efficient than io.

Aliquant answered 26/5, 2016 at 13:20 Comment(3)

Can you label each of your snippets as python 2 or python 3? – Pedanticism 26/5, 2016 at 13:45

@Eric, I did all my tests in Python 2.7.11. It seems (c)StringIO is replaced by io in 3. I mainly use 2.7. But I think it would be meaningful for other readers to discuss both versions. – Aliquant 26/5, 2016 at 13:55

io is in python 2 as well – Pedanticism 26/5, 2016 at 14:2

You should use io.StringIO for handling unicode objects and io.BytesIO for handling bytes objects in both python 2 and 3, for forwards-compatibility (this is all 3 has to offer).
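As a minimal sketch of that rule (written for Python 3; the helper name `accepts` is made up for illustration), each `io` class takes exactly one kind of input and rejects the other with a `TypeError`:

```python
import io

# Text goes into io.StringIO, binary data goes into io.BytesIO.
text_buf = io.StringIO(u"\u2603 snowman")
bytes_buf = io.BytesIO(b"\x00\x01\x02")

print(repr(text_buf.read()))
print(repr(bytes_buf.read()))


def accepts(factory, value):
    """Hypothetical helper: True if factory(value) succeeds, False on TypeError."""
    try:
        factory(value)
    except TypeError:
        return False
    return True


# Mixing the two types is rejected at construction time.
print(accepts(io.StringIO, b"abc"))  # bytes into a text buffer
print(accepts(io.BytesIO, u"abc"))   # text into a bytes buffer
```

Because the mismatch fails immediately and regardless of the value, a wrong choice shows up on the first call rather than only for some inputs.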

Here's a better test (for python 2 and 3), that doesn't include conversion costs from numpy to str/bytes

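import numpy as np
import string
# random printable characters, serialized to a single bytes object
b_data = np.random.choice(list(string.printable), size=1000000).tobytes()
# the same data as a unicode string
u_data = b_data.decode('ascii')
u_data = u'\u2603' + u_data[1:]  # add a non-ascii character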

And then:

import io
%timeit io.StringIO(u_data)
%timeit io.StringIO(b_data)
%timeit io.BytesIO(u_data)
%timeit io.BytesIO(b_data)

In python 2, you can also test:

import StringIO, cStringIO
%timeit cStringIO.StringIO(u_data)
%timeit cStringIO.StringIO(b_data)
%timeit StringIO.StringIO(u_data)
%timeit StringIO.StringIO(b_data)

Some of these will raise an error, either because of the input type or because of the non-ascii character.

Python 3.5 results:

>>> %timeit io.StringIO(u_data)
100 loops, best of 3: 8.61 ms per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be str or None, not bytes
>>> %timeit io.BytesIO(u_data)
TypeError: a bytes-like object is required, not 'str'
>>> %timeit io.BytesIO(b_data)
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 344 ns per loop

Python 2.7 results (run on a different machine):

>>> %timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be unicode or None, not str
>>> %timeit io.BytesIO(u_data)
TypeError: 'unicode' does not have the buffer interface
>>> %timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop

>>> %timeit cStringIO.StringIO(u_data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
>>> %timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
>>> %timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
>>> %timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
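The UnicodeEncodeError from cStringIO.StringIO(u_data) names the 'ascii' codec, which suggests that Python 2 tried to encode the unicode string as ASCII behind the scenes. The sketch below reproduces that failure explicitly in Python 3, where nothing is encoded implicitly. The helper name `try_encode` and the choice of UTF-8 are assumptions made for illustration only:

```python
import io

snowman = u"\u2603"


def try_encode(text, codec):
    """Hypothetical helper: encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# ASCII cannot represent the snowman, so this encode fails.
ascii_result = try_encode(snowman, "ascii")

# An explicit encoding that covers the character succeeds,
# and the resulting bytes can go into io.BytesIO.
utf8_result = try_encode(snowman, "utf-8")
buf = io.BytesIO(utf8_result)

print(ascii_result)
print(buf.getvalue().decode("utf-8") == snowman)
```

Choosing the encoding yourself means the outcome no longer depends on whether the data happens to be pure ASCII.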

Pedanticism answered 26/5, 2016 at 14:1 Comment(14)

So in 3.x, BytesIO is distinct from and much faster than StringIO, in contrast to in 2.x. – Aliquant 26/5, 2016 at 14:15

io.BytesIO and io.StringIO are not comparable, as one only works on binary input, and the other only works on unicode strings – Pedanticism 26/5, 2016 at 14:16

I complemented the 2.7 tests. Maybe you can put them in your post? – Aliquant 26/5, 2016 at 14:38

cStringIO.StringIO(u_data) errors on my machine. Did you run the line that inserts a unicode snowman? – Pedanticism 26/5, 2016 at 15:9

u_data[0] is \t – Aliquant 26/5, 2016 at 15:51

@Lee: No way. u_data[0] is clearly u'\u2603', because that's what I set it to. Did you forget to run my last line? – Pedanticism 26/5, 2016 at 23:33

I ran your old code. I remember u'\u2603' wasn't in your original post. Now an exception is raised. UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128) – Aliquant 27/5, 2016 at 11:40

@Lee: Right, that's what I'd expect. cStringIO is unsafe to use on unicode data, as it can fail at runtime depending on the value. – Pedanticism 27/5, 2016 at 11:47

So BytesIO is both safer and faster to use in python 3x compared to StringIO ! – Bari 27/8, 2017 at 17:3

No @stormfield, it is not safer. One is for str, the other is for bytes. Choose based on what type of data you're writing, not on speed. – Pedanticism 27/8, 2017 at 23:12

These time tests are not very helpful, because you only measure the time for creating the StringIO object but not the time for using it, which is what really matters. A better test would be to perform multiple reads or call a function such as readlines(). – Tint 14/2, 2018 at 16:46

What I don't understand is if I want to process the file byte-by-byte with Python 2.7 regardless of the encoding or anything, so I use io.BytesIO, then why the hell on the Earth does it complain about Unicode? Why does it matter if the binary data is Unicode or not? This is retarded. I just want to access the bytes... – Scarlatti 13/12, 2018 at 23:44

@CsabaToth: I have no idea what you're trying to say but I think 1) it stems from a confusion between unicode code points and bytes, and 2) would be far easier to understand if you asked it as a new question – Pedanticism 14/12, 2018 at 5:42

@CsabaToth If you simply want to store a bunch of text data and read it all at once, and you're okay with encoding your string as a binary object with the .encode() method, then BytesIO works just fine (remember to .decode the resulting binary back to a string). But if you plan to work on it like a Unicode text file, and want .seek(1) to seek by actually one character, then you have to use StringIO. – Parlour 25/7, 2022 at 21:26
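The difference between seeking by character and seeking by byte can be sketched as follows. This assumes Python 3 on CPython, where offsets on an in-memory io.StringIO count code points as far as I know, and it assumes UTF-8 as the encoding; the variable names are illustrative:

```python
import io

text = u"\u2603abc"  # snowman followed by three ASCII letters

# StringIO positions count characters: offset 1 is the letter after the snowman.
chars = io.StringIO(text)
chars.seek(1)
second_char = chars.read(1)

# BytesIO positions count bytes: offset 1 lands inside the
# three-byte UTF-8 encoding of the snowman.
raw = io.BytesIO(text.encode("utf-8"))
raw.seek(1)
second_byte = raw.read(1)

# Reading everything at once and decoding recovers the original text.
round_trip = raw.getvalue().decode("utf-8")

print(repr(second_char), repr(second_byte), round_trip == text)
```

So BytesIO is fine for storing encoded text and reading it back whole, but positions inside it are byte offsets, not character offsets.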
