convert io.StringIO to io.BytesIO

Asked 28/4, 2019 at 10:47 Answered 9/6, 2020 at 21:34

Solved python python-3.x encoding io stream

Original question: I have a StringIO object; how can I convert it into a BytesIO?

Update: the more general question is how to convert a binary (encoded) file-like object into a decoded file-like object in Python 3.

The naive approach I have is:

import io
sio = io.StringIO('wello horld')
bio = io.BytesIO(sio.read().encode('utf8'))
print(bio.read())  # prints b'wello horld'

Is there a more efficient and elegant way of doing this? The code above reads everything into memory and encodes it in one go, instead of streaming the data in chunks.

For example, for the reverse question (BytesIO -> StringIO) there exists a class, io.TextIOWrapper, which does exactly that (see this answer).
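For reference, here is a minimal sketch of that reverse direction, using only the standard library. The sample text and variable names are just for illustration.

```python
import io

# A bytes buffer holding UTF-8 encoded text.
bio = io.BytesIO('wello horld\nsecond line\n'.encode('utf-8'))

# TextIOWrapper decodes lazily as the text is read; it does not
# decode the whole buffer up front.
text = io.TextIOWrapper(bio, encoding='utf-8')

first_line = text.readline()
remaining = text.read()
print(repr(first_line))
print(repr(remaining))
```

Note that the wrapper takes ownership of the underlying buffer: closing the TextIOWrapper also closes the BytesIO.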

Veritable answered 28/4, 2019 at 10:47 Comment(3)

Does “more elegant” include implementing it yourself without such a bulk copy? – Obsolesce 30/4, 2019 at 14:11

I hope there is something better; if not, that should still be better than the naive approach, so yes. – Veritable 30/4, 2019 at 15:48

Please be aware that in the original question you ask for BytesIO -> StringIO and in the update StringIO -> BytesIO. And the example continues with BytesIO -> StringIO. – Resultant 4/5, 2019 at 5:52

It's interesting that though the question might seem reasonable, it's not that easy to figure out a practical reason why I would need to convert a StringIO into a BytesIO. Both are basically buffers and you usually need only one of them to make some additional manipulations either with the bytes or with the text.

I may be wrong, but I think your question is actually how to use a BytesIO instance when some code to which you want to pass it expects a text file.

In that case, it is a common question, and the solution is the codecs module.

The two usual cases of using it are the following:

Compose a File Object to Read

In [16]: import codecs, io

In [17]: bio = io.BytesIO(b'qwe\nasd\n')

In [18]: StreamReader = codecs.getreader('utf-8')  # here you pass the encoding

In [19]: wrapper_file = StreamReader(bio)

In [20]: print(repr(wrapper_file.readline()))
'qwe\n'

In [21]: print(repr(wrapper_file.read()))
'asd\n'

In [26]: bio.seek(0)
Out[26]: 0

In [27]: for line in wrapper_file:
    ...:     print(repr(line))
    ...:
'qwe\n'
'asd\n'

Compose a File Object to Write To

In [28]: bio = io.BytesIO()

In [29]: StreamWriter = codecs.getwriter('utf-8')  # here you pass the encoding

In [30]: wrapper_file = StreamWriter(bio)

In [31]: print('жаба', 'цап', file=wrapper_file)

In [32]: bio.getvalue()
Out[32]: b'\xd0\xb6\xd0\xb0\xd0\xb1\xd0\xb0 \xd1\x86\xd0\xb0\xd0\xbf\n'

In [33]: repr(bio.getvalue().decode('utf-8'))
Out[33]: "'жаба цап\\n'"
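As a sketch of the "code that expects a text file" case: the reader returned by codecs.getreader can be handed to csv.reader, which only needs an iterable of strings. The CSV content below is made up for the example.

```python
import codecs
import csv
import io

# Bytes as they might arrive from a network response or an upload.
bio = io.BytesIO('name,word\nfirst,жаба\nsecond,цап\n'.encode('utf-8'))

# Wrap the byte stream so that iterating it yields decoded lines.
text_reader = codecs.getreader('utf-8')(bio)

rows = list(csv.reader(text_reader))
print(rows)
```

io.TextIOWrapper(bio, encoding='utf-8') should work equally well here; the codecs reader is used only to stay with the approach of this answer.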

Uther answered 5/5, 2019 at 23:4 Comment(4)

One reason you need a BytesIO instead of a StringIO can be to upload an in-memory file to an S3 bucket using upload_fileobj. More info here – Tart 21/7, 2020 at 16:43

OutputStreamWriter is an equivalent of the requested wrapper in Java. As of early 2021 Github search yields 1M usages of it. That's for the "practicality" of it. – Moline 11/5, 2021 at 8:42

Very funny string 'Жаба цап гадюку' =) – Donnelly 4/8, 2021 at 13:29

A full example using StringIO and BytesIO: bytes_io = io.BytesIO(string_io.getvalue().encode()) – Tintoretto 7/2, 2023 at 23:58

@foobarna's answer can be improved by inheriting from an io base class:

import io
sio = io.StringIO('wello horld')


class BytesIOWrapper(io.BufferedReader):
    """Wrap a buffered bytes stream over TextIOBase string stream."""

    def __init__(self, text_io_buffer, encoding=None, errors=None, **kwargs):
        super(BytesIOWrapper, self).__init__(text_io_buffer, **kwargs)
        self.encoding = encoding or text_io_buffer.encoding or 'utf-8'
        self.errors = errors or text_io_buffer.errors or 'strict'

    def _encoding_call(self, method_name, *args, **kwargs):
        raw_method = getattr(self.raw, method_name)
        val = raw_method(*args, **kwargs)
        return val.encode(self.encoding, errors=self.errors)

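    def read(self, size=-1):
        # Note: size counts characters of the text stream, not bytes.
        return self._encoding_call('read', size)

    def read1(self, size=-1):
        # Text streams have no read1(); a plain read() is the closest match.
        return self._encoding_call('read', size)

    def peek(self, size=-1):
        # Text streams have no peek(); read ahead, then restore the position.
        position = self.raw.tell()
        data = self._encoding_call('read', size)
        self.raw.seek(position)
        return data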


bio = BytesIOWrapper(sio)
print(bio.read())  # b'wello horld'

Lyophilic answered 3/5, 2019 at 21:42 Comment(2)

UTF-8 is not always single-byte, so this is incorrect: BytesIOWrapper(io.StringIO('אבגד')).read(1) returns two bytes: b'\xd7\x90' – Veritable 7/5, 2019 at 13:14

@ShmulikA, yeah, it returns 1 "character". To really return 1 byte, an intermediate buffer would have to be implemented. – Lyophilic 7/5, 2019 at 14:11

It could be a generally useful tool to convert a character stream into a byte stream, so here goes:

import io

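class EncodeIO(io.BufferedIOBase):
    """Read-only byte stream that encodes text read from a character stream."""

    def __init__(self, stream, encoding='utf-8'):
        self.stream = stream      # not "raw", since it isn't
        self.encoding = encoding
        self.buf = b""            # encoded but not yet returned

    def _read(self, size):
        return self.stream.read(size).encode(self.encoding)

    def read(self, size=-1):
        b = self.buf
        self.buf = b""
        if size is None or size < 0:
            return b + self._read(None)
        ret = []
        while True:
            n = len(b)
            if size < n:
                # More bytes than requested: keep the surplus for the next call.
                b, self.buf = b[:size], b[size:]
                n = size
            ret.append(b)
            size -= n
            if not size:
                break
            # Never request more characters than bytes still missing.
            b = self._read(min((size + 1024) // 2, size))
            if not b:
                break             # underlying stream is exhausted
        return b"".join(ret)

    read1 = read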

Obviously write could be defined symmetrically to decode input and send it to the underlying stream, although then you have to deal with having enough bytes for only part of a character.
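A rough sketch of what that symmetric write side might look like. DecodeWriter is a hypothetical name, not part of the answer's code or of any library. The partial-character problem is delegated to the standard library's incremental decoder, which holds back incomplete byte sequences until the rest arrives.

```python
import codecs
import io


class DecodeWriter(io.BufferedIOBase):
    """Hypothetical write-side counterpart: accepts bytes, writes text."""

    def __init__(self, text_stream, encoding='utf-8', errors='strict'):
        self.stream = text_stream
        # The incremental decoder buffers a trailing partial character.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    def writable(self):
        return True

    def write(self, data):
        self.stream.write(self._decoder.decode(bytes(data)))
        return len(data)  # every byte was consumed or buffered


sio = io.StringIO()
writer = DecodeWriter(sio)
data = 'жаба'.encode('utf-8')   # 8 bytes, 2 per character

writer.write(data[:3])          # one full character plus half of the next
after_first = sio.getvalue()    # only the complete character so far
writer.write(data[3:])
print(sio.getvalue())
```

This sketch does not check for a dangling partial character when the writer is closed; a fuller version would probably call the decoder with final=True at that point.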

Obsolesce answered 1/5, 2019 at 4:50 Comment(1)

@ShmulikA: Loops forever, even; edited. I forgot the break when I rewrote the buffering (before posting). – Obsolesce 7/5, 2019 at 23:21

As some pointed out, you need to do the encoding/decoding yourself.

However, you can achieve this in an elegant way - implementing your own TextIOWrapper for string => bytes.

Here is such a sample:

class BytesIOWrapper:
    def __init__(self, string_buffer, encoding='utf-8'):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr):
        return getattr(self.string_buffer, attr)

    def read(self, size=-1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

    def write(self, b):
        content = b.decode(self.encoding)
        return self.string_buffer.write(content)

Which produces an output like this:

In [36]: bw = BytesIOWrapper(StringIO("some lengt˙˚hyÔstring in here"))

In [37]: bw.read(15)
Out[37]: b'some lengt\xcb\x99\xcb\x9ahy\xc3\x94'

In [38]: bw.tell()
Out[38]: 15

In [39]: bw.write(b'ME')
Out[39]: 2

In [40]: bw.seek(15)
Out[40]: 15

In [41]: bw.read()
Out[41]: b'MEring in here'

Hope it clears your thoughts!

Resultant answered 2/5, 2019 at 22:38 Comment(2)

read(size) must read <= size bytes. However, len(bw.read(15)) is 18. – Fissirostral 2/5, 2019 at 23:20

@FilipDimitrovski Indeed. That is because you ask to "read 15 bytes" when in fact it reads 15 string characters, some of which happen to be 2 bytes long, hence the length of 18. I didn't say it was perfect, but at least it doesn't break the encoding (by splitting a valid UTF-8 character in two). It is a sample, which can be improved by adding more checks or more methods (readline, context manager, etc.). – Resultant 3/5, 2019 at 8:28
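One possible way to honour byte counts, sketched under assumptions: implement only readinto() on an io.RawIOBase subclass and let io.BufferedReader supply read(), readline() and the rest. TextToBytesRaw is a hypothetical name, and the sketch assumes a stateless encoding such as UTF-8, since each chunk is encoded independently.

```python
import io


class TextToBytesRaw(io.RawIOBase):
    """Hypothetical raw stream that encodes text pulled from a text stream."""

    def __init__(self, text_stream, encoding='utf-8', errors='strict'):
        super().__init__()
        self._text = text_stream
        self._encoding = encoding
        self._errors = errors
        self._pending = b''  # encoded bytes not yet handed out

    def readable(self):
        return True

    def readinto(self, buffer):
        if not self._pending:
            # Reading len(buffer) characters may yield more bytes than fit;
            # the surplus stays in _pending for the next call.
            chunk = self._text.read(max(1, len(buffer)))
            self._pending = chunk.encode(self._encoding, self._errors)
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


bio = io.BufferedReader(TextToBytesRaw(io.StringIO('אבגד')))
first = bio.read(1)   # exactly one byte, even though it splits a character
rest = bio.read()
print(first, rest)
```

The trade-off is the one discussed in the comments: returning exactly the requested number of bytes means a read can end in the middle of a multi-byte character.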

I had the exact same need, so I created an EncodedStreamReader class in the nr.utils.io package. It also solves the issue with actually reading the number of bytes requested instead of the number of characters from the wrapped stream.

$ pip install 'nr.utils.io>=0.1.0,<1.0.0'

Example usage:

import io
from nr.utils.io.readers import EncodedStreamReader
fp = EncodedStreamReader(io.StringIO('ä'), 'utf-8')
assert fp.read(1) == b'\xc3'
assert fp.read(1) == b'\xa4'

Sebiferous answered 9/6, 2020 at 21:34 Comment(0)


bio from your example is an _io.BytesIO object. You called read() twice.

I came up with a bytes conversion that needs only one read() call:

import io

sio = io.StringIO('wello horld')
b = bytes(sio.read(), encoding='utf-8')
print(b)

But the second variant should be even faster:

sio = io.StringIO('wello horld')
b = sio.read().encode()
print(b)
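Whether the second variant really is faster is easy to measure. Here is a small timeit sketch; the input size and repetition count are arbitrary, and the timings will differ between machines and Python versions.

```python
import timeit

text = 'wello horld' * 1000

# Both spellings should produce identical bytes.
via_constructor = bytes(text, encoding='utf-8')
via_method = text.encode()

t_constructor = timeit.timeit(lambda: bytes(text, encoding='utf-8'), number=1000)
t_method = timeit.timeit(lambda: text.encode(), number=1000)
print(f'bytes(...): {t_constructor:.4f}s  str.encode(): {t_method:.4f}s')
```

Either way, both variants still read the whole StringIO into memory, just like the code in the question.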

Housebound answered 1/5, 2019 at 15:54 Comment(0)
