Wrap an open stream with io.TextIOWrapper
P

6

55

How can I wrap an open binary stream – a Python 2 file, a Python 3 io.BufferedReader, an io.BytesIO – in an io.TextIOWrapper?

I'm trying to write code that will work unchanged:

  • Running on Python 2.
  • Running on Python 3.
  • With binary streams generated by the standard library (i.e. I can't control what type they are).
  • With binary streams made to be test doubles (i.e. no file handle, can't re-open).
  • Producing an io.TextIOWrapper that wraps the specified stream.

The io.TextIOWrapper is needed because its API is expected by other parts of the standard library. Other file-like types exist, but don't provide the right API.
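
For illustration – this example is mine, not part of the original question – the kind of API the downstream code relies on looks like this:

>>> import io
>>> text_stream = io.TextIOWrapper(io.BytesIO("Lorem ipsum".encode("utf-8")), encoding="utf-8")
>>> isinstance(text_stream, io.IOBase)
True
>>> text_stream.encoding
'utf-8'
>>> print(text_stream.read())
Lorem ipsum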

Example

Wrapping the binary stream presented as the subprocess.Popen.stdout attribute:

import subprocess
import io

gnupg_subprocess = subprocess.Popen(
        ["gpg", "--version"], stdout=subprocess.PIPE)
gnupg_stdout = io.TextIOWrapper(gnupg_subprocess.stdout, encoding="utf-8")

In unit tests, the stream is replaced with an io.BytesIO instance to control its content without touching any subprocesses or filesystems.

gnupg_subprocess.stdout = io.BytesIO("Lorem ipsum".encode("utf-8"))

That works fine on the streams created by Python 3's standard library. The same code, though, fails on streams generated by Python 2:

[Python 2]
>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = io.TextIOWrapper(gnupg_subprocess.stdout, encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'file' object has no attribute 'readable'

Not a solution: Special treatment for file

An obvious response is to add a branch in the code that tests whether the stream actually is a Python 2 file object, and handles it differently from io.* objects.

That's not an option for well-tested code, because it makes a branch that unit tests – which, in order to run as fast as possible, must not create any real filesystem objects – can't exercise.

The unit tests will be providing test doubles, not real file objects, so creating a branch that those test doubles can't exercise defeats the purpose of the test suite.

Not a solution: io.open

Some respondents suggest re-opening (e.g. with io.open) the underlying file handle:

gnupg_stdout = io.open(
        gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")

That works on both Python 3 and Python 2:

[Python 3]
>>> type(gnupg_subprocess.stdout)
<class '_io.BufferedReader'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
>>> type(gnupg_stdout)
<class '_io.TextIOWrapper'>
[Python 2]
>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
>>> type(gnupg_stdout)
<type '_io.TextIOWrapper'>

But of course it relies on re-opening a real file from its file handle. So it fails in unit tests when the test double is an io.BytesIO instance:

>>> gnupg_subprocess.stdout = io.BytesIO("Lorem ipsum".encode("utf-8"))
>>> type(gnupg_subprocess.stdout)
<type '_io.BytesIO'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno

Not a solution: codecs.getreader

The standard library also has the codecs module, which provides wrapper features:

import codecs

gnupg_stdout = codecs.getreader("utf-8")(gnupg_subprocess.stdout)

That's good because it doesn't attempt to re-open the stream. But it fails to provide the io.TextIOWrapper API. Specifically, it doesn't inherit from io.IOBase and doesn't have the encoding attribute:

>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = codecs.getreader("utf-8")(gnupg_subprocess.stdout)
>>> type(gnupg_stdout)
<type 'instance'>
>>> isinstance(gnupg_stdout, io.IOBase)
False
>>> gnupg_stdout.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 643, in __getattr__
    return getattr(self.stream, name)
AttributeError: '_io.BytesIO' object has no attribute 'encoding'

So codecs doesn't provide objects which substitute for io.TextIOWrapper.

What to do?

So how can I write code that works for both Python 2 and Python 3, with both the test doubles and the real objects, which wraps an io.TextIOWrapper around the already-open byte stream?

Pemberton answered 24/12, 2015 at 5:17 Comment(2)
re: io.open – you could change the unit tests, you know, e.g. use a tempfile.TemporaryFile(); that's a hammer of a solution, of course… – Stucker
This is a rather too limited set of restrictions. Unit tests can open files if that is absolutely the only way to properly test something, for example. So a wrapper function that special-cases file objects to grab the file descriptor can be tested with a unit test just fine. – Beautician
P
7

Based on multiple suggestions in various forums, and on experimenting with the standard library to meet the criteria, my current conclusion is that this can't be done with the library and types as we currently have them.

Pemberton answered 30/12, 2015 at 7:47 Comment(1)
Given how ill-advised wrapping the GnuPG binary in subprocess and similar calls is in the first place, that's probably a good thing – especially in something allegedly meant to be stable, production code. Now granted, the GPGME bindings hadn't been merged with GPGME's master branch when you asked this question originally, but they have been now, and you've still got my email B1; so if this is a thing and the focus is actually GPG rather than data streams in general, it's time to get in touch. Regards, B2. ;) – Detection
S
30

Use codecs.getreader to produce a wrapper object:

text_stream = codecs.getreader("utf-8")(bytes_stream)

Works on Python 2 and Python 3.
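
For example, a small usage sketch (mine, not from the original answer), wrapping an in-memory byte stream:

>>> import codecs
>>> import io
>>> bytes_stream = io.BytesIO("Lorem ipsum".encode("utf-8"))
>>> text_stream = codecs.getreader("utf-8")(bytes_stream)
>>> print(text_stream.read())
Lorem ipsum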

Snider answered 29/12, 2015 at 13:11 Comment(5)
Thanks for the suggestion. That object doesn't provide enough of the io.TextIOWrapper API though, so it isn't a solution. – Pemberton
Ah, too bad. I guess you could put your test data in a file… :/ – Snider
Addressed in the question already: this also needs to work with test doubles that are not real files. – Pemberton
Eventually I used a custom solution based on this. It doesn't address the requirements fully, so it is not a solution; but I'm awarding the bounty as thanks for the help. – Pemberton
Worked like a charm for me. I used this technique in concert with the csv package and boto to stream CSV files from S3. – Buckles
S
16

It turns out you just need to wrap your io.BytesIO in an io.BufferedReader, which exists on both Python 2 and Python 3.

import io

reader = io.BufferedReader(io.BytesIO("Lorem ipsum".encode("utf-8")))
wrapper = io.TextIOWrapper(reader)
wrapper.read()  # returns Lorem ipsum

This answer originally suggested using os.pipe, but the read-side of the pipe would have to be wrapped in io.BufferedReader on Python 2 anyway to work, so this solution is simpler and avoids allocating a pipe.
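
For comparison, here is a minimal sketch of that os.pipe variant (mine, not the answer's original code); it assumes the payload is small enough to fit in the pipe buffer:

import io
import os

pipe_r, pipe_w = os.pipe()
os.write(pipe_w, "Lorem ipsum".encode("utf-8"))
os.close(pipe_w)

# io.open() on the read end yields an io.BufferedReader on both Python 2
# and Python 3, which io.TextIOWrapper accepts directly.
reader = io.open(pipe_r, mode="rb")
wrapper = io.TextIOWrapper(reader, encoding="utf-8")
print(wrapper.read())  # prints "Lorem ipsum"
wrapper.close()        # also closes the underlying file descriptor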

Snider answered 30/12, 2015 at 8:38 Comment(2)
A Python 2 file object (as created by many standard library functions) does not work when passed to the io.BufferedReader constructor: AttributeError: 'file' object has no attribute 'readable'. – Pemberton
Right, I read a few more branches of the question and see what you're getting at now. As you've determined in your own answer, I don't think you can do this for Py2 and Py3 without some tests of the type of object and branching. – Snider
S
4

Okay, this seems to be a complete solution for all cases mentioned in the question, tested with Python 2.7 and Python 3.5. The general approach is still to re-open the file descriptor, but instead of io.BytesIO you use a pipe for the test double so that there is a real file descriptor to re-open.

import io
import subprocess
import os

# Example function, re-opens a file descriptor for UTF-8 decoding,
# reads until EOF and prints what is read.
def read_as_utf8(fileno):
    fp = io.open(fileno, mode="r", encoding="utf-8", closefd=False)
    print(fp.read())
    fp.close()

# Subprocess
gpg = subprocess.Popen(["gpg", "--version"], stdout=subprocess.PIPE)
read_as_utf8(gpg.stdout.fileno())

# Normal file (contains "Lorem ipsum." as UTF-8 bytes)
normal_file = open("loremipsum.txt", "rb")
read_as_utf8(normal_file.fileno())  # prints "Lorem ipsum."

# Pipe (for test harness - write whatever you want into the pipe)
pipe_r, pipe_w = os.pipe()
os.write(pipe_w, "Lorem ipsum.".encode("utf-8"))
os.close(pipe_w)
read_as_utf8(pipe_r)  # prints "Lorem ipsum."
os.close(pipe_r)
Snider answered 30/12, 2015 at 13:59 Comment(4)
Already addressed in the question: the test doubles are not real files. io.open won't work because the test doubles can't be re-opened by path or file handle. – Pemberton
As stated in the answer, I'm addressing that by using pipes instead of BytesIO for the test doubles… or is there some reason you're constrained to use BytesIO? It occurs to me that the very fact that BytesIO (on Python 2) isn't enough "like" the objects you use in your real code is a good reason not to use it as a test double… – Snider
The whole unit test suite is using io.StringIO and io.BytesIO as test doubles for a great many file operations. I'm ruling out "make a special set of test doubles just for this case" as a solution; I'm looking for one that works with the normal fake files (those that inherit from io.IOBase) and the normal real files of both Python versions. – Pemberton
You could use the pipe code path for all situations, then: write the contents of your file, file pointer, BytesIO, etc. to the pipe and attach your reader to the read side, which will always be a real file object. It might be the only solution that works the right way, for all fake and real files, on both Py2 and Py3, with only one code path. – Snider
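
A rough sketch of that single-code-path idea (not code from the thread; the helper name is made up, and it assumes the payload fits in the pipe buffer – roughly 64 KiB on Linux – so no writer thread is needed):

import io
import os

def text_stream_via_pipe(bytes_stream, encoding="utf-8"):
    # Copy whatever the source stream holds into a pipe, then re-open the
    # read end, which is always a real file descriptor.  The same path works
    # for real files, subprocess pipes and io.BytesIO test doubles.
    pipe_r, pipe_w = os.pipe()
    os.write(pipe_w, bytes_stream.read())
    os.close(pipe_w)
    return io.open(pipe_r, mode="r", encoding=encoding)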
B
2

I needed this as well, but based on the thread here, I determined that it was not possible using just Python 2's io module. While this breaks your "Special treatment for file" rule, the technique I went with was to create an extremely thin wrapper for file (code below) that could then be wrapped in an io.BufferedReader, which can in turn be passed to the io.TextIOWrapper constructor. It will be a pain to unit test, as obviously the new code path can't be tested on Python 3.

Incidentally, the reason the results of an open() can be passed directly to io.TextIOWrapper in Python 3 is because a binary-mode open() actually returns an io.BufferedReader instance to begin with (at least on Python 3.4, which is where I was testing at the time).
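
A quick way to see that for yourself (Python 3, with any existing file; this check is mine, not part of the original answer):

[Python 3]
>>> import io
>>> f = open("loremipsum.txt", "rb")   # any existing file will do
>>> isinstance(f, io.BufferedReader)
True
>>> f.close()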

import io
import six  # for six.PY2

if six.PY2:
    class _ReadableWrapper(object):
        def __init__(self, raw):
            self._raw = raw

        def readable(self):
            return True

        def writable(self):
            return False

        def seekable(self):
            return True

        def __getattr__(self, name):
            return getattr(self._raw, name)

def wrap_text(stream, *args, **kwargs):
    # Note: order matters here, as 'file' doesn't exist in Python 3
    if six.PY2 and isinstance(stream, file):
        stream = io.BufferedReader(_ReadableWrapper(stream))

    # Forward any extra arguments (e.g. encoding) to io.TextIOWrapper.
    return io.TextIOWrapper(stream, *args, **kwargs)

At least this is small, so hopefully it minimizes the exposure for parts that cannot easily be unit tested.
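
For illustration, a possible usage (mine, not part of the original answer; it assumes the wrap_text helper above and, for the second call, a loremipsum.txt file containing UTF-8 text as in an earlier answer):

import io

# In-memory test double: no special-casing needed on either Python version.
fake = io.BytesIO("Lorem ipsum".encode("utf-8"))
print(wrap_text(fake, encoding="utf-8").read())

# Real binary file: on Python 2 this goes through the _ReadableWrapper branch;
# on Python 3 a binary-mode open() already returns an io.BufferedReader.
with open("loremipsum.txt", "rb") as raw:
    print(wrap_text(raw, encoding="utf-8").read())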

Barling answered 19/2, 2017 at 18:13 Comment(0)
K
1

Here's some code that I've tested on both Python 2.7 and Python 3.6.

The key here is that you need to call detach() on your previous stream first. This does not close the underlying file; it just rips out the raw stream object so that it can be reused. detach() returns an object that is wrappable with TextIOWrapper.

As an example here, I open a file in binary read mode, do a read on it like that, then I switch to a UTF-8 decoded text stream via io.TextIOWrapper.

I saved this example as this-file.py

import io

fileName = 'this-file.py'
fp = io.open(fileName, 'rb')
fp.seek(20)
someBytes = fp.read(10)
print(type(someBytes), len(someBytes))

# now let's do some wrapping to get a new text (non-binary) stream
pos = fp.tell() # we're about to lose our position, so let's save it
newStream = io.TextIOWrapper(fp.detach(), 'utf-8')  # FYI -- fp is now unusable
newStream.seek(pos)
theRest = newStream.read()
print(type(theRest), len(theRest))

Here's what I get when I run it with both python2 and python3.

$ python2.7 this-file.py 
(<type 'str'>, 10)
(<type 'unicode'>, 406)
$ python3.6 this-file.py 
<class 'bytes'> 10
<class 'str'> 406

Obviously the printed output looks different (Python 2 renders the two print() arguments as a tuple) and, as expected, the variable types differ between Python versions, but it works as it should in both cases.

Kuehn answered 1/3, 2017 at 23:15 Comment(0)
