How to write Unicode text to a file in Python 2 & 3 using the same code?
S

2

5

I am trying to write a program that runs under both Python 2 and 3. It reads characters from a website and writes them to a file. I have already imported unicode_literals from __future__.

Straight out trying to write a string that looks like this:

txt = u'his$\u2026\n'

Will result in UnicodeEncodeError:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 4: ordinal not in range(128)

The only way to write it to a file in python2 is:

fp = open("/tmp/test", "w")
txt2 = txt.encode('utf-8')
fp.write(txt2) # It works
type(txt2) # str - that is why it works

However, trying to reuse the same code in python3 is not going to work since in python 3,

type(txt2) # bytes in Python 3, not str

E.g.

txt.encode('utf-8')
b'his$\xe2\x80\xa6\n'

Forcing a fp.write(txt2) will throw TypeError:

TypeError: write() argument must be str, not bytes

So, can txt = u'his$\u2026\n' be written to a file using the same code block in both Python 2 and 3? (Other than by using a wrapper around fp.write.)

Subtile answered 7/4, 2018 at 0:28 Comment(3)
Using `print(txt.encode('utf-8'), file=fp)` is only a partial solution. It works in Python 2, but in Python 3 it prints the string representation of the bytes object rather than the actual characters. That is, instead of his$… the file ends up with b'his$\xe2\x80\xa6\n'.Subtile
What do you mean by "mixed string"? I see you've tagged this unicode-normalization; is that an issue here?Squeegee
I shouldn't have said mixed string, my bad. The string when printed looks something like this: his$…Subtile
S
12

You say:

The only way to write it to a file in python2 is:

fp = open("/tmp/test", "w")
txt2 = txt.encode('utf-8')
fp.write(txt2) # It works

But that's not true. There are many ways to do it that are better than this. The One Obvious Way To Do It is with io.open. In 3.x, this is the same function as the builtin open. In 2.6 and 2.7, it's effectively a backport of the 3.x builtin. This means you get 3.x-style Unicode text files in both versions:

import io
fp = io.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # It works; pass the unicode txt, not the encoded txt2
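
For what it's worth, here is a minimal sketch of a full round trip with io.open that should behave the same on 2.6+ and 3.x. The helper names write_text and read_text and the temporary file path are made up for illustration; the point is that the file object does the encoding, so it is handed the unicode string rather than encoded bytes.

```python
import io
import os
import tempfile

# The u'' prefix is accepted by Python 2 and by Python 3.3+.
txt = u'his$\u2026\n'


def write_text(path, text):
    """Write a unicode string as UTF-8 (hypothetical helper)."""
    with io.open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)  # unicode in, the file object encodes it


def read_text(path):
    """Read a UTF-8 file back as a unicode string (hypothetical helper)."""
    with io.open(path, 'r', encoding='utf-8') as fp:
        return fp.read()


# Use a throwaway directory instead of a fixed /tmp path.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
write_text(path, txt)
round_tripped = read_text(path)
```

Using a with block also closes the file for you, which the bare fp = io.open(...) form leaves to the caller.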

If you need compatibility with 2.5 or earlier—or possibly 2.6 and 3.0 (they support io.open, but it's very slow in some cases), you can use the older way, codecs.open:

import codecs
fp = codecs.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # It works; pass the unicode txt, not the encoded txt2

There are differences between the two under the covers, but most code you write isn't going to be interested in the underlying raw file or the encoder buffer or anything else besides the basic file-like object API, so you can also use try/except ImportError to fall back to codecs if io isn't available.
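
The fallback described above could be sketched roughly like this. The alias open_text and the temporary path are my own names for illustration, not anything from the standard library. It relies on both io.open and codecs.open accepting a filename, a mode, and an encoding keyword.

```python
import os
import tempfile

try:
    # Python 2.6+ and 3.x: the 3.x-style text file implementation.
    from io import open as open_text
except ImportError:
    # Older Python 2 without the io module: fall back to codecs.
    from codecs import open as open_text

txt = u'his$\u2026\n'
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Either implementation encodes on write, so pass the unicode string.
with open_text(path, 'w', encoding='utf-8') as fp:
    fp.write(txt)

# Either implementation decodes on read and returns unicode text.
with open_text(path, 'r', encoding='utf-8') as fp:
    content = fp.read()
```

The rest of the code only ever calls open_text, so it does not need to know which module supplied it.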

Squeegee answered 7/4, 2018 at 0:35 Comment(1)
codecs.open fixed my issue. I ignored io.open since it is too slow in 2.6, and I had not looked much into codecs beyond encode/decode. I am marking this as solved. Thanks.Subtile
Z
1

Opening the file with the 'b' mode will allow you to use identical code in Python2 and Python3:

txt = u'his$\u2026\n'

with open("/tmp/test", "wb") as fp:
    fp.write(txt.encode('utf-8'))

result:

$ python2 x.py 
$ md5sum /tmp/test
f39cd7554a823b05658d776a27eb97d9  /tmp/test
$ python3 x.py 
$ md5sum /tmp/test
f39cd7554a823b05658d776a27eb97d9  /tmp/test
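
If you also need to read the file back, the same idea works in reverse: open in 'rb' and decode explicitly. A rough sketch, where write_utf8, read_utf8 and the temporary path are hypothetical names for illustration:

```python
import os
import tempfile

txt = u'his$\u2026\n'


def write_utf8(path, text):
    """Encode explicitly and write raw bytes (hypothetical helper)."""
    with open(path, 'wb') as fp:
        fp.write(text.encode('utf-8'))


def read_utf8(path):
    """Read raw bytes and decode explicitly (hypothetical helper)."""
    with open(path, 'rb') as fp:
        return fp.read().decode('utf-8')


path = os.path.join(tempfile.mkdtemp(), 'test.bin')
write_utf8(path, txt)
decoded = read_utf8(path)
size = os.path.getsize(path)  # U+2026 takes 3 bytes in UTF-8
```

Because binary mode does no newline translation, the bytes on disk are the same on every platform, which is consistent with the matching md5sum output above.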
Zosima answered 7/4, 2018 at 0:41 Comment(0)
