What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode, especially when the text file in question may contain non-ASCII characters?
This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.
In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python will parse the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() will give you a str. In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.
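A minimal Python 3 sketch of that difference. The sample text and the temporary file are made up for illustration, and the encoding is passed explicitly so the result does not depend on the platform default:

```python
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters.
sample = "naïve café\n"

# Write the sample as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Text mode: bytes are decoded with the given encoding; read() returns str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Binary mode: no decoding at all; read() returns the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(text).__name__, len(text))  # str, counted in characters
print(type(raw).__name__, len(raw))    # bytes, counted in bytes
```

In text mode you can slice, search and compare by character; in binary mode you would have to call raw.decode("utf-8") yourself before doing any of that safely.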
Also, in Python 3, universal newlines handling (translating between '\n' and platform-specific newline conventions so you don't have to care about them) is available for text-mode files on any platform, not just Windows.
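As a rough illustration of universal newlines on read in Python 3, the sketch below writes a made-up file that mixes all three line-ending conventions and reads it back twice. The newline="" variant is included only for contrast; it keeps the endings untranslated:

```python
import os
import tempfile

# Bytes mixing the three newline conventions: Windows, old Mac, Unix.
data = b"one\r\ntwo\rthree\n"

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Default text mode (newline=None): every line ending becomes "\n".
    with open(path, "r", encoding="ascii") as f:
        translated = f.readlines()

    # newline="" still recognizes all three endings when splitting lines,
    # but returns them untranslated.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

print(translated)    # ['one\n', 'two\n', 'three\n']
print(untranslated)  # ['one\r\n', 'two\r', 'three\n']
```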
... sys.getdefaultencoding(). On my Py3 install, it's UTF-8, but you can't rely on that always being the case. – Machute
The default encoding used by open is given by locale.getpreferredencoding(), not sys.getdefaultencoding(). On my system (Windows with Python 3.10), the former is 'cp1252', while the latter is 'utf-8'. – Deemster
From the documentation:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n in Unix, \r in Mac versions prior to OS X, \r\n in Windows. When a file opened in text mode is read, Python replaces the OS-specific end-of-line sequence read from the file with just \n. And vice versa: when you write \n to a file opened in text mode, Python writes the OS-specific EOL sequence. You can find your OS's default EOL by checking os.linesep.
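The write direction can be checked with a short Python 3 sketch like the one below. The temporary file is an assumption for illustration, and the second write passes newline="\n" only to show what the bytes look like when translation is switched off:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Default text mode: each "\n" written is replaced by os.linesep.
    with open(path, "w", encoding="ascii") as f:
        f.write("a\nb\n")
    with open(path, "rb") as f:
        default_bytes = f.read()

    # newline="\n" turns the translation off, so "\n" is written as-is.
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("a\nb\n")
    with open(path, "rb") as f:
        untranslated_bytes = f.read()
finally:
    os.remove(path)

print(repr(os.linesep))    # '\n' on Unix-like systems, '\r\n' on Windows
print(default_bytes)       # contains os.linesep at each line end
print(untranslated_bytes)  # contains plain "\n" on every platform
```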
When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you’ll end up messing up the file by introducing (or removing) some bytes.
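To see how text mode can damage binary data, here is a small Python 3 sketch that uses the first eight bytes of a PNG file as sample data. The latin-1 encoding is an assumption chosen only because it can decode any byte; with an encoding such as UTF-8 the text-mode read would most likely fail with a UnicodeDecodeError instead:

```python
import os
import tempfile

# First eight bytes of a PNG file; note the embedded "\r\n".
original = b"\x89PNG\r\n\x1a\n"

fd, path = tempfile.mkstemp()
os.close(fd)

try:
    with open(path, "wb") as f:
        f.write(original)

    # Binary mode returns exactly what is on disk.
    with open(path, "rb") as f:
        intact = f.read()

    # Text mode: newline translation is applied while reading, so turning
    # the text back into bytes no longer reproduces the original.
    with open(path, "r", encoding="latin-1") as f:
        damaged = f.read().encode("latin-1")
finally:
    os.remove(path)

print(intact == original)   # True
print(damaged == original)  # False: "\r\n" was collapsed to "\n"
print(len(original), len(damaged))
```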
Python also has a universal newline mode. When a file is opened in this mode, Python maps all of \r, \n and \r\n to \n.
For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):
In Python 2 on non-Windows platforms, no line-end modification happens in either text or binary mode. As has been stated before, in Python 2 Chris Drappier's answer applies (please note that its link nowadays points to the 3.x Python docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).
So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line end modification:
0 $ cat data.txt
line1
line2
line3
0 $ file data.txt
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
It is, however, possible to open the file in universal newline mode in Python 2, which performs exactly this line-end conversion:
0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']
(the 'U' universal newline mode specifier is deprecated in Python 3 and was removed in Python 3.11)
On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a Dos/Win CRLF-line-ended file on Linux will normalize the line ends to '\n'.