Difference between parsing a text file in r and rb mode
K

4

72

What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode? Especially when the text file in question may contain non-ASCII characters.

Kaplan answered 10/3, 2012 at 5:13 Comment(2)
Are you reading a text file or a binary file?Meggs
A text file. But for whatever reason I am given the file as a byte-stream.Kaplan
M
83

This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.

In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python decodes the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() gives you a str. In binary mode ('rb'), Python does not assume that the file contains anything that can reasonably be decoded as characters, and read() gives you a bytes object.
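As a minimal Python 3 sketch of that difference (the file name and its contents are made up for illustration), the same file can be read both ways and the results compared:

```python
import os
import tempfile

# Create a small UTF-8 file containing a non-ASCII character.
# The path and contents are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("café\n".encode("utf-8"))

# Text mode: the bytes are decoded with the encoding passed in,
# and read() returns a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: nothing is decoded, and read() returns bytes.
with open(path, "rb") as f:
    raw = f.read()

print(type(text).__name__, len(text))  # str, counted in characters
print(type(raw).__name__, len(raw))    # bytes, counted in bytes

# Decoding the bytes by hand recovers the same characters.
decoded = raw.decode("utf-8")
```

Passing encoding explicitly avoids depending on the platform default. If the file really is UTF-8, decoding the bytes by hand gives the same characters that text mode produces.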

Also, in Python 3, universal newlines (the translation between '\n' and platform-specific newline conventions, so you don't have to care about them) are available for text-mode files on any platform, not just Windows.
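A small Python 3 sketch of universal newlines on reading, assuming a throwaway file with deliberately mixed line endings (the file name is hypothetical):

```python
import os
import tempfile

# A file that mixes Unix, Windows and old Mac line endings.
path = os.path.join(tempfile.mkdtemp(), "mixed_eol.txt")
with open(path, "wb") as f:
    f.write(b"unix\nwindows\r\nold mac\rend")

# Text mode with the default newline handling: every line ending
# is reported as '\n', whatever the platform.
with open(path, "r", encoding="ascii") as f:
    text_lines = f.readlines()

# Binary mode: lines are split on b'\n' only and nothing is translated,
# so the lone '\r' does not end a line.
with open(path, "rb") as f:
    raw_lines = f.readlines()

print(text_lines)
print(raw_lines)
```

This is why reading lines in binary mode takes more care: the caller has to deal with each line-ending convention by hand.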

Machute answered 10/3, 2012 at 5:53 Comment(4)
for py3, will reading in text mode automatically try to detect what type of encoding it is? I imagine having to detect encoding is quite a challenge with a bytes object.Kaplan
@Keikoku Detecting encoding based on a stream alone, without any metadata, is impossible - think about the various encodings that are ASCII + use the 8th bit for information rather than parity; they all share 255 valid one-byte sequences, but only half of them (the ASCII half) represent the same character in each. Python's default isn't to guess it, it's a session-wide default encoding, spelled sys.getdefaultencoding(). On my Py3 install, it's UTF-8, but you can't rely on that always being the case.Machute
@Machute As far as I can tell, the default encoding used by open is given by locale.getpreferredencoding(), not sys.getdefaultencoding(). On my system (Windows with Python3.10), the former is 'cp1252', while the latter is 'utf-8'.Deemster
When I started reading this answer, I thought that the starting expression 'a little bit' was a joke :P Thank you for the explanation!Winger
N
22

From the documentation:

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

Nievelt answered 10/3, 2012 at 5:19 Comment(1)
So basically trying to read lines in binary mode is much more difficult because I'm not guaranteed that the EOL character is \n or \r\n or something else?Kaplan
B
13

The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n on Unix, \r on Mac versions prior to OS X, \r\n on Windows. When a file opened in text mode is read, Python replaces the OS-specific end-of-line sequence with just \n. The reverse happens on writing: when you write \n to a file opened in text mode, Python writes the OS-specific EOL sequence. You can find your OS's default EOL by checking os.linesep.

When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you’ll end up messing up the file by introducing (or removing) some bytes.
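The write side can be sketched the same way. This assumes Python 3 with the default newline setting; the file names and the sample bytes are made up for illustration:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()

# Text mode: '\n' is translated to os.linesep on the way out.
text_path = os.path.join(workdir, "out.txt")
with open(text_path, "w", encoding="utf-8") as f:
    f.write("hello\n")

with open(text_path, "rb") as f:
    on_disk = f.read()

expected = b"hello" + os.linesep.encode("ascii")
print(on_disk == expected)

# Binary mode: bytes go out and come back untouched, which is what
# image or video data needs. The payload mimics the start of a PNG file.
payload = b"\x89PNG\r\n\x1a\n"
bin_path = os.path.join(workdir, "out.bin")
with open(bin_path, "wb") as f:
    f.write(payload)

with open(bin_path, "rb") as f:
    round_trip = f.read()

print(round_trip == payload)
```

On Linux and macOS os.linesep is already '\n', so the translation is only visible on Windows, but the comparison holds on every platform.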

Python also has a universal newline mode. When a file is opened in this mode, Python maps all of the characters \r, \n and \r\n to \n.

Belligerency answered 1/7, 2015 at 3:58 Comment(1)
Is this true for both Python 2 and Python 3?Grasp
O
2

For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):

In Python 2 on non-Windows platforms, no line-end modification happens in either text or binary mode. As has been stated before, in Python 2 Chris Drappier's answer applies (please note that its link nowadays points to the Python 3.x docs, but Chris's quoted text is of course from the Python 2 input and output tutorial).

So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line end modification:

0 $ cat data.txt 
line1
line2
line3
0 $ file data.txt 
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']

It is, however, possible to open the file in universal newline mode in Python 2, which performs exactly that line-end normalization:

0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']

(the 'U' mode specifier is deprecated in Python 3, where text mode applies universal newlines by default; it was removed entirely in Python 3.11)

On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a Dos/Win CRLF-line-ended file on Linux will normalize the line ends to '\n'.
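The Python 3 normalization can be seen, and switched off, with the newline parameter of open(). The sketch below assumes a throwaway CRLF file (the file name is hypothetical) and should give the same result on any platform:

```python
import os
import tempfile

# A file with Windows-style CRLF line endings.
path = os.path.join(tempfile.mkdtemp(), "crlf.txt")
with open(path, "wb") as f:
    f.write(b"line1\r\nline2\r\n")

# Default (newline=None): line endings are normalized to '\n'.
with open(path, "r", encoding="ascii") as f:
    normalized = f.readlines()

# newline='': lines are still split at every kind of line ending,
# but the endings are returned untranslated.
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.readlines()

print(normalized)
print(untranslated)
```

Using newline='' is one way to keep str decoding while preserving the original line endings, for example when a file has to be written back unchanged.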

Oleviaolfaction answered 25/7, 2017 at 7:50 Comment(1)
Python3's open function has a newline parameter to control that if required docs.python.org/3/library/functions.html#open "newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows: When reading input from the stream, if newline is None, universal newlines mode is enabled"Maggs
