"TypeError: a bytes-like object is required, not 'str'" when handling file content in Python 3
F

11

901

I've very recently migrated to Python 3.5. This code was working properly in Python 2.7:

with open(fname, 'rb') as f:
    lines = [x.strip() for x in f.readlines()]

for line in lines:
    tmp = line.strip().lower()
    if 'some-pattern' in tmp: continue
    # ... code

But in 3.5, on the if 'some-pattern' in tmp: continue line, I get an error which says:

TypeError: a bytes-like object is required, not 'str'

I was unable to fix the problem using .decode() on either side of the in, nor could I fix it using

    if tmp.find('some-pattern') != -1: continue

What is wrong, and how do I fix it?

Foti answered 10/10, 2015 at 13:28 Comment(7)
Why are you opening the file in binary mode but treat it as text?Implicatory
@MartijnPieters thanks for spotting the file open mode! Changing it to text-mode solved the issue... the code had worked reliably in Py2k for many years though...Foti
@Foti see: python.org/dev/peps/pep-0404/#strings-and-bytesLassitude
I am encountering this too where I have a requests result = requests.get and I attempt to x = result.content.split("\n"). I am a little confused by the error message because it seems to imply that result.content is a string and .split() is requiring a bytes-like object..?? ( "a bytes-like object is required, not 'str"')..Antisocial
Martjin is right, you should change the 'rb' option to 'r' to treat the file as a stringSurpass
Here is another example (which led me here) with the exact same symptom that worked in Python 2: effectively os.write(self.FILE, ":STOP");, after self.FILE = os.open("/dev/usbtmc0", os.O_RDWR) (given the particular hardware is connected through USB, with the right permissions, etc.)Present
"b" in front of the string, as one of the answers suggests, makes it work: os.write(self.FILE, b":STOP");. Though it would be better if the why was included here.Present
I
795

You opened the file in binary mode:

with open(fname, 'rb') as f:

This means that all data read from the file is returned as bytes objects, not str. You cannot then use a string in a containment test:

if 'some-pattern' in tmp: continue

You'd have to use a bytes object to test against tmp instead:

if b'some-pattern' in tmp: continue

or open the file as a text file instead by replacing the 'rb' mode with 'r'.
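A minimal sketch of both fixes side by side. The temporary file and its contents are made up for illustration, and UTF-8 is an assumed encoding for the text-mode variant:

```python
import os
import tempfile

# Hypothetical sample data standing in for the asker's file.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp_file:
    tmp_file.write(b'Keep me\nSOME-PATTERN here\n')
    fname = tmp_file.name

# Fix 1: stay in binary mode and compare bytes with bytes.
with open(fname, 'rb') as f:
    kept_bytes = [line.strip() for line in f
                  if b'some-pattern' not in line.lower()]

# Fix 2: open in text mode and compare str with str.
with open(fname, 'r', encoding='utf-8') as f:
    kept_text = [line.strip() for line in f
                 if 'some-pattern' not in line.lower()]

os.remove(fname)

print(kept_bytes)  # a list of bytes objects
print(kept_text)   # a list of str objects
```

Either variant avoids the TypeError; which one is appropriate depends on whether the file really contains text.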

Implicatory answered 10/10, 2015 at 13:30 Comment(6)
If you peek at the various documents that ppl have linked to, you'll see that everything "worked" in Py2 because default strings were bytes whereas in Py3, default strings are Unicode, meaning that any time you're doing I/O, esp. networking, byte strings are the standard, so you must learn to move b/w Unicode & bytes strings (en/decode). For files, we now have "r" vs. "rb" (and for 'w' & 'a') to help differentiate.Hatley
@wescpy: Python 2 has 'r' vs 'rb' too, switching between binary and text file behaviours (like translating newlines and on certain platforms, how the EOF marker is treated). That the io library (providing the default I/O functionality in Python 3 but also available in Python 2) now also decodes text files by default is the real change.Implicatory
@MartijnPieters: Yes, agreed. In 2.x, I only used the 'b' flag when having to work with binary files on DOS/Windows (as binary is the POSIX default). It's good that there is a dual purpose when using io in 3.x for file access.Hatley
r does not work with zipfile 's .open(). Example: def get_aoi1(zip): z = zipfile.ZipFile(zip) for f in z.namelist(): with z.open(f, 'r') as rptf: for l in rptf.readlines(): if l.find("$$") != -1: return l.split('=') else: return print(l) test = get_aoi1('testZip.zip')Obstipation
@Obstipation ZipFile.open() docs explicitly state that only binary mode is supported (Access a member of the archive as a binary file-like object). You can wrap the file object in io.TextIOWrapper() to achieve the same effect.Implicatory
@Obstipation also, don’t use .readlines() when you can iterate over the file object directly. Especially when you only need info from a single line. Why read everything into memory when that info could be found in the first buffered block?Implicatory
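The io.TextIOWrapper suggestion from the comments above could look roughly like this. The archive is built in memory, and the member name, its contents, the function name find_marked_line, and the UTF-8 encoding are all assumptions made for the sake of a runnable sketch:

```python
import io
import zipfile

# Build a small archive in memory; member name and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('report.txt', 'header\nkey=$$value\n')


def find_marked_line(zip_source, marker='$$'):
    """Return the first line containing marker, split on '=', or None."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # ZipFile.open() only yields a binary file-like object ...
            with archive.open(name, 'r') as raw:
                # ... so wrap it to get decoded str lines instead of bytes.
                text = io.TextIOWrapper(raw, encoding='utf-8')
                # Iterate lazily instead of calling readlines().
                for line in text:
                    if marker in line:
                        return line.strip().split('=')
    return None


result = find_marked_line(buffer)
print(result)
```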
H
343

You can encode your string by using .encode()

Example:

'Hello World'.encode()

As the error describes, a file opened in binary mode only accepts bytes-like objects, so a string has to be encoded before it can be written there; encode() converts it to a bytes object.
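A small sketch of the difference, using in-memory streams (io.BytesIO and io.StringIO) as stand-ins for files opened in binary and text mode:

```python
import io

text = 'Hello World'
data = text.encode()           # str -> bytes, UTF-8 by default

binary_stream = io.BytesIO()   # behaves like a file opened with 'wb'
binary_stream.write(data)      # bytes are accepted

str_rejected = False
try:
    binary_stream.write(text)  # str is rejected by a binary stream
except TypeError as exc:
    str_rejected = True
    print('TypeError:', exc)

text_stream = io.StringIO()    # behaves like a file opened with 'w'
text_stream.write(text)        # str is accepted, no encode() needed

print(data.decode() == text)   # decoding reverses the encoding
```

So encode() is only needed when the destination expects bytes; a file opened in text mode takes the string as it is.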

Hematuria answered 22/5, 2016 at 16:17 Comment(4)
This comment was quite useful in the context of using fd.subprocess.Popen(); fd.communicate(...);.Comenius
If concatenation to a string is needed afterwards (TypeError: can only concatenate str (not "bytes") to str) : "Hello "+("World".encode()).decode() (same with join() obviously).Abbreviated
Why does that work?Present
You cannot write a string to a file opened in binary mode; you need to encode the string to a bytes-like object first. Calling the encode() method of a string returns the encoded version of it, using UTF-8 unless another encoding is given.Hematuria
I
75

As has already been mentioned, you are reading the file in binary mode and then creating a list of bytes objects. In the for loop that follows, you test a string against bytes, and that is where the code fails.

Decoding the bytes while adding to the list should work. The changed code should look as follows:

with open(fname, 'rb') as f:
    lines = [x.decode('utf8').strip() for x in f.readlines()]

A separate bytes type was introduced in Python 3, and that is why your code worked in Python 2. In Python 2, bytes was only an alias for str, so the literal and the file data had the same type:

>>> s=bytes('hello')
>>> type(s)
<type 'str'>
Injection answered 17/5, 2016 at 2:15 Comment(1)
Python 2 does indeed have a type for bytes, it's just confusingly called str while the type for text strings is called unicode. In Python 3 they changed the meaning of str so that it was the same as the old unicode type, and renamed the old str to bytes. They also removed a bunch of cases where it would automatically try to convert from one to the other.Pharmacognosy
N
34

You have to change from wb to w:

def __init__(self):
    self.myCsv = csv.writer(open('Item.csv', 'wb')) 
    self.myCsv.writerow(['title', 'link'])

to

def __init__(self):
    # Text mode; the csv documentation recommends newline='' for csv files
    self.myCsv = csv.writer(open('Item.csv', 'w', newline=''))
    self.myCsv.writerow(['title', 'link'])

After changing this, the error disappears, but you can't write to the file (in my case). So after all, I don't have an answer?

Source: How to remove ^M

Changing to 'rb' brings me the other error: io.UnsupportedOperation: write
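For reference, a minimal sketch of writing and reading a CSV file in Python 3. The temporary directory, the second row, and the UTF-8 encoding are assumptions for illustration; newline='' follows the recommendation in the csv module documentation:

```python
import csv
import os
import tempfile

# Hypothetical location for the file from the answer above.
path = os.path.join(tempfile.mkdtemp(), 'Item.csv')

# Text mode ('w', not 'wb'); the with block closes and flushes the file.
with open(path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['title', 'link'])
    writer.writerow(['Example', 'https://example.com'])

# Reading back also uses text mode ('r', not 'rb').
with open(path, 'r', newline='', encoding='utf-8') as csv_file:
    rows = list(csv.reader(csv_file))

print(rows)
```

Opening the file with 'rb' and then writing fails with io.UnsupportedOperation because 'r' modes are read-only, regardless of the 'b' flag.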

Nilgai answered 28/4, 2017 at 14:38 Comment(1)
Why does that work? An explanation would be in order. (But without "Edit:", "Update:", or similar - the answer should appear as if it was written today.)Present
T
25

For this small example, just adding a b before 'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n' solved my problem:

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data)

mysock.close()

What does the 'b' character do in front of a string literal?
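As for why the b prefix helps: sockets transmit raw bytes, so send() and sendall() reject str. A sketch that shows this without touching the network, using socket.socketpair() to create two locally connected sockets (the request line is made up and only loosely modelled on the one above):

```python
import socket

# Two connected sockets on the local machine; no server is contacted.
left, right = socket.socketpair()
str_error = None
try:
    request = 'GET /code/romeo.txt HTTP/1.0\r\n\r\n'

    try:
        left.sendall(request)              # str is rejected
    except TypeError as exc:
        str_error = exc

    left.sendall(request.encode('ascii'))  # bytes are accepted
    received = right.recv(512)             # recv() returns bytes as well
    reply_text = received.decode('ascii')  # decode before treating as text
finally:
    left.close()
    right.close()

print(type(str_error).__name__)
print(type(received).__name__)
```

Writing b'...' produces the bytes object directly, which is equivalent to encoding an ASCII-only string literal.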

Teatime answered 22/3, 2016 at 11:59 Comment(1)
Why does it work? The OP has left the building ("Last seen more than 5 years ago"), so perhaps somebody else can chime in?Present
S
25

Call the encode() method on the hard-coded string literal, so that it becomes bytes like the data it is combined with.

Example:

file.write(answers[i] + '\n'.encode())

Or

line.split(' +++$+++ '.encode())
Sarco answered 20/4, 2019 at 8:47 Comment(1)
Why does that work?Present
M
16

You opened the file in binary mode, so every line is a bytes object.

The following code will throw a TypeError: a bytes-like object is required, not 'str'.

for line in lines:
    print(type(line))# <class 'bytes'>
    if 'substring' in line:
       print('success')

The following code will work - you have to use the decode() function:

for line in lines:
    line = line.decode()
    print(type(line))# <class 'str'>
    if 'substring' in line:
       print('success')
Mcgannon answered 16/5, 2018 at 7:23 Comment(0)
M
10

Try opening your file as text:

with open(fname, 'rt') as f:
    lines = [x.strip() for x in f.readlines()]

Additionally, here is a link for Python 3.x on the official page: io — Core tools for working with streams.

And this is the open function: open

If you really do need to handle the file as binary, consider encoding your string instead.

Mcminn answered 1/12, 2017 at 12:22 Comment(0)
L
6

Summary

Python 2.x encouraged many bad habits WRT text handling. In particular, its type named str does not actually represent text per the Unicode standard (that type is unicode), and the default "string literal" in fact produces a sequence of raw bytes - with some convenience functions for treating it like a string, if you can get away with assuming a "code page" style encoding.

In 3.x, "string literals" now produce actual strings, and built-in functionality no longer does any implicit conversions between the two types. Thus, the same code now has a TypeError, because the literal and the variable are of incompatible types. To fix the problem, one of the values must be either replaced or converted, so that the types match.

The Python documentation has an extremely detailed guide to working with Unicode properly.

In the example in the question, the input file is processed as if it contains text. Therefore, the file should have been opened in a text mode in the first place. The only good reason the file would have been opened in binary mode even in 2.x is to avoid universal newline translation; in 3.x, this is done by specifying the newline keyword parameter when opening a file in text mode.

To read a file as text properly requires knowing a text encoding, which is specified in the code by (string) name. The encoding iso-8859-1 is a safe fallback; it interprets each byte separately, as representing one of the first 256 Unicode code points, in order (so it will never raise an exception due to invalid data). utf-8 is much more common as of the time of writing, but it does not accept arbitrary data. (However, in many cases, for English text, the distinction will not matter; both of those encodings, and many more, are supersets of ASCII.)

Thus:

with open(fname, 'r', newline='\n', encoding='iso-8859-1') as f:
    lines = [x.strip() for x in f.readlines()]

# proceed as before
# If the results are wrong, take additional steps to ascertain the correct encoding
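The difference between the two encodings can be checked directly. This is only a sketch with synthetic data (every possible byte value once):

```python
raw = bytes(range(256))  # every possible byte value, 0..255

# iso-8859-1 maps byte N to code point N, so decoding cannot fail.
as_latin1 = raw.decode('iso-8859-1')
latin1_is_identity = [ord(ch) for ch in as_latin1] == list(range(256))

# utf-8 rejects byte sequences that are not valid UTF-8.
try:
    raw.decode('utf-8')
    utf8_accepts_everything = True
except UnicodeDecodeError:
    utf8_accepts_everything = False

# For pure ASCII data, both encodings give the same text.
ascii_only = b'some-pattern'
same_for_ascii = ascii_only.decode('utf-8') == ascii_only.decode('iso-8859-1')

print(latin1_is_identity, utf8_accepts_everything, same_for_ascii)
```

Note that "never raises" is not the same as "correct": decoding UTF-8 data as iso-8859-1 succeeds but yields the wrong characters for anything outside ASCII.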

How the error is created when migrating from 2.x to 3.x

In 2.x, 'some-pattern' creates a str, i.e. a sequence of bytes that the programmer is then likely to pretend is text. The str type is the same as the bytes type, and different from the unicode type that properly represents text. Many methods are offered to treat this data as if it were text, but it is not a proper representation of text. The meaning of each value as a text character (the encoding) is assumed. (In order to enable the illusion of raw data as "text", there would sometimes be implicit conversions between the str and unicode types. However, this results in confusing errors of its own - such as getting UnicodeDecodeError from an attempt to encode, or vice-versa).

In 3.x, 'some-pattern' creates what is also called a str; but now str means the Unicode-using, properly-text-representing string type. (unicode is no longer used as a type name, and only bytes refers to the sequence-of-bytes type.) Some changes were made to bytes to dissociate it from the text-with-assumed-encoding interpretation (in particular, indexing into a bytes object now results in an int, rather than a 1-element bytes), but many strange legacy methods persist (including ones rarely used even with actual strings any more, like zfill).
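A short illustration of the 3.x bytes behaviour described above (the sample values are arbitrary):

```python
data = b'abc'

first_item = data[0]      # indexing gives an int in 3.x
first_slice = data[0:1]   # slicing still gives a bytes object
as_ints = list(data)      # iterating yields ints as well

# Going back from ints to bytes needs an iterable of ints.
rebuilt = bytes([97, 98, 99])

# A legacy text-like method that still exists on bytes.
padded = b'7'.zfill(3)

print(first_item, first_slice, as_ints, rebuilt, padded)
```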

Why this causes a problem

The data, tmp, is a bytes instance. It came from a binary source: in this case, a file opened with a 'b' file mode. In other cases, it could come from a raw network socket, a web request made with urllib or similar, or some other API call.

This means that it cannot do anything meaningful in combination with a string. The elements of a string are Unicode code points (i.e., abstractions that represent, for the most part, text characters, in a universal form that represents all world languages and many other symbols). The elements of a bytes are, well, bytes. (Specifically in 3.x, they are interpreted as unsigned integers ranging from 0 to 255 inclusive.)

When the code was migrated, the literal 'some-pattern' went from describing a bytes, to describing text. Thus, the code went from making a legal comparison (byte-sequence to byte-sequence), to making an illegal one (string to byte-sequence).

Fixing the problem

In order to operate on a string and a byte-sequence - whether it's checking for equality with ==, lexicographic comparison with <, substring search with in, concatenation with +, or anything else - either the string must be converted to a byte-sequence, or vice-versa. In general, only one of these will be the correct, sensible answer, and it will depend on the context.
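A sketch of what each of those operations does in 3.x when the types are mixed, followed by the two ways of making them match. The sample data and the helper name describe are made up, and UTF-8 is an assumed encoding:

```python
data = b'some-pattern in a line'
pattern = 'some-pattern'


def describe(operation):
    """Run operation and return its result, or the name of the TypeError."""
    try:
        return operation()
    except TypeError as exc:
        return type(exc).__name__


mixed = {
    '==': describe(lambda: data == pattern),  # no error, but never equal
    '<': describe(lambda: data < pattern),
    'in': describe(lambda: pattern in data),
    '+': describe(lambda: data + pattern),
}
print(mixed)

# Converting one side so that the types match:
found_as_bytes = pattern.encode('utf-8') in data
found_as_text = pattern in data.decode('utf-8')
print(found_as_bytes, found_as_text)
```

Equality is the odd one out: it does not raise, it simply reports that a str and a bytes object are never equal, which can hide the bug.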

Fixing the source

Sometimes, one of the values can be seen to be "wrong" in the first place. For example, if reading the file was intended to result in text, then it should have been opened in a text mode. In 3.x, the file encoding can simply be passed as an encoding keyword argument to open, and conversion to Unicode is handled seamlessly without having to feed a binary file to an explicit translation step (thus, universal newline handling still takes place seamlessly).

In the case of the original example, that could look like:

with open(fname, 'r') as f:
    lines = [x.strip() for x in f.readlines()]

This example assumes a platform-dependent default encoding for the file. This will normally work for files that were created in straightforward ways, on the same computer. In the general case, however, the encoding of the data must be known in order to work with it properly.

If the encoding is known to be, for example, UTF-8, that is trivially specified:

with open(fname, 'r', encoding='utf-8') as f:
    lines = [x.strip() for x in f.readlines()]

Similarly, a string literal that should have been a bytes literal is simply missing a prefix: to make the bytes sequence representing integer values [101, 120, 97, 109, 112, 108, 101] (i.e., the ASCII values of the letters example), write the bytes literal b'example', rather than the string literal 'example'. The reverse applies as well: if text was intended, drop the b prefix.

In the case of the original example, that would look like:

if b'some-pattern' in tmp:

There is a safeguard built in to this: the bytes literal syntax only allows ASCII characters, so something like b'ëxãmþlê' will be caught as a SyntaxError, regardless of the encoding of the source file (since it is not clear which byte values are meant; in the old implied-encoding schemes, the ASCII range was well established, but everything else was up in the air.) Of course, bytes literals with elements representing values 128..255 can still be written by using \x escaping for those values: for example, b'\xebx\xe3m\xfel\xea' will produce a byte-sequence corresponding to the text ëxãmþlê in Latin-1 (ISO 8859-1) encoding.
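These claims can be verified with a few lines. compile() is used here only so that the invalid literal can be shown without breaking the example itself:

```python
# ASCII letters map to their ASCII values.
example_values = list(b'example')

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile("b'ë'", '<demo>', 'eval')
    literal_rejected = False
except SyntaxError:
    literal_rejected = True

# \x escapes express the byte values 128..255 explicitly.
escaped = b'\xebx\xe3m\xfel\xea'
text = 'ëxãmþlê'
latin1_matches = escaped.decode('iso-8859-1') == text

# The same text occupies different bytes in UTF-8.
utf8_length = len(text.encode('utf-8'))

print(example_values, literal_rejected, latin1_matches, utf8_length)
```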

Converting, when appropriate

Conversion between byte-sequences and text is only possible when an encoding has been determined. It has always been so; we just used to assume an encoding locally, and then mostly ignore that we had done so. (Programmers in places like East Asia have been more aware of the problem historically, because they commonly need to work with scripts that have more than 256 distinct symbols, and thus their text requires multi-byte encodings.)

In 3.x, because there is no pressure to be able to treat byte-sequences implicitly as text with an assumed encoding, there are therefore no implicit conversion steps behind the scenes. This means that understanding the API is straightforward: Bytes are raw data; therefore, they are used to encode text, which is an abstraction. Therefore, the .encode() method is provided by str (which represents text), in order to encode text into raw data. Similarly, the .decode() method is provided by bytes (which represents a byte-sequence), in order to decode raw data into text.

Applying these to the example code, again supposing UTF-8 encoding is appropriate, gives:

if 'some-pattern'.encode('utf-8') in tmp:

and

if 'some-pattern' in tmp.decode('utf-8'):
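Put into the shape of the original loop, the two conversions might look like this. The sample lines are made up and UTF-8 is assumed:

```python
# Hypothetical lines as they would come from a file opened with 'rb'.
lines = [b'Keep me\n', b'SOME-PATTERN here\n', b'caf\xc3\xa9 menu\n']
pattern = 'some-pattern'

# Option 1: encode the pattern once and keep working with bytes.
pattern_bytes = pattern.encode('utf-8')
via_encode = [ln for ln in lines
              if pattern_bytes not in ln.strip().lower()]

# Option 2: decode each line and keep working with str.
decoded = [ln.decode('utf-8').strip() for ln in lines]
via_decode = [ln for ln in decoded if pattern not in ln.lower()]

print(via_encode)
print(via_decode)
```

One difference worth knowing: bytes.lower() only changes ASCII letters, whereas str.lower() handles non-ASCII letters too, so decoding is the safer choice when the data is text.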
Levator answered 29/1, 2023 at 22:7 Comment(0)
G
5

I got this error when I was trying to convert a char (or string) to bytes, the code was something like this with Python 2.7:

# -*- coding: utf-8 -*-
print(bytes('ò'))

This is how Python 2.7 handles it, because there bytes is the same type as str.

This won't work with Python 3.6, since bytes requires an extra argument for the encoding when it is given a string. This can be a little tricky, since different encodings may produce different results:

print(bytes('ò', 'iso_8859_1')) # prints: b'\xf2'
print(bytes('ò', 'utf-8')) # prints: b'\xc3\xb2'

In my case I had to use iso_8859_1 when encoding bytes in order to solve the issue.

Guerdon answered 5/5, 2020 at 13:56 Comment(1)
Note that the coding comment at the top of the file doesn't affect the way bytes or encode works, it only changes the way characters in your Python source are interpreted.Pharmacognosy
W
2

This particular error sometimes shows up when scraping data from a webpage. In particular, if you are using the requests library to get data, .content returns a bytes object while .text returns a string. So if you want to read the contents as a string (and do string operations such as in, .split() etc.), use .text instead.

import requests
x = requests.get('https://w3schools.com/python/demopage.htm')

'Page' in x.content     # <---- TypeError: a bytes-like object is required, not 'str'
'Page' in x.text        # <---- OK

type(x.content)         # <class 'bytes'>
type(x.text)            # <class 'str'>

In the standard urllib module, data can only be returned as a bytes object, in which case the decode() method can be used to convert the bytes object into a string.

from urllib import request
y = request.urlopen('https://w3schools.com/python/demopage.htm')
z = y.read()
print(type(z))                 # <class 'bytes'>

decoded_z = z.decode('utf-8')  # <---- convert to string

'Page' in z                    # <----- TypeError
'Page' in decoded_z            # <----- OK
Welford answered 1/9, 2023 at 20:17 Comment(0)
