Best way to convert string to bytes in Python 3? [closed]

Asked 28/9, 2011 at 15:14 Answered 25/3, 2022 at 17:28

Solved python string character-encoding python-3.x

1478

TypeError: 'str' does not support the buffer interface suggests two possible methods to convert a string to bytes:

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

Which method is more Pythonic?

See Convert bytes to a string for the other way around.

Haldeman answered 28/9, 2011 at 15:14 Comment(11)

Using encode/decode is more common, and perhaps clearer. – Proposition 29/9, 2011 at 7:39

@LennartRegebro I disagree. Even if it's more common, when I read "bytes()" I know what it's doing, while encode() doesn't make me feel it is encoding to bytes. – Hesler 23/4, 2017 at 5:42

@erm3nda Which is a good reason to use it until it does feel like that, then you are one step closer to Unicode zen. – Proposition 24/4, 2017 at 19:26

@LennartRegebro I feel good enough just using bytes(item, "utf8"), as explicit is better than implicit, so... str.encode() defaults silently to bytes, making you more Unicode-zen but less Explicit-Zen. Also, "common" is not a term that I like to follow. And bytes(item, "utf8") is more like the str() and b"string" notations. My apologies if I am too much of a noob to understand your reasons. Thank you. – Hesler 24/4, 2017 at 22:56

@erm3nda if you read the accepted answer you can see that encode() doesn't call bytes(), it's the other way around. Of course that's not immediately obvious which is why I asked the question. – Haldeman 24/4, 2017 at 23:3

Doh, sorry. Anyway, what I said also applies to some_string.encode(encoding), for example "string".encode("utf8"), which returns type bytes. For me, using the term bytes() makes much more sense. I tend to think that encode/decode is more charset related than data type related. Again, I may be too much of a noob to think like that... but I love explicit, and there is no reference to "bytes" in "some".encode("utf8"). Thank you, I've checked that str.encode() just doesn't default to anything. – Hesler 24/4, 2017 at 23:56

@erm3nda Doesn't the very meaning of the word encode in the context of text include "to bytes", because encoding text is the taking of abstract text data and turning it into some actual concrete byte representation? – Serafina 28/6, 2018 at 21:1

Encode and decode are always preferred as chaining is easier to read than nesting. e.g. ebcdic=passed.decode('utf-8').encode('ibm500') – Lindie 8/5, 2019 at 22:57

'utf-8' is the default, so the simplest answer is b = mystring.encode() – Crapulous 26/9, 2019 at 18:37

It's a little disturbing that existing answers here don't seem to talk about the importance of choosing and knowing an encoding. – Porte 8/3, 2023 at 17:31

@KarlKnechtel that's a separate question that's been asked 100 times here. Difficult to answer for the general case too. – Haldeman 8/3, 2023 at 17:55

863

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.
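As a rough illustration of the source forms listed above, here is a minimal sketch. The variable names are made up for this example, and it uses bytes rather than bytearray on the assumption (stated in the docs) that the two constructors accept the same arguments:

```python
# Each call below exercises one of the documented "source" forms.
from_text = bytes("héllo", "utf-8")      # string: an encoding is required
from_size = bytes(3)                     # integer: that many null bytes
from_buffer = bytes(bytearray(b"abc"))   # object supporting the buffer protocol
from_ints = bytes([104, 105])            # iterable of integers in range(256)
empty = bytes()                          # no argument: zero-length result

print(from_text, from_size, from_buffer, from_ints, empty)

# A string without an encoding is rejected rather than guessed at.
try:
    bytes("héllo")
except TypeError as exc:
    print("TypeError:", exc)
```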

So bytes can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), where there is no explicit verb.

I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment -- unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
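A small sketch of both points, using an arbitrary sample string (the variable names are only for illustration): the two spellings give the same result, and decode() reverses encode() when the same encoding is used.

```python
text = "naïve café"

via_method = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")

# Both spellings produce the same bytes object.
assert via_method == via_constructor

# decode() is the inverse of encode() when the same encoding is used.
assert via_method.decode("utf-8") == text

print(via_method)
```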

Swett answered 28/9, 2011 at 15:27 Comment(10)

+1 for having a good argument and quotes from the python docs. Also unicode_string.encode(encoding) matches nicely with bytearray.decode(encoding) when you want your string back. – Krasnodar 28/9, 2011 at 15:30

bytearray is used when you need a mutable object. You don't need it for simple str↔bytes conversions. – Ultimately 28/9, 2011 at 15:41

@EugeneHomyakov This has nothing to do with bytearray except that the docs for bytes don't give details, they just say "this is an immutable version of bytearray" so I have to quote from there. – Swett 28/9, 2011 at 15:43

Just a cautionary note from Python in a Nutshell about bytes: Avoid using the bytes type as a function with an integer argument. In v2 this returns the integer converted to a (byte)string because bytes is an alias for str, while in v3 it returns a bytestring containing the given number of null characters. So, for example, instead of the v3 expression bytes(6), use the equivalent b'\x00'*6, which seamlessly works the same way in each version. – Contemporaneous 20/8, 2017 at 10:9

Just a note that if you are trying to convert binary data to a string, you'll most likely need to use something like byte_string.decode('latin-1'), as not every sequence of bytes in the range 0x00 to 0xFF (0-255) is valid UTF-8, whereas latin-1 maps every byte to a character; check out the Python docs for more info. – Neuromuscular 10/7, 2019 at 14:25

tl;dr would be helpful – Mayotte 11/12, 2019 at 7:46

Some examples with output would be helpful. – Selfforgetful 2/6, 2022 at 22:31

Feel free to add them :) – Chasse 31/8, 2022 at 9:44

@Neuromuscular the cases where you need to and can store binary data in a latin-1 string are very limited though. Why not just keep them as bytes? Latin-1 contains many control characters that can cause problems down the line. – Analytic 5/9, 2023 at 6:54

@Analytic this was 4 years ago, I don't really remember, but I think it had to do with displaying html data I received using an http request – Neuromuscular 6/9, 2023 at 12:20

706

It's easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
print(type(my_str_as_bytes)) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
print(type(my_decoded_str)) # ensure it is string representation

You can verify by printing the types. Refer to the output below.

<class 'bytes'>
<class 'str'>

Merlenemerlin answered 6/7, 2013 at 7:9 Comment(11)

He knows how to do it, he's just asking which way is better. Please re-read the question. – Swett 30/9, 2013 at 17:50

FYI: str.decode(bytes) didn't work for me (Python 3.3.3 said "type object 'str' has no attribute 'decode'") I used bytes.decode() instead – Impersonal 13/8, 2014 at 9:33

@Mike: use obj.method() syntax instead of cls.method(obj) syntax i.e., use bytestring = unicode_text.encode(encoding) and unicode_text = bytestring.decode(encoding). – Sofar 22/6, 2015 at 11:51

Mike and shenshin fixed the errors in the answer -- it is working now for py 3.6 – Labarbera 17/3, 2017 at 13:44

You should be very careful because encode creates bytes but the class will still be str; the bytes method creates the bytes class. – Selma 14/5, 2017 at 20:20

This answer looks more like a comment to me. How does this actually answer the question? – Miscreated 16/6, 2017 at 21:13

... i.e. you're needlessly making an unbound method, and then calling it passing the self as the first argument – Barcarole 11/4, 2018 at 7:41

@Swett who cares? it helps people who come to this page looking to perform this operation – Riana 1/5, 2018 at 21:56

@KolobCanyon The question already shows the right way to do it—call encode as a bound method on the string. This answer suggests that you should instead call the unbound method and pass it the string. That's the only new information in the answer, and it's wrong. – Smallsword 23/6, 2018 at 5:16

@Swett Even though it may seem that he/she knows (whatsoever), this is the most compact outline here. Thanks! – Cervantez 27/2, 2020 at 15:31

@Merlenemerlin you made my day! – Chieftain 25/8, 2021 at 14:41

282

The absolute best way is neither of the two, but a third. The first parameter to encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable over repeated runs; the deviation was just ~2 per cent.
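To repeat the comparison outside IPython, something like the following sketch with the standard timeit module should work. The loop count is an arbitrary choice, and the absolute numbers will depend on the machine and the Python build:

```python
import timeit

# Time both spellings on the same short ASCII string.
n = 200_000
explicit = timeit.timeit("s.encode('utf-8')", setup="s = 'abc'", number=n)
default = timeit.timeit("s.encode()", setup="s = 'abc'", number=n)

print(f"explicit 'utf-8': {explicit / n * 1e9:.0f} ns per call")
print(f"default argument: {default / n * 1e9:.0f} ns per call")

# Whatever the timings, the results are identical.
assert "abc".encode() == "abc".encode("utf-8")
```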

Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
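If the same line has to behave identically on both versions, one option is to name the encoding and start from a unicode string. This is only a sketch, and it assumes the source file is saved as UTF-8:

```python
# -*- coding: utf-8 -*-
# The u prefix is accepted by Python 2 and by Python 3.3+.
text = u'äöä'
data = text.encode('utf-8')   # explicit codec: no reliance on the default
print(repr(data))
```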

Karyotin answered 23/7, 2017 at 20:35 Comment(11)

There's only a sizable difference here because (a) the string is pure ASCII, meaning the internal storage is already the UTF-8 version, so looking up the codec is almost the only cost involved at all, and (b) the string is tiny, so even if you did have to encode, it wouldn't make much difference. Try it with, say, '\u00012345'*10000. Both take 28.8us on my laptop; the extra 50ns is presumably lost in the rounding error. Of course this is a pretty extreme example—but 'abc' is just as extreme in the opposite direction. – Smallsword 23/6, 2018 at 5:22

@Smallsword true, but even then, there is no reason pass the argument as a string. – Barcarole 23/6, 2018 at 7:19

According to this, the default arguments are always "absolutely the best way" to do things, right? This kind of speed analysis would feel like a probable exaggeration if this was about discussing C code. In an interpreted language, it leaves me speechless. – Katar 14/4, 2020 at 23:27

@Katar you win nothing by explicitly typing the default argument values - more keystrokes, larger code and it is slower too. – Barcarole 25/7, 2020 at 7:16

The Zen of Python declares that explicit is better than implicit, which means that an explicit 'utf-8' parameter is to be preferred. But you've definitely shown that leaving off the parameter is faster. That makes this a good answer, even if it isn't the best one. – Haldeman 7/11, 2020 at 4:36

@MarkRansom then how many times have you actually used int(s, 10) ;-) – Barcarole 7/5, 2021 at 20:32

Never, but the default for int has ALWAYS been 10. Not so for encode. – Haldeman 7/5, 2021 at 20:37

@MarkRansom it has always been the default in Python 3 :P And there are no other Pythons. – Barcarole 20/9, 2021 at 8:18

Despite Python 2 no longer being supported, I suspect there will be people dealing with some legacy code for a very long time to come; if for no other reason than to upgrade it to the latest version of Python! I'm glad you didn't remove your warning for Python 2 users at the end. – Haldeman 20/9, 2021 at 17:37

The saving of ~50ns is not a good reason to replace self-declarative code with ambiguous code. – Cenotaph 7/10, 2021 at 16:37

Encoding and decoding can mean many different things, but if you tell me you are encoding a string to utf-8, I would think we are talking about a binary representation of the string or some kind of encoding conversion. – Schottische 15/3, 2022 at 23:46

Answer for a slightly different problem:

You have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Solution:

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

Reference (scroll up for standard encodings):

Python Specific Encodings
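A minimal sketch of that conversion, checked against the expected bytes from above. The struct format string is a hypothetical layout picked only to show that the result can be unpacked; it is not taken from any real protocol:

```python
import struct

s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_bytes = s_str.encode("raw_unicode_escape")
assert s_bytes == b"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# For code points below 256, latin-1 gives the same bytes.
assert s_str.encode("latin-1") == s_bytes

# Hypothetical layout: two little-endian uint16, one uint32, one uint8
# (9 bytes in total, no padding because of the "<" prefix).
fields = struct.unpack("<HHIB", s_bytes)
print(fields)  # (256, 49152, 1, 4)

# Code points above 255 are where the two codecs differ:
# raw_unicode_escape writes an escape sequence, latin-1 raises an error.
print("€".encode("raw_unicode_escape"))  # b'\\u20ac'
```

Note that this only yields the original binary data when every character in the string is below U+0100.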

Orin answered 24/1, 2021 at 18:38 Comment(2)

This was actually just what I was looking for. I could not figure out how to better phrase my question. :) Thank you @Brent! – Nadinenadir 6/2, 2021 at 18:34

This was the answer I needed, coming from a google search of "python 3 convert str to bytes binary" this was the top result and looked promising. There are more interesting questions -- like how to convert a unicode string into a regular string (python 2.7) :p – Somnolent 9/2, 2021 at 14:1

How about the Python 3 'memoryview' way?

Memoryview is a sort of mishmash of the byte/bytearray and struct modules, with several benefits.

Not limited to just text and bytes, handles 16 and 32 bit words too
Copes with endianness
Provides a very low overhead interface to linked C/C++ functions and data

Simplest example, for a byte array:

memoryview(b"some bytes").tolist()

[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]

Or for a unicode string (which is converted to a byte array):

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

# Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

Perhaps you need words rather than bytes?

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()

[65279, 117, 110, 105, 99, 111, 100, 101, 32]

memoryview(b"some  more  data").cast("I").tolist()  # "I" is a 4-byte unsigned int; "L" is 8 bytes on many 64-bit platforms

[1701670771, 1869422624, 538994034, 1635017060]

Word of caution. Be careful of multiple interpretations of byte order with data of more than one byte:

txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
    mv = memoryview(bytes(txt, f"UTF-16{order}"))
    print(mv.cast("H").tolist())

[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]

Not sure if that's intentional or a bug but it caught me out!!
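This looks like expected behaviour rather than a bug: memoryview.cast() only accepts native formats, so "H" is read in the machine's own byte order (little-endian on the machine that produced the output above, judging by the BOM bytes 255, 254). A sketch of one way around it, using struct with the byte order spelled out:

```python
import struct
import sys

txt = "unicode "
expected = tuple(ord(ch) for ch in txt)
n = len(txt)

# Name the byte order explicitly instead of relying on the native one.
from_be = struct.unpack(f">{n}H", txt.encode("UTF-16BE"))
from_le = struct.unpack(f"<{n}H", txt.encode("UTF-16LE"))
assert from_be == from_le == expected

print(sys.byteorder)  # the order that memoryview.cast("H") uses
print(from_be)
```

With the order matched to the codec, both encodings give the same code units.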

The example used UTF-16; for a full list of codecs, see Codec registry in Python 3.10.

Compulsion answered 25/3, 2022 at 17:28 Comment(1)

All you're doing is adding another layer on top of what was suggested in the question. I can't see how that's useful at all. – Haldeman 25/3, 2022 at 17:36
