What does the 'b' character do in front of a string literal?
C

12

1409

Apparently, the following is the valid syntax:

b'The string'

I would like to know:

  1. What does this b character in front of the string mean?
  2. What are the effects of using it?
  3. What are appropriate situations to use it?

I found a related question right here on SO, but that question is about PHP. It states that the b indicates the string is binary, as opposed to Unicode, which was needed to keep code written for PHP < 6 compatible when migrating to PHP 6. I don't think this applies to Python.

I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.

Also, just out of curiosity, are there more symbols than the b and u that do other things?

Cyclograph answered 7/6, 2011 at 18:14 Comment(6)
For the curiosity part, since Python 3.6 there are f-strings, which are really useful. You can do: v = "world"; print(f"Hello {v}") and get "Hello world". Another example is f"{2 * 5}", which gives you "10". It is the way forward when working with strings.Antiscorbutic
f-Strings also have a handy debugging feature if you add an equals (=) sign after the variable but before the closing brace, so f'{v=}' would output "v=123" as the string, showing the name of whatever is being printed. Even for expressions, so f'{2*5=}' would print out "2*5=10"Protoxide
@Protoxide that feature was introduced in version 3.8Minimalist
For the curiosity part: stringprefix::= "r" | "u" | "R" | "U" | "f" | "F" | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF" bytesprefix::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" Documentation: String and Bytes literalsMinimalist
@Antiscorbutic this is the way…Chery
"r" prefix is commonly used with regular expressions to avoid having to use double-\ for escape sequences, which would make regex even more unreadable. docs.python.org/3/library/re.html#module-reMaun
R
543

To quote the Python 2.x documentation:

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

The Python 3 documentation states:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
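
As a small illustration of the quoted rule (a sketch, with variable names chosen only for this example): ASCII characters can be typed directly in a bytes literal, while anything at 128 or above has to be written as an escape. Typing a non-ASCII character directly is rejected when the code is compiled.

```python
# ASCII characters can be typed directly in a bytes literal.
ascii_bytes = b'cafe'

# Values of 128 or greater must be written as escapes.
# These two escapes are the UTF-8 encoding of the letter e with an acute accent.
escaped_bytes = b'caf\xc3\xa9'
text = escaped_bytes.decode('utf-8')

print(type(ascii_bytes).__name__)  # bytes
print(len(escaped_bytes))          # 5 bytes for 4 characters
print(text)

# A non-ASCII character typed directly in a bytes literal does not compile.
try:
    compile("b'caf\u00e9'", '<example>', 'eval')
    literal_rejected = False
except SyntaxError:
    literal_rejected = True
print(literal_rejected)
```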

Roselba answered 7/6, 2011 at 18:16 Comment(5)
So it sounds like Python < v3 will just ignore this extra character. What would be a case in v3 where you would need to use a b string as opposed to just a regular string?Cyclograph
@Gweebz - if you're actually typing out a string in a particular encoding instead of with unicode escapes (eg. b'\xff\xfe\xe12' instead of '\u32e1').Treasatreason
Actually, if you've imported unicode_literals from __future__, this will "reverse" the behavior for this particular string (in Python 2.x)Cogitation
A little more plain language narrative around the quoted documentation would make this a better answer IMHOSubmersed
"b is for bytes(/ASCII), as opposed to Unicode. In Python 3.x, strings are now Unicode by default." do we agree that suggested doc change is better? Also, that 3.x doc quote assumes you already know strings are now Unicode by default, without actually saying that. Also, 2.x is now ancient history, I'd move the 3.x quote above it (and mentions of 2to3 are pretty ancient too).Duhon
G
1209

Python 3.x makes a clear distinction between the types:

  • str = '...' literals = a sequence of characters. A “character” is a basic unit of text: a letter, digit, punctuation mark, symbol, space, or “control character” (like tab or backspace). The Unicode standard assigns each character to an integer code point between 0 and 0x10FFFF. (Well, more or less. Unicode includes ligatures and combining characters, so a string might not have the same number of code points as user-perceived characters.) Internally, str uses a flexible string representation that can use either 1, 2, or 4 bytes per code point.
  • bytes = b'...' literals = a sequence of bytes. A “byte” is the smallest integer type addressable on a computer, which is nearly universally an octet, or 8-bit unit, thus allowing numbers between 0 and 255.

If you're familiar with:

  • Java or C#, think of str as String and bytes as byte[];
  • SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
  • Windows registry, think of str as REG_SZ and bytes as REG_BINARY.

If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.

You use str when you want to represent text.

print('שלום עולם')

You use bytes when you want to represent low-level binary data like structs.

import struct
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
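
A minimal round-trip sketch of the same idea, using only the standard struct and math modules (the variable names are just for illustration): packing a float produces bytes, and unpacking those bytes gives the float back.

```python
import math
import struct

# Pack the float 1.5 as a big-endian IEEE 754 double: the result is 8 raw bytes.
packed = struct.pack('>d', 1.5)
print(packed)  # the first byte, 0x3f, is displayed as the ASCII character '?'

# Unpack the same 8 bytes back into a float.
value = struct.unpack('>d', packed)[0]
print(value)

# The byte pattern from the answer decodes to a NaN.
is_nan = math.isnan(struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0])
print(is_nan)
```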

You can encode a str to a bytes object.

>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'

And you can decode a bytes into a str.

>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'

But you can't freely mix the two types.

>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
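
One way around this error, sketched here under the assumption that the data is UTF-8, is to convert explicitly so that both operands have the same type before combining them:

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: encode the text, so both operands are bytes.
as_bytes = bom + text.encode('utf-8')

# Option 2: decode the bytes, so both operands are str.
as_text = bom.decode('utf-8') + text

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text[0] == '\ufeff')  # the three BOM bytes decode to one character
```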

The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.

>>> b'A' == b'\x41'
True

But I must emphasize, a character is not a byte.

>>> 'A' == b'A'
False
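
A short sketch that may make the difference more concrete: indexing or iterating a bytes object yields integers, while indexing a str yields another str.

```python
data = b'AB'

first = data[0]             # an int, not a one-character string
values = list(data)         # iterating bytes yields ints
one_byte_slice = data[:1]   # slicing keeps the bytes type

char = 'AB'[0]              # indexing a str yields a str

print(first, values, one_byte_slice, char)
```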

In Python 2.x

Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:

  • unicode = u'...' literals = sequence of Unicode characters = 3.x str
  • str = '...' literals = sequences of confounded bytes/characters
    • Usually text, encoded in some unspecified encoding.
    • But also used to represent binary data like struct.pack output.

In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.

So yes, b'...' literals in Python have the same purpose that they do in PHP.

Also, just out of curiosity, are there more symbols than the b and u that do other things?

The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
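
A brief sketch of the raw-string behaviour, using a regular expression as the typical use case (the sample pattern and input text are made up for this example):

```python
import re

# A normal literal interprets the escape; a raw literal keeps the backslash.
tab = '\t'
raw = r'\t'
print(len(tab), len(raw))  # 1 2

# Without the r prefix, each regex backslash must be doubled.
pattern_plain = '\\d+\\.\\d+'
pattern_raw = r'\d+\.\d+'
same_pattern = pattern_plain == pattern_raw

found = re.search(pattern_raw, 'version 3.12 released').group(0)
print(same_pattern, found)
```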

The f prefix (introduced in Python 3.6) creates a “formatted string literal” which can reference Python variables. For example, f'My name is {name}.' is shorthand for 'My name is {0}.'.format(name).
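
To illustrate the f prefix (requires Python 3.6 or later; the name value is a placeholder), and how prefixes can be combined:

```python
name = 'Ada'

# The f-string and the explicit format() call produce the same text.
formatted = f'My name is {name}.'
via_format = 'My name is {0}.'.format(name)
print(formatted == via_format)

# Any expression can go inside the braces.
product = f'{2 * 5}'
print(product)

# Prefixes combine: rb gives raw bytes, so the backslash is kept literally.
raw_bytes = rb'\x41'
print(len(raw_bytes), len(b'\x41'))  # 4 1
```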

Goodloe answered 8/6, 2011 at 2:34 Comment(13)
Thanks! I understood it after reading these sentences: "In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x."Wiese
The 'A' == b'A' --> False check really makes it clear. The rest of it is excellent, but up to that point I hadn't properly understood that a byte string is not really text.Kolomna
'שלום עולם' == 'hello world'Odawa
+1 for the .decode('UTF-8'). Was searching for how to change my b' string received over server POST request back to unicode.Cachucha
A CHARACTER IS NOT A BYTE is a wrong logical deduction from the C++ draft. C++ never had that kind of "idea". C++ defines a byte as an addressable unit of data storage large enough to hold any member of the basic character set of the execution environment. That's like saying a glass can hold water. Every water is a glass.Tersina
b"some string".decode('UTF-8'), I believe that's the line many are looking forLisk
In addition to u, b and r, Python 3.6 introduced f-strings for string formatting. Example: f'The temperature is {tmp_value} Celsius'Perth
the decode missed parentheses for me. (b'\xE2\x82\xAC').decode('UTF-8') worked.Exocarp
Can I suggest an edit? But I must emphasize, a character is not a byte.. Can you add immediately after that what IS a character? Because a precise definition of what is a character would help so much to understand.Natty
@Tersina True - a byte is not an octet of bits. So, technically, a character can be a byte.Pilloff
It's not even clear what "a character is not a byte" means here. Probably "a character IN PYTHON is not a byte" was meant? Adding that does not look excessive to me.Porty
I've expanded on the distinction between "character" and "byte". Let me know if any further clarification is required.Goodloe
"The b'...' notation is somewhat confusing in that it allows..." - important note.Anaconda
M
44

The b denotes a byte string.

Bytes are the actual data. Strings are an abstraction.

If you had a multi-character string object and you took a single character, it would still be a string, and it might take more than 1 byte depending on the encoding.

If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
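
A small sketch of that last point, assuming UTF-8 as the encoding: one character can become two bytes, and the first byte alone is not decodable text.

```python
text = '\u00e9'                  # one character: e with an acute accent
encoded = text.encode('utf-8')   # two bytes in UTF-8
print(len(text), len(encoded))   # 1 2

# Taking only the first byte leaves an incomplete character.
first_byte = encoded[:1]
try:
    first_byte.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```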

TBH I'd use strings unless I had some specific low level reason to use bytes.

Millham answered 7/6, 2011 at 18:34 Comment(0)
S
32

On the server side, any response we send goes out as bytes, so on the client it will appear as b'Response from server'.

To get rid of the b'....', simply use the code below:

Server file:

stri="Response from server"    
c.send(stri.encode())

Client file:

print(s.recv(1024).decode())

This prints Response from server.
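
For a version that can be run as a single script, here is a sketch that uses socket.socketpair() as a stand-in for the separate server and client connections (c and s above); that substitution is an assumption made only to keep the example self-contained.

```python
import socket

# A connected pair of sockets stands in for the server and client ends.
server_side, client_side = socket.socketpair()
try:
    # Sockets carry bytes, so the text is encoded before sending.
    server_side.sendall('Response from server'.encode())

    # recv() returns bytes; for a message this small one call is normally enough.
    raw = client_side.recv(1024)
    message = raw.decode()
finally:
    server_side.close()
    client_side.close()

print(repr(raw))   # shown with the b prefix because it is a bytes object
print(message)     # plain text after decoding
```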

Serra answered 17/8, 2018 at 7:27 Comment(4)
It doesn't explain the question that Jesse Webb has asked!Comminate
I was saying that without using the encode and decode methods, the string output will be prefixed with b' ' because Python treats it as a bytes type instead of a string type. If you don't want to get an output like b'...', use the above; that's it. What didn't you understand?Serra
Actually this is exactly the answer to the title of the question that was asked: Q: "What does b'x' do?" A: "It does 'x'.encode()" That is literally what it does. The rest of the question wanted to know much more than this, but the title is answered.Atingle
@MichaelErickson no, b'x' does not "do 'x'.encode()". It simply creates a value of the same type. If you don't believe me, try evaluating b'\u1000' == '\u1000'.encode().Crissie
G
29

The answer to the question is that it does:

data.encode()

To decode it (removing the b, because sometimes you don't need it), use:

data.decode()
Generalship answered 18/11, 2020 at 7:18 Comment(1)
This is incorrect. bytes literals are interpreted at compile time by a different mechanism; they are not syntactic sugar for a data.encode() call, a str is not created in the process, and the interpretation of text within the "" is not the same. In particular, e.g. b"\u1000" does not create a bytes object representing Unicode character 0x1000 in any meaningful encoding; it creates a bytes object storing numeric values [92, 117, 49, 48, 48, 48] - corresponding to a backslash, lowercase u, digit 1, and three digit 0s.Crissie
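
The point made in the comment can be checked with a short sketch. It uses a \x escape rather than \u, on the assumption that this shows the same difference without relying on an escape that bytes literals do not recognise:

```python
# In a bytes literal, \xe9 is exactly one byte with the value 0xE9.
literal = b'\xe9'

# In a str literal, \xe9 is one character; encoding it as UTF-8 gives two bytes.
encoded = '\xe9'.encode('utf-8')

print(list(literal))       # [233]
print(list(encoded))       # [195, 169]
print(literal == encoded)  # False

# For pure ASCII text the two happen to agree.
ascii_agrees = b'x' == 'x'.encode()
print(ascii_agrees)
```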
G
14

Here's an example where the absence of b would throw a TypeError exception in Python 3.x

>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

Adding a b prefix would fix the problem.
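
A runnable sketch of the fix, writing to a temporary directory so nothing is left behind. Note that recent Python 3 versions word the error differently ("a bytes-like object is required, not 'str'"), but it is still a TypeError.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'new')

    with open(path, 'wb') as f:
        # Writing a str to a file opened in binary mode fails.
        try:
            f.write('Hello Python!')
            str_write_failed = False
        except TypeError:
            str_write_failed = True

        # Writing bytes works and returns the number of bytes written.
        written = f.write(b'Hello Python!')

    with open(path, 'rb') as f:
        content = f.read()

print(str_write_failed, written, content)
```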

Gagger answered 23/6, 2014 at 7:2 Comment(0)
O
13

It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.

The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).

Ortensia answered 7/6, 2011 at 18:16 Comment(3)
This sounds wrong according to the documentation quoted in aix's answer; the b will be ignored in Python version other than 3.Cyclograph
It will be a str in 2.x either way, so it could be said that it is ignored. The distinction matters when you import unicode_literals from the __future__ module.Ortensia
"the b will be ignored in Python version other than 3." It will have no effect in 2.x because in 2.x, str names the same type that bytes does.Crissie
S
11

In addition to what others have said, note that a single character in unicode can consist of multiple bytes.

UTF-8, the most common Unicode encoding, keeps the old 7-bit ASCII codes (bytes of the form 0xxx xxxx) and adds multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, which makes it backwards-compatible with ASCII.

>>> len('Öl')  # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8')  # convert str to bytes 
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8'))  # 3 bytes encode 2 characters !
3
Shoffner answered 7/3, 2018 at 12:16 Comment(2)
This is useful supplementary information, but it does not address the question at all. It should be written as a comment to another answer instead.Crissie
A single character in Unicode does not consist of bytes in the first place. A Unicode character in a specific encoding (like UTF-8, UTF-16, UTF-32, or oddball ones like UTF-7) can consist of multiple bytes (for some of those, they're always multiple bytes), but Unicode characters are platonic ideals; they have no inherent byte representation.Fernandes
O
9

b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers which, if you mapped them to a character table, would look like h e l l o. However, the value itself is not a string; Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers themselves. This saves you some typing, and byte sequences are often meant to be interpreted as characters anyway. This is not always the case, though: for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.

.encode() and .decode() convert between strings and bytes.
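
To see the "sequence of numbers" directly, here is a short sketch. The last part uses the first three bytes of the JPEG format as an example of bytes that are not meant to be read as text.

```python
data = b"hello"

# The five numbers behind the literal.
numbers = list(data)
print(numbers)  # [104, 101, 108, 108, 111]

# The same bytes object can be built from the numbers themselves.
rebuilt = bytes(numbers)
print(rebuilt == data)

# Converting between str and bytes.
print(data.decode('ascii'), 'hello'.encode('ascii'))

# Bytes that are not text: a JPEG file starts with these three values.
jpeg_start = bytes([0xFF, 0xD8, 0xFF])
print(jpeg_start)
```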

Orpington answered 26/4, 2022 at 3:34 Comment(0)
P
6

You can use the json module to parse it into a dictionary; json.loads accepts bytes as well as str:

import json
data = b'{"key":"value"}'
print(json.loads(data))

{'key': 'value'}


FLASK:

This is an example from Flask. Run this in a Python shell:

import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})

In flask/routes.py

@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data) # --> b'{"key": "value"}'
    print(json.loads(request.data))
    return json.loads(request.data)

{'key': 'value'}

Peroneus answered 14/5, 2019 at 12:45 Comment(3)
This works well (I do the same for JSON data), but will fail for other type of data. If you have a generic str data, might be an XML for example, you can assign the variable and decode it. Something like data = request.data and then data = data.decode()Innkeeper
This does not answer the question. The question is about what the b means, not about what can be done with the object. Also, this can only be done with a very small subset of bytes literals, the ones that are formatted to the JSON specification.Crissie
dear @KarlKnechtel It doesn't answer this question directly that is true, but it is good for SEO for Stackoverflow if someone having this issue but isn't able to form the right question but only mentions like b' Flask/Django then this answer will be more relevant for the search engine to put it in front.Peroneus
W
1

bytes(somestring.encode()) is the solution that worked for me in Python 3.

def compare_types():
    output = b'sometext'
    print(output)
    print(type(output))


    somestring = 'sometext'
    encoded_string = somestring.encode()
    output = bytes(encoded_string)
    print(output)
    print(type(output))


compare_types()
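
As a side note in sketch form: str.encode() already returns a bytes object, so the three spellings below give equal results (the default encoding for encode() is UTF-8).

```python
somestring = 'sometext'

encoded = somestring.encode()                  # already a bytes object
wrapped = bytes(encoded)                       # wrapping it again changes nothing
from_constructor = bytes(somestring, 'utf-8')  # bytes() needs an encoding for a str

print(type(encoded).__name__)
print(encoded == wrapped == from_constructor == b'sometext')
```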
Whitver answered 7/9, 2022 at 9:47 Comment(0)
S
1

Answering questions 1 and 2: the b prefix makes the literal a bytes object instead of an ordinary str. For example:

>>> type(b'')
<class 'bytes'>
>>> type('')
<class 'str'> 

Answering question 3: it can be used when we want to work with a byte stream (a sequence of bytes) from some file or object. For example, to compute the SHA-1 message digest of a file:

import hashlib

def hash_file(filename):
   """This function returns the SHA-1 hash of the file passed into it"""

   # make a hash object
   h = hashlib.sha1()

   # open file for reading in binary mode
   with open(filename,'rb') as file:

       # loop till the end of the file
       chunk = 0
       while chunk != b'':
           # read only 1024 bytes at a time
           chunk = file.read(1024)
           h.update(chunk)

   # return the hex representation of digest
   return h.hexdigest()

message = hash_file("somefile.pdf")
print(message)
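
A variant sketch of the same loop, where the empty bytes literal b'' is the sentinel passed to iter(). The function name hash_stream and the in-memory io.BytesIO input are hypothetical choices made so the example runs without a file on disk.

```python
import hashlib
import io

def hash_stream(stream, chunk_size=1024):
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    h = hashlib.sha1()
    # iter() keeps calling read() until it returns the sentinel b'' (end of data).
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        h.update(chunk)
    return h.hexdigest()

# An in-memory binary stream stands in for open(filename, 'rb').
digest = hash_stream(io.BytesIO(b'abc'))
print(digest)
```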
Sommerville answered 26/7, 2023 at 11:25 Comment(0)
