What is the difference between a string and a byte string?

I am working with a library which returns a "byte string" (bytes) and I need to convert this to a string.

Is there actually a difference between those two things? How are they related, and how can I do the conversion?

Painful answered 3/6, 2011 at 7:6 Comment(1)

See also What does the 'b' character do in front of a string literal? – Crumpled 1/6, 2021 at 8:47

Assuming Python 3 (in Python 2, this difference is a little less well-defined): a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept, and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes: things that can be stored on disk. The mapping between them is an encoding. There are quite a lot of these (and infinitely many are possible), and you need to know which one applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
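As a rough sketch of what "knowing which encoding applies" means in practice, the snippet below decodes the same bytes with the right codec, with a codec that cannot represent them, and with the built-in `errors='replace'` option. The variable names are only illustrative.

```python
data = b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

# Correct encoding: decoding and re-encoding round-trips cleanly.
text = data.decode('utf-8')
round_trip_ok = text.encode('utf-8') == data

# A codec that cannot represent these bytes raises instead of guessing.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
lossy = data.decode('ascii', errors='replace')
```

Note that a wrong codec does not always fail: as the utf-16 example shows, it may quietly produce different text, so an exception is not a reliable test of correctness.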

Panel answered 3/6, 2011 at 7:49 Comment(12)

To clarify for Python 2 users: the str type is the same as the bytes type; this answer is equivalently comparing the unicode type (does not exist in Python 3) to the str type. – Lymphoblast 10/11, 2016 at 16:2

To be technically correct, unicode is not the default encoding, rather the utf-8 encoding is the default character encoding to store unicode strings in memory. – Mcburney 10/5, 2017 at 9:3

@KshitijSaraogi that isn't quite true either; that whole sentence was edited in and is a bit unfortunate. The in-memory representation of Python 3 str objects is not accessible or relevant from the Python side; the data structure is just a sequence of codepoints. Under PEP 393, the exact internal encoding is one of Latin-1, UCS2 or UCS4, and a utf-8 representation may be cached after it is first requested, but even C code is discouraged from relying on these internal details. – Panel 10/5, 2017 at 9:46

If they can't be directly stored on disk, so how are they stored in memory? – Laxation 4/11, 2017 at 14:38

@orety they do have to be encoded somehow internally for exactly that reason, but this isn't exposed to you from Python code, much like you don't have to care about how floating point numbers are stored. – Panel 5/11, 2017 at 22:43

What is the default encoding in that case, i.e. the encoding used when reading lines from a file into a string? – Kaila 1/8, 2019 at 22:47

"these are an abstract concept" I disagree with this - it's not abstract at all. It exists in some form within the memory of the program. – Schmitz 15/2, 2020 at 16:10

@ChrisStryczynski see the comments above - sure they're stored in memory somehow, but that form is explicitly abstracted away. Indeed, these days, it can change during the lifetime of a program and be different between different strings or might even be more than one (some encodings are cached), depending on the characters in them - but the only time you need to worry about that is if you're hacking on the implementation of the string type itself. – Panel 16/2, 2020 at 7:56

I agree with @ChrisStryczynski. I understand the distinction you're making, but to imply that somehow a character string isn't bytes and doesn't have an encoding is confusing, at least to me. To be 'in a computer', a string must be bytes, and for anyone to read it, it must have some character encoding. This is meaningful if, e.g., you try to print a Chinese UTF-8 string in a terminal, but get '??????'. To me, understanding this helps to clarify what strings and encodings are. In this sense, byte is just a way for a programmer to be explicit about a character encoding, for whatever reason. – Stoichiometric 29/12, 2020 at 22:6

I think that "abstract" is not the right word for it. In the same way a running python program would not commonly be referred to as an "abstract turing machine". Possibly just the implementation could be varied or hidden from the user. – Schmitz 29/12, 2020 at 23:21

@Kaila You may specify an encoding parameter for the open call; and you may see the documentation to understand the default value for that parameter. The file contents may use any encoding; it is your responsibility to know (or find out somehow) which encoding was used, and specify it. This is not in any way a Python-specific issue. Files can only contain raw data as a sequence of bytes. – Shanell 3/9, 2022 at 0:22

@ChrisStryczynski it is abstracted in the OOP sense: you are not supposed to know or care about what in-memory representation is used. – Shanell 3/9, 2022 at 0:24

The only thing that a computer can store is bytes.

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:

If you want to store music, you must first encode it using MP3, WAV, etc.
If you want to store a picture, you must first encode it using PNG, JPEG, etc.
If you want to store text, you must first encode it using ASCII, UTF-8, etc.

MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.

In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.

On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.

'I am a string'.encode('ASCII')

The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable; it's just that Python displays bytes in the printable ASCII range as the corresponding characters when you print them (other bytes appear as \x escapes). In Python, a byte string is represented by a b, followed by the byte string's ASCII representation.
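To make the point about the b'...' display concrete, here is a small sketch comparing a byte string that happens to be printable ASCII with one that is not. The variable names are only illustrative.

```python
ascii_bytes = 'I am a string'.encode('ascii')
greek_bytes = 'τ'.encode('utf-8')

# Printable ASCII bytes are displayed as characters ...
ascii_repr = repr(ascii_bytes)
# ... while any other byte is displayed as a \x escape.
greek_repr = repr(greek_bytes)

print(ascii_repr)
print(greek_repr)

# Indexing or iterating a bytes object yields integers, not characters.
first_byte = ascii_bytes[0]
as_ints = list(greek_bytes)
```

The integers are the actual content; the characters in the b'...' form are only a display convenience.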

A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.

b'I am a string'.decode('ASCII')

The above code will return the original string 'I am a string'.

Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
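A minimal sketch of the "encode before writing, decode after reading" cycle, using a temporary file so it leaves nothing behind. The file name and the choice of UTF-8 are assumptions made for the example; the real encoding is whatever the file's producer used.

```python
import os
import tempfile

text = 'τoρνoς'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')

    # Binary mode: we do the encoding ourselves and write raw bytes.
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Binary mode read: we get bytes back and must decode them.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode read: open() decodes for us using the encoding we name.
    with open(path, 'r', encoding='utf-8') as f:
        decoded_by_open = f.read()

round_tripped = raw.decode('utf-8')
```

Passing `encoding=` explicitly in text mode avoids depending on the platform default.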

Ease answered 9/7, 2015 at 15:46 Comment(16)

Zenadix deserves some kudos here. After some years functioning in this environment, his is the first explanation that clicked with me. I may tattoo it on my other arm (one arm already has "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky). – Indocile 16/7, 2015 at 12:6

Absolutely brilliant. Lucid and easy to understand. However, I would like to mention that this line, "If you print it, Python will represent it as b'I am a string'", is true for Python 3 only; in Python 2, bytes and str are the same thing. – Artema 17/12, 2016 at 9:11

I am awarding you this bounty for offering a very human-readable explanation to put some clarity in this subject! – Drusy 8/1, 2017 at 15:8

Great answer. The only thing that could perhaps be added is to point out more clearly that historically, programmers and programming languages have tended to explicitly or implicitly assume that a byte sequence and an ASCII string were the same thing. Python 3 decided to explicitly break this assumption, correctly IMHO. – Leggat 17/1, 2017 at 9:39

IMHO, Python3 should've opted to print bytes as hexa values as a default behaviour with some easy function to convert to ascii or print in ascii. – Aa 7/3, 2017 at 18:56

Link to Joel's post mentioned by @Indocile above : joelonsoftware.com/2003/10/08/… – Mcburney 7/5, 2017 at 15:54

I really like this explanation. However, I think it doesn't correctly explain some behavior in python (2.7). For example, using os.urandom(32) creates a string (the repr of the returned bytes). To "decode" (using the meaning in this post) to a base64 string, one actually does encode('base64'). This is strange and is directly counter to what this post describes. – Listlessness 2/10, 2017 at 3:56

superb explanation, before this i had some confusion but now clear. thanks zenadix – Magic 21/4, 2018 at 2:48

Give this man a cookie! No disrespect, thank you very much for your detailed explanation. – Torey 16/5, 2018 at 20:13

In the case of strings everything is clear. We just encode some abstraction to bytes: 'I am a string'.encode('ASCII'). But what about an image? An image is an image and it is already stored on disk. So what are we encoding in the case of an image? – Piceous 2/4, 2020 at 19:53

One part I think might confuse some people: "A character string can't be directly stored in a computer...." At least to me, 'character string' is a term that means 'human symbols next to each other in a computer'. So, all character strings must be bytes 'stored' in a computer, and they must have an implicit encoding, the computer couldn't display them (encoding) in, e.g., an editor. I think implying otherwise is confusing. In my mind, Python's bytes is just a way for programmers to be explicit about character encoding. – Stoichiometric 29/12, 2020 at 22:10

That is exactly what I was looking for. Is there a way then to print the byte string into bytes? b'hello' => 68 65 6c 6c 6f – Buhl 22/11, 2021 at 8:48

"a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer" - What does this even mean? Isn't the character string already stored in the computer? Sure, it is already present there as bytes, and based on the implicit encoding scheme it is presented on stdout. And that is precisely what is happening in the case of a byte string as well. So what is the difference? – Egghead 15/1, 2022 at 7:43

@zarathoustra - list(b'hello') = [104, 101, 108, 108, 111] – Egghead 15/1, 2022 at 7:51
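For the hex form asked about above (b'hello' => 68 65 6c 6c 6f), one possible sketch uses the built-in `bytes.hex()` and `bytes.fromhex()`; the variable names are only illustrative.

```python
data = b'hello'

# Compact hex string, two digits per byte.
compact_hex = data.hex()

# Space-separated form, built by formatting each integer byte value.
spaced_hex = ' '.join(f'{byte:02x}' for byte in data)

# bytes.fromhex() goes the other way and skips the spaces.
restored = bytes.fromhex(spaced_hex)

print(compact_hex)
print(spaced_hex)
```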

This is the most clear explanation imho. However, I don't follow why b'hi'.decode(), b'hi'.decode('utf8') and b'hi'.decode('ascii') gives the same output. – Cozenage 27/1, 2022 at 11:55

@AndrewAnderson Python's decode defaults to utf8. Thus, decode() and decode('utf8') do the same thing. As for ascii, UTF8 is a superset of ASCII, and anything encoded with ASCII can be decoded with UTF8; the inverse isn't necessarily true. To be more precise, the first 128 code points in Unicode, which is the standard that defines UTF8, correspond one-to-one with ASCII; thus valid ASCII text is valid UTF-8-encoded Unicode as well. – Interrelated 28/10, 2023 at 18:25
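A quick way to check the superset claim for yourself, as a sketch with illustrative variable names:

```python
sample = b'hi'

# decode() with no argument uses UTF-8, so all three agree for ASCII input.
default_result = sample.decode()
utf8_result = sample.decode('utf8')
ascii_result = sample.decode('ascii')

# Every one of the 128 ASCII byte values decodes identically under both codecs.
all_ascii = bytes(range(128))
ascii_agrees_with_utf8 = all_ascii.decode('ascii') == all_ascii.decode('utf-8')

# The inverse does not hold: a non-ASCII character has no ASCII encoding.
try:
    'ρ'.encode('ascii')
    inverse_holds = True
except UnicodeEncodeError:
    inverse_holds = False
```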
