Summary
If this error occurs, use a coding declaration to tell Python the encoding of the source code (.py) file. Without such a declaration, Python 3.x will default to UTF-8; Python 2.x will default to ASCII. The declaration looks like a comment that contains the label coding:, followed by the name of a valid text encoding. All ASCII-transparent encodings are supported.
For example:
#!/usr/bin/env python
# coding: latin-1
Make sure of the encoding the file actually uses before writing an encoding declaration. See How to determine the encoding of text for some hints. Alternatively, re-save the file in a different encoding, using the configuration options in your text editor.
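One crude, self-contained hint (sketched below under the assumption that a shortlist of likely encodings is known in advance): try decoding the raw bytes with each candidate and discard the ones that fail outright. Note that plausible_encodings is a hypothetical helper name for this example, not a standard API, and that decoding successfully does not prove an encoding is correct.

```python
# Rule out encodings that cannot decode the raw bytes at all.
# (The candidate list is an arbitrary choice for this sketch; note that
# latin-1 accepts any byte sequence, so it can never be ruled out this way.)
def plausible_encodings(raw, candidates=('utf-8', 'latin-1', 'cp1252')):
    names = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        names.append(name)
    return names

print(plausible_encodings(b'caf\xc3\xa9'))  # 'café' as UTF-8: all three decode it
print(plausible_encodings(b'caf\xe9'))      # 'café' as Latin-1: not valid UTF-8
```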
The issue
Every file on a computer is composed of raw bytes, which are not inherently "text" even if the file is opened "in text mode". When a file is supposed to represent text (such as the source code of a Python program), it needs to be interpreted according to an encoding rule in order to make sense of the data.
However, there isn't an obvious way to indicate the encoding of a Python source file from outside the file - for example, the import syntax doesn't offer anywhere to write an encoding name (after all, it doesn't necessarily import from a source file, anyway). So, the encoding has to be described somehow by the file's own contents, and Python needs a way to determine that encoding on the fly.
In order to make this work in a consistent and reliable way, since version 2.3, Python uses a simple bootstrapping process to determine the file encoding. The procedure is described by PEP 263:
First, Python starts reading the raw bytes of the file. If it starts with a UTF-8 encoded byte-order mark - the bytes 0xEF 0xBB 0xBF - then Python discards these bytes and notes that the rest of the file should be UTF-8. (Files written this way are sometimes said to be in "utf-8-sig" encoding.) The rest of the process is still followed, to check for an incompatible coding declaration.
Next, Python attempts to read up to the next two lines of the file, using a default encoding (or UTF-8, if a byte-order mark was seen) - and universal newlines, of course:
- If the first line is not a comment (noting that shebang lines are also comments in Python syntax), use the default encoding for the rest of the file.
- Otherwise, if the first line is an encoding declaration (a comment that matches a specific regex), use the declared encoding for the rest of the file.
- Otherwise, if the second line is an encoding declaration, use the declared encoding for the rest of the file.
- Otherwise, use the default encoding for the rest of the file.
If the file started with a UTF-8 byte-order mark, but an encoding declaration other than UTF-8 was found, a SyntaxError is raised.
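Conveniently, this exact procedure is exposed in the standard library as tokenize.detect_encoding, which can be used to check what encoding Python will assume for given source bytes:

```python
import io
import tokenize

# tokenize.detect_encoding implements the PEP 263 procedure: it reads at
# most two lines (plus an optional byte-order mark) and reports the
# encoding Python would assume for this source.
def encoding_of(source_bytes):
    readline = io.BytesIO(source_bytes).readline
    return tokenize.detect_encoding(readline)[0]

print(encoding_of(b'print("hello")\n'))            # utf-8 (the 3.x default)
print(encoding_of(b'# coding: latin-1\nx = 1\n'))  # iso-8859-1 (normalized name)
print(encoding_of(b'\xef\xbb\xbfprint("hi")\n'))   # utf-8-sig (BOM was seen)

# A BOM combined with a conflicting declaration raises SyntaxError:
try:
    encoding_of(b'\xef\xbb\xbf# coding: latin-1\n')
except SyntaxError:
    print('SyntaxError raised')
```

Note that detect_encoding normalizes some names; declaring latin-1 reports the canonical iso-8859-1.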
Python detects encoding declarations with this regex:
^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
This is deliberately permissive; it's intended to match several standard coding declarations that were already in use by other tools (such as the Vim and Emacs text editors).
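For instance, applying that regex (here via Python's re module, with a hypothetical declared_encoding helper for illustration) shows how Emacs-style and Vim-style declarations are all recognized:

```python
import re

# The declaration-detection regex quoted above.
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(line):
    """Return the encoding named on this line, or None if there is no cookie."""
    match = cookie_re.match(line)
    return match.group(1) if match else None

print(declared_encoding('# coding: latin-1'))                 # latin-1
print(declared_encoding('# -*- coding: utf-8 -*-'))           # utf-8 (Emacs style)
print(declared_encoding('# vim: set fileencoding=cp1252 :'))  # cp1252 (Vim style)
print(declared_encoding('x = 1'))                             # None (not a comment)
```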
The syntax for the coding declaration is also designed so that only characters representable in ASCII are needed. Therefore, any "ASCII transparent" encoding can be used. The default encoding is also ASCII transparent; so if the first two lines include a coding declaration, it will be read properly, and if they don't, then the same (default) encoding will be used for the rest of the file anyway. The net effect is as if the correct encoding had been assumed the whole time, even though it wasn't known to begin with. Clever, right?
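A quick way to see what "ASCII transparent" means in practice: the same ASCII bytes decode to the same text under any such encoding, which is what lets the coding line be read before the real encoding is known. (The encodings listed here are just a sample.)

```python
line = b'# -*- coding: koi8-r -*-\n'  # pure ASCII bytes

# Every ASCII-transparent encoding maps these bytes to the same characters,
# so the declaration reads identically no matter which one is assumed.
for enc in ('ascii', 'utf-8', 'latin-1', 'cp1252', 'koi8-r'):
    assert line.decode(enc) == '# -*- coding: koi8-r -*-\n'
print('same text under all sample encodings')
```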
However, note well that UTF-16 and other non-ASCII-transparent encodings are not supported. In such encodings, the coding declaration cannot be read with the default encoding, so it won't be processed. A byte order mark can't be used to signal UTF-16, either: it simply isn't recognized. It appears that there was a plan to support this originally, but it was dropped.
Python 3.x
PEP 3120 changes the default encoding to UTF-8. Therefore, source files can simply be saved with UTF-8 encoding, contain arbitrary text according to the Unicode standard and be used without an encoding declaration. Plain ASCII data is also valid UTF-8 data, so there is still not a problem.
Use an encoding declaration if the source code must be interpreted with a different ASCII-transparent encoding, such as Latin-1 (ISO-8859-1) or Shift-JIS. For example:
#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Assuming the file is actually encoded in Latin-1,
# the character ÿ here would be represented as a single 0xff byte.
# That byte is not valid UTF-8 data, so the declaration is necessary,
# or else a SyntaxError will occur.
print('ÿ')
# In UTF-8, ÿ would be represented as the two bytes 0xc3 0xbf, which
# Latin-1 reads as 'Ã¿'. So, without the encoding declaration,
# this line would print ÿ instead.
print('Ã¿')
Python 2.x
The default encoding is ASCII. Therefore, an encoding declaration is necessary to write any non-ASCII text (such as £) in the source file.
Note that using Unicode text in 2.x still requires Unicode literals regardless of the source encoding. Specifying an encoding can allow Python 2.x to interpret 'ÿ' as valid source code (and specifying Latin-1 correctly for a Latin-1 input, instead of UTF-8, can allow it to see that text as ÿ rather than Ã¿), but the result will still be a byte literal (unfortunately called str). To create an actual Unicode string, make sure to use either a u prefix or the appropriate "future import": from __future__ import unicode_literals.
(But then, it may still be necessary to do even more in order to make such a string printable, especially on Windows; and lots of other things can still go wrong. Python 3 fixes all of that automatically. For anyone sticking with ancient, unsupported versions because of an aversion to specifying encodings explicitly: please reconsider. "Explicit is better than implicit". The 3.x way is much easier and more pleasant in the long run.)
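The mojibake effect described above (UTF-8 bytes being read as Latin-1) can be reproduced directly, shown here with Python 3 syntax:

```python
# 'ÿ' is the single byte 0xff in Latin-1, but the two bytes 0xc3 0xbf
# in UTF-8. Decoding those UTF-8 bytes as Latin-1 turns the one
# character into two: the classic mojibake effect.
utf8_bytes = 'ÿ'.encode('utf-8')
print(utf8_bytes)                    # b'\xc3\xbf'
print(utf8_bytes.decode('latin-1'))  # Ã¿  (the mojibake reading)
```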
Other workarounds
Regardless of the encoding, Unicode escapes can be used to include arbitrary Unicode characters in a string literal:
>>> # With every supported source file encoding, the following is represented
>>> # with the same bytes in the source file, AND prints the same string:
>>> print('\xf8\u86c7\U0001f9b6')
ø蛇🦶
No matter what encoding is chosen for the source file, and whether or not it is declared (since this text is also valid ASCII and valid UTF-8), this should print a lowercase o with a line through it, the Chinese hanzi/Japanese kanji for "snake", and a foot emoji. (Assuming, of course, that your terminal supports these characters.)
However, this cannot be used in identifier names:
>>> ø = 'monty' # no problem in 3.x; see https://peps.python.org/pep-3131/
>>> 蛇 = 'python' # although a foot emoji is not a valid identifier
>>> # however:
>>> \xf8 = 'monty'
File "<stdin>", line 1
\xf8 = 'monty'
^
SyntaxError: unexpected character after line continuation character
>>> \u86c7 = 'python'
File "<stdin>", line 1
\u86c7 = 'python'
^
SyntaxError: unexpected character after line continuation character
The error is reported this way because the backslash (outside of a quoted string) is a line continuation character and everything after it is illegal.
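For comparison, here is the one place a bare backslash is legal outside a string: as the very last character of a physical line, where it joins that line to the next.

```python
# A backslash at the end of a physical line continues the statement
# onto the next line; any other character after the backslash (or a
# backslash anywhere else outside a string) is a SyntaxError.
total = 1 + \
        2
print(total)  # 3
```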