"SyntaxError: Non-ASCII character ..." or "SyntaxError: Non-UTF-8 code starting with ..." trying to use non-ASCII text in a Python script
C

7

323

I tried this code in Python 2:

def NewFunction():
    return '£'

But I get an error message that says:

SyntaxError: Non-ASCII character '\xa3' in file '...' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

Similarly, in Python 3, if I write the same code and save it with Latin-1 encoding, I get:

SyntaxError: Non-UTF-8 code starting with '\xa3' in file ... on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

How can I use a pound sign in string literals in my code?


See also: Correct way to define Python source code encoding for details about whether an encoding declaration is needed and how it should be written. Please use that question to close duplicates asking about how to write the declaration, and this one for questions asking about resolving the error.

Consubstantiate answered 14/5, 2012 at 19:12 Comment(0)
G
387

I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string-by-string basis in your code. However, if you are trying to put the pound sign literal into your code, you'll need an encoding that supports it for the entire file.

Guzzle answered 14/5, 2012 at 19:16 Comment(1)
Strings do not have encodings. Encodings are used to encode strings into bytes, or to decode bytes into strings. The coding declaration for a Python source file affects the text of the file itself - it has nothing to do with strings in the source code; it's equally relevant to variable names (which in 3.x may contain non-ASCII characters).Mcmahon
C
354

Adding the following two lines at the top of my .py script worked for me (first line was necessary):

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
Constellate answered 6/11, 2013 at 9:29 Comment(3)
I got the same problem and my Python is 2.7.11. After adding the second line # -*- coding: utf-8 -*- to the top of the file, it resolved the problem.Court
First line is to make the py file executable on *nix. It is not really related to this question.Forras
Of course, this doesn't help at all if the file's actual encoding is not UTF-8, as seems to be the case here.Herewith
W
58

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:

def NewFunction():
    return u'£'

or use the magic available since Python 2.6 to make it automatic:

from __future__ import unicode_literals
Windowshop answered 14/5, 2012 at 19:21 Comment(5)
If you have # -*- coding: utf-8 -*- you don't need to prefix your unicode strings with uAcuity
@Windowshop what if the text is in a variable, for example from reading a file? I can't write uVariable, so how do I do it?Peccant
@DanielLee Except this is not true. # -*- coding: utf-8 -*- followed by print 'błąd' will output garbage, while print u'błąd' works.Mirza
@DanielLee What Przemek D said. Putting UTF-8 literals into your source code like that is generally not a good idea, and can lead to unwanted behaviour, especially in Python 2. If literals aren't pure 7 bit ASCII they should be actual Unicode, not UTF-8, so in Python 2 you should put the u prefix on such literals. In Python 3, plain strings are Unicode anyway, but the u prefix is permitted in recent versions of Python 3 to make it a little easier to write code which behaves correctly in both Python 2 & 3.Gasiform
@Skizo-ozᴉʞS This particular error message (in the title of this question) would not happen in either of those scenarios. Generally speaking, you need to specify the encoding of any file you read, and if you want to print something to a device which uses a specific encoding, similarly specify the encoding or manually convert when you write. Python 3 simplifies this a lot, though there are still corner cases where you have to specify the encoding explicitly. Perhaps see also nedbatchelder.com/text/unipain.htmlHerewith
H
14

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.

If you want to return U+00A3 then you can say

return u'\u00a3'

which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's

return b'\xa3'

(where in Python 2 the b is implicit; but explicit is better than implicit).
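
As a minimal sketch of the difference (Python 3 assumed; the variable names are only illustrative), the escape sequence and the literal character are the same one-character string, while the bytes depend on which encoding is applied:

```python
# Python 3: the escape \u00a3 and the literal character are the same string.
pound = '\u00a3'
same_character = (pound == '£')

# The same one-character string becomes different bytes in different encodings.
latin1_bytes = pound.encode('latin-1')   # a single byte, 0xA3
utf8_bytes = pound.encode('utf-8')       # two bytes, 0xC2 0xA3

print(same_character, latin1_bytes, utf8_bytes)
```

This is why the error message mentions '\xa3': that is the byte a Latin-1 editor writes for the pound sign.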

The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be

# coding=utf-8

or the Emacs-compatible

# -*- coding: utf-8 -*-

If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow tag has a tag info page with more information and some troubleshooting tips.
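
If no hex editor is at hand, Python itself can list the suspicious bytes. The following is only a sketch: the helper name find_non_ascii_bytes and the temporary sample file are made up for illustration, and it reports where the non-ASCII bytes are, not which encoding they belong to.

```python
import os
import tempfile

def find_non_ascii_bytes(path):
    """Return (line_number, column, byte_value) for every byte above 0x7F."""
    hits = []
    with open(path, 'rb') as f:          # binary mode: no decoding happens
        for line_number, line in enumerate(f, start=1):
            for column, byte in enumerate(line):
                if byte > 0x7F:
                    hits.append((line_number, column, byte))
    return hits

# Demo on a temporary file saved as Latin-1 (hypothetical sample content).
sample = "def NewFunction():\n    return '\u00a3'\n".encode('latin-1')
with tempfile.NamedTemporaryFile(suffix='.py', delete=False) as tmp:
    tmp.write(sample)
hits = find_non_ascii_bytes(tmp.name)
os.remove(tmp.name)

for line_number, column, byte in hits:
    print(f"line {line_number}, column {column}: 0x{byte:02X}")
```

A lone 0xA3 suggests a legacy 8-bit encoding, while the pair 0xC2 0xA3 would be the UTF-8 form of the pound sign.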

In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have

# coding: latin-1

as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.

A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
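
A small sketch of that caveat, using plain bytes.decode rather than a real source file (the variable names are illustrative): Latin-1 accepts any byte sequence, so a wrong declaration fails silently.

```python
# The pound sign saved as UTF-8 is two bytes.
utf8_source = "return '\u00a3'".encode('utf-8')

# Latin-1 maps every byte value to a character, so decoding never fails...
misread = utf8_source.decode('latin-1')
# ...but the result is mojibake rather than the intended text.
print(misread)

# Decoding with the file's real encoding recovers the text.
correct = utf8_source.decode('utf-8')
print(correct)
```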

Herewith answered 13/6, 2018 at 7:43 Comment(2)
This is an adaptation of an earlier answer of mine to a duplicate question: https://mcmap.net/q/55961/-python-syntax-error-non-ascii-duplicateHerewith
Python 3 defaults to UTF-8 for source files, and you should probably be using UTF-8 for everything these days anyway. utf8everywhere.orgHerewith
V
10

Adding the following two lines in the script solved the issue for me.

# !/usr/bin/python
# coding=utf-8

Hope it helps !

Vaporish answered 6/12, 2019 at 8:52 Comment(2)
This effectively duplicates an earlier answer from 2013. What exactly to put in the shebang on the first line is somewhat system-dependent, but outside the scope of the discussion here.Herewith
Also, you can't have a space between # and !Herewith
A
7

You're probably trying to run a Python 3 file with a Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.

But in case you're indeed working on a Python 2 script, a solution not yet mentioned on this page is to resave the file in UTF-8+BOM encoding. That adds three special bytes to the start of the file, which explicitly inform the Python interpreter (and your text editor) about the file encoding.
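
To see what those three bytes are, here is a short sketch in Python 3. It relies on the standard 'utf-8-sig' codec and the codecs.BOM_UTF8 constant; the sample text is made up.

```python
import codecs

source_text = "print('\u00a3')\n"

# 'utf-8-sig' is the standard codec name for UTF-8 with a leading BOM.
with_bom = source_text.encode('utf-8-sig')
without_bom = source_text.encode('utf-8')

starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)
extra_bytes = len(with_bom) - len(without_bom)

print(starts_with_bom, extra_bytes, codecs.BOM_UTF8)
```

Apart from the prefix, the two byte strings are identical, which is why editors that do not expect a BOM may show it as stray characters at the top of the file.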

Acanthus answered 28/8, 2019 at 13:56 Comment(2)
BOMs in UTF-8 are a nuisance, though they are often necessary on Windows in particular.Herewith
(The real lesson is to avoid Windows, but it's not a comfortable one for many users.)Herewith
M
2

Summary

If this error occurs, use a coding declaration to tell Python the encoding of the source code (.py) file. Without such a declaration, Python 3.x will default to UTF-8; Python 2.x will default to ASCII. The declaration looks like a comment that contains a label coding:, followed by the name of a valid text encoding. All ASCII-transparent encodings are supported.

For example:

#!/usr/bin/env python
# coding: latin-1

Make sure of what encoding the file actually uses in order to write a correct encoding declaration. See How to determine the encoding of text for some hints. Alternately, try to use a different encoding, by checking the configuration options in your text editor.

The issue

Every file on a computer is composed of raw bytes, which are not inherently "text" even if the file is opened "in text mode". When a file is supposed to represent text (such as the source code of a Python program), it needs to be interpreted according to an encoding rule in order to make sense of the data.

However, there isn't an obvious way to indicate the encoding of a Python source file from outside the file - for example, the import syntax doesn't offer anywhere to write an encoding name (after all, it doesn't necessarily import from a source file, anyway). So, the encoding has to be described somehow by the file contents itself, and Python needs a way to determine that encoding on the fly.

In order to make this work in a consistent and reliable way, since version 2.3, Python uses a simple bootstrapping process to determine the file encoding. The procedure is described by PEP 263:

  • First, Python starts reading the raw bytes of the file. If it starts with a UTF-8 encoded byte-order mark - the bytes 0xEF 0xBB 0xBF - then Python discards these bytes and notes that the rest of the file should be UTF-8. (Files written this way are sometimes said to be in "utf-8-sig" encoding.) The rest of the process is still followed, to check for an incompatible coding declaration.

  • Next, Python attempts to read up to the next two lines of the file, using a default encoding (or UTF-8, if a byte-order mark was seen) - and universal newlines, of course:

    • If the first line is not a comment (noting that shebang lines are also comments in Python syntax), use the default encoding for the rest of the file.

    • Otherwise, if the first line is an encoding declaration (a comment that matches a specific regex), use the encoding that was declared for the rest of the file.

    • Otherwise, if the second line is an encoding declaration, use the encoding that was declared for the rest of the file.

    • Otherwise, use the default encoding for the rest of the file.

  • If the file started with a UTF-8 byte-order mark, and an encoding declaration other than UTF-8 was found, an exception is raised.
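
The standard library exposes this detection as tokenize.detect_encoding, which is documented to detect the encoding the same way the tokenizer does. A rough sketch under Python 3 (the helper name detected_encoding and the sample byte strings are made up for illustration):

```python
import io
import tokenize

def detected_encoding(source_bytes):
    """Ask the standard library which encoding it detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding

no_declaration = detected_encoding(b"x = 1\n")
first_line = detected_encoding(b"# coding: latin-1\nx = 1\n")
second_line = detected_encoding(
    b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n")
# A declaration on the third line is too late and is ignored.
third_line = detected_encoding(
    b"#!/usr/bin/env python\n# a comment\n# coding: latin-1\n")
bom_only = detected_encoding(b"\xef\xbb\xbfx = 1\n")

# A UTF-8 byte-order mark combined with a non-UTF-8 declaration is an error.
try:
    detected_encoding(b"\xef\xbb\xbf# coding: latin-1\nx = 1\n")
    bom_conflict = None
except SyntaxError as exc:
    bom_conflict = exc

print(no_declaration, first_line, second_line, third_line, bom_only)
```

Note that the returned names are normalized, so a declared latin-1 comes back under its canonical spelling.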

Python detects encoding declarations with this regex:

^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

This is deliberately permissive; it's intended to match several standard coding declarations that were already in use by other tools (such as the Vim and Emacs text editors).
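
To get a feel for how permissive it is, the pattern can be tried against a few candidate lines with the re module. This is just an experiment with the regex as quoted; the helper declared_encoding and the sample lines are illustrative.

```python
import re

# The pattern quoted above, applied to individual lines.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(line):
    """Return the encoding name declared on this line, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

samples = [
    '# coding=utf-8',
    '# -*- coding: latin-1 -*-',          # Emacs style
    '# vim: set fileencoding=utf-8 :',    # Vim style
    '# encoding: utf-8',                  # matches because it contains "coding:"
    'coding: utf-8',                      # not a comment, so no match
    '# coding : utf-8',                   # space before the colon, so no match
]
results = [declared_encoding(line) for line in samples]

for line, result in zip(samples, results):
    print(repr(line), '->', result)
```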

The syntax for the coding declaration is also designed so that only characters representable in ASCII are needed. Therefore, any "ASCII transparent" encoding can be used. The default encoding is also ASCII transparent; so if the first two lines include a coding declaration, it will be read properly, and if they don't, then the same (default) encoding will be used for the rest of the file anyway. The net effect is as if the correct encoding had been assumed the whole time, even though it wasn't known to begin with. Clever, right?

However, note well that UTF-16 and other non-ASCII-transparent encodings are not supported. In such encodings, the coding declaration cannot be read with the default encoding, so it won't be processed. A byte order mark can't be used to signal UTF-16, either: it simply isn't recognized. It appears that there was a plan to support this originally, but it was dropped.
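
The byte layout explains why. In this sketch a bytes version of the same regex (my adaptation, not Python's actual tokenizer code) is applied to one declaration line saved in two different encodings:

```python
import re

# Bytes version of the declaration pattern, for illustration only.
CODING_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

declaration = '# coding: utf-16\n'
ascii_transparent = declaration.encode('latin-1')
utf16 = declaration.encode('utf-16')   # BOM first, then two bytes per character

found_in_latin1 = CODING_RE.match(ascii_transparent) is not None
found_in_utf16 = CODING_RE.match(utf16) is not None

print(found_in_latin1, found_in_utf16)
print(utf16[:8])
```

In the UTF-16 bytes every ASCII character is interleaved with a zero byte, so an ASCII-based reader never sees the word "coding".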

Python 3.x

PEP 3120 changes the default encoding to UTF-8. Therefore, source files can simply be saved with UTF-8 encoding, contain arbitrary text according to the Unicode standard and be used without an encoding declaration. Plain ASCII data is also valid UTF-8 data, so there is still not a problem.

Use an encoding declaration if the source code must be interpreted with a different ASCII-transparent encoding, such as Latin-1 (ISO-8859-1) or Shift-JIS. For example:

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Assuming the file is actually encoded in Latin-1,
# the text character here would be represented as a 0xff byte.
# This would not be valid UTF-8 data, so the declaration is necessary,
# or else a SyntaxError will occur.
# In UTF-8, the text would be represented as 0xc3 0xbf.
print('ÿ')
# Similarly, without the encoding declaration, this line would print ÿ instead.
print('Ã¿')
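
The same behaviour can be tried without saving any files, because compile() also accepts source code as bytes and then honours the encoding declaration. A minimal sketch, assuming Python 3 (the byte strings and variable names are made up for the demonstration):

```python
# Source bytes as they would sit on disk in a Latin-1 file: the pound sign is 0xA3.
latin1_source = b"# -*- coding: iso-8859-1 -*-\nvalue = '\xa3'\n"

namespace = {}
exec(compile(latin1_source, '<latin-1 demo>', 'exec'), namespace)
decoded_with_declaration = namespace['value']

# UTF-8 source needs no declaration in Python 3.
utf8_source = "value = '\u00a3'\n".encode('utf-8')
namespace = {}
exec(compile(utf8_source, '<utf-8 demo>', 'exec'), namespace)
decoded_by_default = namespace['value']

# Without the declaration, the Latin-1 bytes are simply not valid UTF-8.
try:
    b"value = '\xa3'\n".decode('utf-8')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(decoded_with_declaration == decoded_by_default, decode_error)
```

The last step uses bytes.decode to show only the decoding failure; the exact wording of the SyntaxError you get from a real file depends on how the file is run.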

Python 2.x

The default encoding is ASCII. Therefore, an encoding declaration is necessary to write any non-ASCII text (such as £) in the source file.

Note that using Unicode text in 2.x still requires Unicode literals regardless of the source encoding. Specifying an encoding can allow Python 2.x to interpret 'Ã¿' as valid source code (and specifying Latin-1 correctly for a Latin-1 input, instead of UTF-8, can allow it to see that text as Ã¿ rather than ÿ), but that will still be a byte literal (unfortunately called str). To create an actual Unicode string, make sure to use either a u prefix or the appropriate "future import": from __future__ import unicode_literals.

(But then, it may still be necessary to do even more in order to make such a string printable, especially on Windows; and lots of other things can still go wrong. Python 3 fixes all of that automatically. For anyone sticking with ancient, unsupported versions because of an aversion to specifying encodings explicitly: please reconsider. "Explicit is better than implicit". The 3.x way is much easier and more pleasant in the long run.)

Other workarounds

Regardless of the encoding, Unicode escapes can be used to include arbitrary Unicode characters in a string literal:

>>> # With every supported source file encoding, the following is represented
>>> # with the same bytes in the source file, AND prints the same string:
>>> print('\xf8\u86c7\U0001f9b6')
ø蛇🦶

No matter what encoding is chosen for the source file, and whether or not it is declared (since this text is also valid ASCII and valid UTF-8), this should print a lowercase o with a line through it, the Chinese hanzi/Japanese kanji for "snake", and a foot emoji. (Assuming, of course, that your terminal supports these characters.)
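
A quick way to convince yourself of this, sketched for Python 3 (the variable names are illustrative): the literal's source text is pure ASCII, while the resulting string holds three non-ASCII code points.

```python
import unicodedata

# The source text of this literal is pure ASCII...
literal_source = r"'\xf8\u86c7\U0001f9b6'"
source_is_ascii = all(ord(ch) < 128 for ch in literal_source)

# ...but the string it produces holds three non-ASCII characters.
text = '\xf8\u86c7\U0001f9b6'
code_points = [hex(ord(ch)) for ch in text]
first_name = unicodedata.name(text[0])

print(source_is_ascii, len(text), code_points, first_name)
```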

However, this cannot be used in identifier names:

>>> ø = 'monty' # no problem in 3.x; see https://peps.python.org/pep-3131/
>>> 蛇 = 'python' # although a foot emoji is not a valid identifier
>>> # however:
>>> \xf8 = 'monty'
  File "<stdin>", line 1
    \xf8 = 'monty'
                 ^
SyntaxError: unexpected character after line continuation character
>>> \u86c7 = 'python'
  File "<stdin>", line 1
    \u86c7 = 'python'
                    ^
SyntaxError: unexpected character after line continuation character

The error is reported this way because the backslash (outside of a quoted string) is a line continuation character and everything after it is illegal.
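
The distinction can also be checked programmatically. In this sketch (Python 3 assumed; names are illustrative), an escape inside a string literal is turned into the real character before exec sees it, whereas a raw string keeps the backslash and reproduces the error:

```python
# Escapes are processed inside string literals only, so this string
# really contains the character U+00F8 and works as an assignment target.
namespace = {}
exec("\xf8 = 'monty'", namespace)
assigned = namespace['\xf8']

# A raw string keeps the backslash, reproducing the error shown above.
try:
    compile(r"\xf8 = 'monty'", '<demo>', 'exec')
    escape_error = None
except SyntaxError as exc:
    escape_error = exc

# str.isidentifier() reports which characters may be used in names.
valid = {ch: ch.isidentifier() for ch in ['\xf8', '\u86c7', '\U0001f9b6']}

print(assigned, type(escape_error).__name__, valid)
```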

Mcmahon answered 16/3, 2023 at 6:36 Comment(1)
This is the best answer, it tells me why.Malayopolynesian
