\ufeff Invalid character in identifier
G

5

7

I have the following code:

import urllib.request

try:
    url = "https://www.google.com/search?q=test"

    headers = {}
    usag = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0'
    headers['User-Agent'] = usag.encode('utf-8-sig')
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    saveFile = open('withHeaders.txt','w')
    saveFile.write(str(respData))
    saveFile.close()

except Exception as e:
    print(str(e))

It gives me the following error:

D:\virtualenv\samples\urllibb>python 1.py
  File "1.py", line 35
    usag = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0'\ufeff
                                                                                              ^
 SyntaxError: invalid character in identifier

I can't see the \ufeff in my code though.

Grubb answered 28/1, 2016 at 19:48 Comment(3)
I missed that you used an image for the error. It is text, just copy and paste it here, that makes your error searchable (always a good idea).Mazza
I'm sorry but I can't copy from CMD... (or maybe I'm Noob enough)Grubb
Use the [-] menu on your CMD window, it has options there to select and copy.Mazza
M
18

\ufeff is the ZERO WIDTH NO-BREAK SPACE codepoint; it is not rendered when printing. It is used as a byte order mark (BOM) in UTF-16 and UTF-32 to record the order in which the encoded bytes are to be decoded (big-endian or little-endian).

UTF-8 doesn't need a BOM (it only has one fixed ordering of the bytes, no need to track an alternative), but Microsoft decided it was a handy signature character for their tools to detect UTF-8 files vs. 8-bit encodings (such as most of the windows codepages employ).
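As a small illustration of the above (a sketch, not part of the original answer), the same codepoint serializes to different bytes depending on the encoding, and the standard codecs module exposes those byte sequences as constants:

```python
import codecs

BOM = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte order mark

# The same codepoint encodes to different byte sequences per encoding.
utf8_bom = BOM.encode("utf-8")          # three bytes, no ordering ambiguity
utf16_le_bom = BOM.encode("utf-16-le")  # little-endian byte order
utf16_be_bom = BOM.encode("utf-16-be")  # big-endian byte order

# The codecs module provides the same sequences as named constants.
print(utf8_bom == codecs.BOM_UTF8)
print(utf16_le_bom == codecs.BOM_UTF16_LE)
print(utf16_be_bom == codecs.BOM_UTF16_BE)
```

In UTF-16 the two possible byte orders give two different sequences, which is what lets a decoder tell them apart; UTF-8 has only one.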

I suspect you are using a Microsoft text editor such as Notepad to save your code. Avoid that: it writes a BOM at the start of the file, and while Python accepts a UTF-8 BOM as the very first character of a source file, it rejects one anywhere else. You probably saved the file with Notepad, then continued with a different tool to add more code to the start, and the BOM got caught in the middle.

Either delete the whole line and the next and re-type them, or select from the closing quote of the string you define until just before the h of headers on the next line, delete that part and re-insert a newline and enough indentation.

If your editor supports using escape sequences when searching and replacing (SublimeText does when in regex mode, for example), you could just use that to search for the character and replace it with an empty string. In SublimeText, switch on regex support and search for \x{feff}, replacing those occurrences with an empty string.
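If the editor route is awkward, the same search-and-replace can be done from Python itself. The following is only a sketch; the helper names find_bom_positions and strip_bom_chars and the sample string are made up for illustration:

```python
BOM = "\ufeff"

def find_bom_positions(text):
    """Return 1-based (line, column) pairs for every U+FEFF in text."""
    positions = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == BOM:
                positions.append((line_no, col))
    return positions

def strip_bom_chars(text):
    """Remove every U+FEFF, wherever it occurs in text."""
    return text.replace(BOM, "")

# Hypothetical sample mimicking the question: a BOM after the closing quote.
sample = "usag = 'Mozilla/5.0'\ufeff\nheaders = {}\n"

print(find_bom_positions(sample))
cleaned = strip_bom_chars(sample)
print(find_bom_positions(cleaned))
```

To apply this to a real script, read it with open(path, encoding="utf-8"), pass the text through strip_bom_chars, and write it back with the same encoding. Keep a backup of the file first.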

The Python utf-8-sig encoding that you are using here also includes that BOM:

headers['User-Agent'] = usag.encode('utf-8-sig')

HTTP headers should not include that codepoint either. HTTP headers typically stick to Latin-1 instead; even ASCII would suffice here, but otherwise use 'utf-8' (no -sig).
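To see the difference between the codecs, a quick check along these lines (a sketch reusing the user agent string from the question) shows that 'utf-8-sig' prepends the three BOM bytes, while 'utf-8' and 'ascii' produce identical output for this string:

```python
usag = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0'

with_sig = usag.encode('utf-8-sig')  # BOM bytes followed by the text
plain = usag.encode('utf-8')         # just the text
ascii_bytes = usag.encode('ascii')   # works because the string is pure ASCII

print(with_sig[:3])                  # the three extra bytes added by -sig
print(with_sig[3:] == plain)         # the remainder is the plain encoding
print(plain == ascii_bytes)          # identical for ASCII-only text
```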

You don't really need to use str.encode() there, you could also just define a bytestring:

headers = {}
usag = b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0'
headers['User-Agent'] = usag

Note the b prefix to the string literal.

Mazza answered 28/1, 2016 at 19:58 Comment(3)
hmm, doesn't work either :( i.imgur.com/J1zjdwF.jpg By the way I'm using sublime on Windows 7.Grubb
@Katallone: the character is still there, yes. It is really part of your file. The remark about using a byte string literal is extra, it was not meant to be a fix for that character being there. You still need to fix that, but once that is done, then you also need to fix the utf-8-sig problem, and the best way to do that is to just define your bytestring as a literal instead of encoding it.Mazza
@Grubb what do you think I've been trying to do here? What did you want me to do, come over to your place and edit the file for you? I've given you several methods to try out. Just complaining that they don't work doesn't tell me anything. I can't see your screen or see what you are doing right or wrong. Take into account that we volunteer our time, this is not my job. Sorry this isn't working for you but, really, tone down that sense of entitlement a little, please.Mazza
Q
7

Simply open the script file in Notepad++, go to the "Encoding" menu, select "Encode in UTF-8 without BOM" and save the file.
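Without Notepad++, a rough Python equivalent of re-saving without a BOM could look like the sketch below. The function name resave_without_leading_bom and the demo file are hypothetical; the approach relies on the 'utf-8-sig' codec dropping a single leading BOM when reading:

```python
import os
import tempfile

def resave_without_leading_bom(path):
    """Rewrite the file at path as UTF-8 without a leading BOM."""
    # 'utf-8-sig' skips one BOM at the start of the file, if present.
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

# Demo on a throwaway file that starts with the UTF-8 BOM bytes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.py")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbfprint('hi')\r\n")
    resave_without_leading_bom(demo_path)
    with open(demo_path, "rb") as f:
        result = f.read()

print(result)
```

Note that this sketch only removes a BOM at the very start of the file. A BOM stranded in the middle of a line, as in the question, would still have to be searched for and deleted separately.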

Quietude answered 2/2, 2016 at 19:24 Comment(0)
F
0

For *nix folk, just open the file with

[n]vim -b filename

then

:set list

You'll see it at the beginning of the first line. Since it's zero-width, you can't even delete it in text mode; I tried deleting the line and pasting in the text after stripping the character out in Python, and it was still there before character 0 in the text.

Fitzpatrick answered 12/8, 2022 at 19:21 Comment(0)
B
0

In the most upvoted answer, they recommended using regex search/replace in Sublime Text to replace the characters. I couldn't get that to work, but if you simply "Save with Encoding" and choose UTF-8 instead of UTF-8 with BOM, it will do what you need.

Baltic answered 27/2, 2023 at 21:58 Comment(0)
K
-1

The character is there, after the closing quote on the usag = 'Mozilla... line.

Kruller answered 28/1, 2016 at 19:54 Comment(1)
But they can't see it, presumably in their editor. That's because it is an invisible zero-width character.Mazza
