Why does ï»¿ appear in my data? [duplicate]

filename = 'pi_million_digits.txt'

with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))

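It looks like you're opening a file that starts with a UTF-8 encoded byte order mark (BOM) using the ISO-8859-1 encoding, or a close relative such as Windows-1252, which decodes these three bytes the same way (presumably because this is the default encoding on your OS).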
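To check which encoding your system is likely to pick, you can ask Python directly. This is a minimal sketch: it assumes that open() without an encoding argument falls back to the locale's preferred encoding, which holds for most Python 3 setups but can differ (for example, when UTF-8 mode is enabled). The variable name is only illustrative.

```python
import locale

# Encoding that open() typically falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints something like cp1252 or ISO-8859-1 rather than UTF-8, that would explain the stray characters.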

If you open it as bytes and read the first line, you should see something like this:

>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:

>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… and opening it as UTF-8 shows the actual BOM character U+FEFF:

>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

To strip the mark out, use the special encoding utf-8-sig:

>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
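The same three results can be reproduced without the file by decoding the raw bytes directly. This is a small sketch: the sample bytes are a shortened stand-in for the first line of the file, and the variable names are only illustrative.

```python
import codecs

# Shortened stand-in for the first line of the file: UTF-8 BOM followed by digits.
raw = codecs.BOM_UTF8 + b"3.14159\n"

as_latin1 = raw.decode("iso-8859-1")   # each BOM byte becomes its own character
as_utf8 = raw.decode("utf-8")          # the BOM decodes to the single character U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # a leading BOM is dropped

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_utf8_sig))
```

Only the utf-8-sig result starts with the digit 3; the other two carry the BOM along in one form or another.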

The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.

with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
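Put together, the corrected script might look something like the sketch below. So that it runs on its own, it first writes a small hypothetical sample file (pi_sample.txt in a temporary directory, standing in for pi_million_digits.txt) that starts with a BOM; in your code you would keep your own filename and skip that step.

```python
import os
import tempfile

# Hypothetical sample file standing in for pi_million_digits.txt.
# Writing with utf-8-sig puts a BOM at the start, like the original file.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "pi_sample.txt")
with open(filename, "w", encoding="utf-8-sig") as file_object:
    file_object.write("3.1415926535\n  8979323846\n")

# Reading with utf-8-sig drops the leading BOM.
with open(filename, encoding="utf-8-sig") as file_object:
    pi_string = "".join(line.strip() for line in file_object)

print(pi_string)
print(len(pi_string))
```

Joining the stripped lines in one pass is equivalent to the += loop in the question; the only change that matters for the BOM is the encoding argument.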
