Why does ï»¿ appear in my data? [duplicate]

I downloaded the file 'pi_million_digits.txt' from here:

https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt

I then used this code to open and read it:

filename = 'pi_million_digits.txt'

with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))

However, the output is correct except that it is preceded by some strange symbols: "ï»¿3.141...."

What causes these strange symbols? I am stripping the lines so I'd expect such symbols to be removed.

Kaule answered 18/5, 2017 at 16:39 Comment(2)
The file is likely corrupted. (Subsonic)
I have looked at the file in a text editor and it looks OK. Could it look OK in the text editor and still be corrupted? (Kaule)

It looks like you're opening a file with a UTF-8 encoded Byte Order Mark using the ISO-8859-1 encoding (presumably because this is the default encoding on your OS).

If you open it as bytes and read the first line, you should see something like this:

>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:

>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… and opening it as UTF-8 shows the actual BOM character U+FEFF:

>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

To strip the mark out, use the special encoding utf-8-sig:

>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
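
The decodings above can also be reproduced without the file, by decoding the BOM bytes directly. This is only a minimal sketch: `sample` is a made-up stand-in for the first bytes of the file, not the real data.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom = codecs.BOM_UTF8
sample = bom + b"3.1415"  # hypothetical stand-in for the start of the file

as_latin1 = sample.decode("iso-8859-1")   # each BOM byte becomes one character
as_utf8 = sample.decode("utf-8")          # the three bytes become U+FEFF
as_utf8_sig = sample.decode("utf-8-sig")  # the codec drops the BOM

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_utf8_sig))
```

The same bytes give three different strings, so the "strange symbols" come from the choice of encoding rather than from damage to the file.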

The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.

with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
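
As for why stripping the lines does not help: strip() with no arguments only removes whitespace, and neither U+FEFF nor the characters ï»¿ count as whitespace. The sketch below illustrates this on a small temporary file that stands in for pi_million_digits.txt. The sample digits and the helper name `read_digits` are assumptions made for the example, not part of the original code.

```python
import codecs
import os
import tempfile

# Build a small stand-in for pi_million_digits.txt: a UTF-8 BOM followed by
# a few indented lines of digits (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"3.14159\n  26535\n  89793\n")
    filename = tmp.name


def read_digits(path, encoding):
    """Join the stripped lines of the file into one string."""
    with open(path, encoding=encoding) as file_object:
        return "".join(line.strip() for line in file_object)


with_bom = read_digits(filename, "utf-8")         # BOM survives strip() as U+FEFF
without_bom = read_digits(filename, "utf-8-sig")  # BOM removed by the codec
os.remove(filename)

print(repr(with_bom))
print(repr(without_bom))
```

Only the utf-8-sig read yields a string that starts with the digit 3. Files without a BOM are also read correctly with utf-8-sig, so it is a reasonable choice when the source of a UTF-8 file is unknown.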
Azilian answered 18/5, 2017 at 17:36 Comment(0)