Writing and then reading a string in a file encoded in Latin-1

Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:

s = 'On écrit ça dans un fichier.'
# print() appends a newline after the string
with open('spam1.txt', 'w', encoding='ISO-8859-1') as f:
    print(s, file=f)
# write() stores the string exactly as given
with open('spam2.txt', 'w', encoding='ISO-8859-1') as f:
    f.write(s)

The second one reads the same files with the same encoding:

with open('spam1.txt', 'r', encoding='ISO-8859-1') as f:
    s1 = f.read()
with open('spam2.txt', 'r', encoding='ISO-8859-1') as f:
    s2 = f.read()

Now, printing s1 and s2 I get

On écrit ça dans un fichier.

instead of the initial "On écrit ça dans un fichier."

What is wrong? I also tried with io.open, but I am still missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...

Could someone help me?

Oehsen answered 22/7, 2013 at 14:34 Comment(2)
Are you 100% certain the files were written with Latin-1 encoding? That looks an awful lot like UTF-8 data. – Galley
>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1') gives 'On écrit ça dans un fichier.' – Galley

Your data was written out as UTF-8:

>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
'On écrit ça dans un fichier.'

This either means you did not write out Latin-1 data, or your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.
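The round trip shown at the top of this answer can also be run in reverse. Here is a minimal sketch, assuming the garbled text really is UTF-8 bytes that were decoded as Latin-1 and that no bytes were lost along the way. The variable names are mine, and the literal uses escape codes so that the example does not depend on how this snippet itself is saved:

```python
# The intended text, written with escapes so the source encoding cannot interfere.
original = 'On \u00e9crit \u00e7a dans un fichier.'

# UTF-8 bytes misread as Latin-1: each accented letter turns into two characters.
garbled = original.encode('utf-8').decode('latin-1')

# Undo the mistake: go back to the bytes, then decode them the right way.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled)
print(repaired)
```

This only repairs a string that was damaged in exactly this way; it is not a general fix for encoding problems.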

If you saved your Python script with a header like:

# -*- coding: latin-1 -*-

but your text editor saved the file with UTF-8 encoding instead, then the string literal:

s='On écrit ça dans un fichier.'

will be misinterpreted by Python as well, in the same manner. Saving the resulting unicode value to disk as Latin-1, then reading it again as Latin-1 will preserve the error.
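To see why the file round trip cannot fix anything, here is a small sketch that starts from the already-misread string and writes and reads it as Latin-1. The temporary file name is hypothetical, and the misread value is simulated with encode/decode rather than with a real mis-declared script:

```python
import os
import tempfile

intended = 'On \u00e9crit \u00e7a dans un fichier.'

# What Python ends up holding when UTF-8 source bytes are decoded as Latin-1.
misread = intended.encode('utf-8').decode('latin-1')

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'spam_demo.txt')

with open(path, 'w', encoding='ISO-8859-1') as f:
    f.write(misread)
with open(path, 'r', encoding='ISO-8859-1') as f:
    read_back = f.read()

# Inspect the raw bytes that actually landed on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(read_back == misread)            # the error survives the round trip
print(raw == intended.encode('utf-8'))  # the file is byte-for-byte UTF-8
```

The file I/O is doing its job faithfully here: the string was wrong before it was written, which is also why the file on disk looks like UTF-8 data.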

To debug, please take a close look at print(s.encode('unicode_escape')) in the first script. If it looks like:

b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'

then your source code encoding and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the output is:

b'On \\xe9crit \\xe7a dans un fichier.'
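If you would rather have the check done for you than compare escape sequences by eye, something along these lines could work. It is a heuristic sketch, and looks_double_decoded is a hypothetical helper name, not a standard library function: it reports True when a string can be re-encoded as Latin-1 and then decoded as valid UTF-8 into something different.

```python
def looks_double_decoded(text):
    """Heuristic: True if text looks like UTF-8 bytes that were decoded as Latin-1."""
    try:
        recovered = text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1, or the bytes are not valid UTF-8.
        return False
    # Pure ASCII text survives unchanged, so it is not reported as damaged.
    return recovered != text


good = 'On \u00e9crit \u00e7a dans un fichier.'
bad = good.encode('utf-8').decode('latin-1')

print(looks_double_decoded(good))
print(looks_double_decoded(bad))
print(bad.encode('unicode_escape'))
```

A correctly decoded French string fails the UTF-8 decode step (a lone 0xE9 byte is not valid UTF-8), which is what makes the test usable here. Rare legitimate strings could still trigger a false positive.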

If Spyder is stubbornly ignoring the PEP 263 header and reading your source as Latin-1 regardless, avoid non-ASCII characters in the literal and use escape codes instead: either \uxxxx Unicode code points:

s = 'On \u00e9crit \u00e7a dans un fichier.'

or \xhh one-byte escape codes for code points below 256:

s = 'On \xe9crit \xe7a dans un fichier.'
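As a quick sanity check, the two escape styles should produce the same string, with é as U+00E9 and ç as U+00E7. A short sketch using the standard unicodedata module (the variable names are mine):

```python
import unicodedata

s_u = 'On \u00e9crit \u00e7a dans un fichier.'  # \uxxxx code points
s_x = 'On \xe9crit \xe7a dans un fichier.'      # \xhh one-byte escapes

print(s_u == s_x)

# Look up the official names of the two accented characters.
print(unicodedata.name(s_u[3]))
print(unicodedata.name(s_u[9]))

# Encoded as Latin-1, each accented letter is a single byte.
print(s_u.encode('latin-1'))
```

Because the source file then contains only ASCII, the result no longer depends on which encoding the editor or IDE assumes for the script.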
Galley answered 22/7, 2013 at 14:41 Comment(12)
@Coulombeau: without some landmarks, I cannot help you find your way. I gave you an indication of how to debug this. How about you update your question with the output of s.encode('unicode_escape') and poke me again? – Galley
Well, I'm lost! I edited only the first script, as you requested. I ran the test print(s.encode('unicode_escape')), which gave me the first buggy result you cited. Then I decided to add a header (which I hadn't done before) and put # -*- coding: utf-8 -*-, and also tried ascii and latin1. Nothing changed. Then I wrote these simple lines (I need to understand, so let's keep it very simple!): # -*- coding: utf-8 -*- s='On écrit ça dans un fichier.' print(s.encode('utf-8').decode('utf-8')) which gave me... On écrit ça dans un fichier. – Borsch
Sorry for the first useless comment; I published it by mistake and was then unable to edit it because 5 minutes had passed. – Borsch
Your editor is then saving your source code as Latin1 instead. – Galley
Well, can it be a problem with my distribution? I'm using Windows 7 with WinPython 3.3 (64-bit). I don't really have a choice of distribution, as I'm a teacher and that's what is installed on the computers of the CPGE (French system...) where I teach. Anyway, I'm editing with Spyder right now. Should I change editors, or maybe edit some configuration file? The bad result for print(s.encode('utf-8').decode('utf-8')) is really incomprehensible to me... – Borsch
That's because s itself is already incorrect. This is not a problem with your distribution but with how your source code is saved. – Galley
The first hex bytes of my file are 23 20 2D..., which correspond to "# -". So the encoding of the file should be ANSI or CP1252, but with the header # -*- coding: cp1252 -*- nothing changes... I should add that the string is displayed correctly in the editor, and in the console by print(s). – Borsch
@Coulombeau: And what does print(s.encode('unicode_escape')) tell you about the value? – Galley
OK, I think it's a bug in Spyder, or at least I've got a hint. I executed the very simple code with # -*- coding: utf-8 -*- as the header (because my hex editor gave me c3 c9 for é, even though the leading hex bytes for utf8 were missing), then defined s='On écrit ça dans la console.' on the second line and print(s) on the third. With Spyder, I get On écrit ça dans la console. in the console window, even though the string looks right in the editor window. Then I loaded and ran the same file (without editing) under IDLE, and the result was correct! – Borsch
@Coulombeau: Interesting! Seems Spyder is ignoring the PEP 263 header. You can use \uxxxx escapes instead when creating a literal. – Galley
The print(s.encode('unicode_escape')) output is right when the script is run from IDLE, and wrong when run from Spyder. – Borsch
Thanks a lot for your help. I wouldn't have thought about the encoding of my script.py if you hadn't pointed it out! – Borsch
