Handling French letters in Python

I am reading data from a file which contains words with French and English letters. I am trying to build a list of every distinct English and French letter (stored as strings), using the code below:

# encoding: utf-8
import urllib2

def trackLetter(letters, line):
    for a in line:
        found = False
        for b in letters:
            if b==a:
                found = True
        if not found:
            letters += a

cur_letters = []  # for storing possible letters

data = urllib2.urlopen('https://duolinguist.wordpress.com/2015/01/06/top-5000-words-in-french-wordlist/', 'utf-8')
for line in data:
    trackLetter(cur_letters, line)
    # works if I print here

print cur_letters

This code prints the following:

['t', 'h', 'e', 'o', 'f', 'a', 'n', 'd', 'i', 'r', 's', 'b', 'y', 'w', 'u', 'm', 'l', 'v', 'c', 'p', 'g', 'k', 'x', 'j', 'z', 'q', '\xc3', '\xa0', '\xaa', '\xb9', '\xa9', '\xa8', '\xb4', '\xae', '-', '\xe2', '\x80', '\x99', '\xa2', '\xa7', '\xbb', '\xaf']

Obviously the French letters have been lost in some sort of conversion to ASCII, even though I specified the UTF-8 encoding! The strange thing is that when I print each line directly (at the spot marked by the comment), the French characters appear perfectly!

What should I do to preserve these characters (é, è, ê, etc.), or convert them back to their original version?

Ivey answered 24/11, 2016 at 20:10 Comment(2)
Possible duplicate of "Unicode (utf8) reading and writing to files in python" - Crassus
No, reading the file isn't the issue; see the OP's "works if I print here" comment - Newby

They aren't lost, they're just escaped when you print the list.

When you print a list in Python 2, Python calls the __str__ method of the list itself, not of each individual item, and the list builds its output from the repr() of every element. For a byte string, repr() escapes non-ASCII bytes, which is why 'é' shows up as '\xc3\xa9'. See this excellent answer for more explanation:

How does str(list) work?

The following snippet demonstrates the issue succinctly:

char_list = ['é', 'è', 'ê']
print(char_list)
# ['\xc3\xa9', '\xc3\xa8', '\xc3\xaa']

print(', '.join(char_list))
# é, è, ê
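
To illustrate the mechanism described above in a way that does not depend on the console encoding, here is a minimal sketch using a hypothetical Letter class (not part of the original code) whose __str__ and __repr__ differ. It suggests that printing a list goes through each element's repr(), while joining goes through str():

```python
class Letter(object):
    """Hypothetical helper whose str() and repr() differ on purpose."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Human-readable form, used by print(item) and str(item)
        return self.name

    def __repr__(self):
        # Debugging form, used by the list when it builds its own string
        return 'Letter(%r)' % self.name


items = [Letter('e'), Letter('a')]

as_list = str(items)                        # the list formats each element with repr()
joined = ', '.join(str(x) for x in items)   # each element is formatted with str()

print(as_list)  # [Letter('e'), Letter('a')]
print(joined)   # e, a
```

The same thing happens with byte strings in Python 2: the escaped form is only how the list displays its elements, not a change to the stored data.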
Newby answered 24/11, 2016 at 20:33 Comment(4)
That's definitely helpful, although it doesn't seem to fix my issue. Your code works perfectly for me, but for some reason when I call print(''.join(cur_letters)) at the end of my code it gives me the error [Decode error - output not utf-8] - Ivey
This error is even thrown in my trackLetter() function if I call print type(a) on the French characters - Ivey
Ah... does it solve your problem if you open the file via codecs.open("words.txt", "r", "utf-8")? - Newby
I simplified the problem in my original post for clarity; I am actually reading lines off a website (see edited post). - Ivey
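
For the website case discussed in these comments, one possible approach is to decode each line from UTF-8 bytes to text before looping over it, so that a letter such as é is one character instead of the two bytes '\xc3' and '\xa9' seen in the question's output. The sketch below is an assumption-laden illustration, not the asker's final code: track_letters and sample_lines are hypothetical names, the sample bytes stand in for lines read from the response, and it assumes the page really is UTF-8 encoded. Note also that the second positional argument of urllib2.urlopen is the request data, not an encoding, so passing 'utf-8' there does not decode anything.

```python
# -*- coding: utf-8 -*-

def track_letters(byte_lines, encoding='utf-8'):
    """Collect distinct non-whitespace characters, in first-seen order."""
    letters = []
    seen = set()
    for raw in byte_lines:
        # Decode bytes to text so multi-byte letters become single characters
        text = raw.decode(encoding)
        for ch in text:
            if ch not in seen and not ch.isspace():
                seen.add(ch)
                letters.append(ch)
    return letters


# Stand-in for lines read from a response object (assumed UTF-8 bytes)
sample_lines = [b'the\n', b'\xc3\xa9t\xc3\xa9\n', b'gar\xc3\xa7on\n']

found = track_letters(sample_lines)
joined = u', '.join(found)   # text string with one entry per distinct letter

print(len(found))  # 10
```

Printing joined should show the accented letters in readable form, provided the console or editor output pane can display that encoding; the "[Decode error - output not utf-8]" message mentioned above may come from the output pane rather than from Python itself.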

Not an ideal answer, but as a workaround the French characters can also be added manually:

french_letters = ['é',
        'à', 'è', 'ù',
        'â', 'ê', 'î', 'ô', 'û',
        'ç',
        'ë', 'ï', 'ü']

all_letters = cur_letters + french_letters
Ivey answered 24/11, 2016 at 21:49 Comment(0)
