python: lower() german umlauts

Asked 24/2, 2013 at 14:43 Answered 24/2, 2013 at 16:0

Solved python unicode diacritics lowercase case-folding

I have a problem with converting uppercase letters with umlauts to lowercase ones.

print("ÄÖÜAOU".lower())

The A, O and U get converted properly, but the Ä, Ö and Ü stay uppercase. Any ideas?

The first problem is fixed with .decode('utf-8'), but I still have a second one:

# -*- coding: utf-8 -*-
original_message="ÄÜ".decode('utf-8')
original_message=original_message.lower()
original_message=original_message.replace("ä", "x")
print(original_message)

Traceback (most recent call last):
  File "Untitled.py", line 4, in <module>
    original_message=original_message.replace("ä", "x")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Widthwise answered 24/2, 2013 at 14:43 Comment(3)

Are you using python 2 or 3? – Heinous 24/2, 2013 at 14:46

Python 2.7.2 the one shipped with OSX. – Widthwise 24/2, 2013 at 14:46

@Widthwise There's your problem. – Graphemics 24/2, 2013 at 14:46

You'll need to mark it as a unicode string unless you're working with plain ASCII:

> print(u"ÄÖÜAOU".lower())

äöüaou

It works the same way with variables: it all depends on the type of the value assigned to the variable to begin with.

> olle = "ÅÄÖABC"
> print(olle.lower())
ÅÄÖabc

> olle = u"ÅÄÖABC"
> print(olle.lower())
åäöabc
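For anyone who wants to see the two cases side by side, here is a minimal sketch that should run on both Python 2.7 and Python 3.3+. It assumes UTF-8 as the encoding, and the variable names are only illustrative. The encoded byte string stands in for what a plain "..." literal holds in Python 2.

```python
# -*- coding: utf-8 -*-
# Sketch only: assumes UTF-8; variable names are illustrative.
text = u"ÄÖÜAOU"                 # unicode text
raw = text.encode("utf-8")       # byte string, like a plain "..." literal in Python 2

lowered_raw = raw.lower()        # byte string: only the ASCII letters A, O, U change
lowered_text = text.lower()      # unicode text: the umlauts are lowercased too

# The umlauts survive untouched in the byte string version.
assert lowered_raw.decode("utf-8") == u"ÄÖÜaou"
assert lowered_text == u"äöüaou"
```

The byte string version leaves the umlauts alone because each of them is stored as two non-ASCII bytes, which lower() does not touch in the default locale.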

Shocking answered 24/2, 2013 at 14:47 Comment(12)

I have # -*- coding: utf-8 -*- in the first line, looks like it's the Python version as BlaXpirit suggests. – Widthwise 24/2, 2013 at 14:50

@Widthwise The above example was run on standard Python 2.7.2 on Mac OS X. Without marking as unicode, it will only convert ascii characters to lower case, with the u marker, it gives the correct output. – Shocking 24/2, 2013 at 14:51

So the tag in the beginning is not enough? – Widthwise 24/2, 2013 at 14:54

The tag just tells Python the encoding of the file. – Sharondasharos 24/2, 2013 at 14:58

@Widthwise Just as Matthias says, the coding metadata only helps Python to correctly detect the encoding of the file, it has nothing to do with ascii versus unicode strings at runtime. – Shocking 24/2, 2013 at 14:59

@Widthwise If original_message contains a unicode string, yes, it will work just fine. Added an example to the answer. – Shocking 24/2, 2013 at 15:7

Problem is the variable comes from a raw_input – Widthwise 24/2, 2013 at 15:10

It does work until the script hits a point, where it should replace characters. – Widthwise 24/2, 2013 at 15:14

@Widthwise If you're doing raw_input from stdin, you can get it as a unicode string using olle=raw_input().decode(sys.stdin.encoding) instead of just olle=raw_input(). – Shocking 24/2, 2013 at 15:15

As said, if I do that I get an error in the replace part of the script: File "KORO.py", line 46, in replace c("ä", "335") File "KORO.py", line 200, in c original_message=original_message.replace(letter, number) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – Widthwise 24/2, 2013 at 15:18

added an example to the question. – Widthwise 24/2, 2013 at 15:27

@Widthwise Regarding your addition, you need to replace using unicode strings too; original_message=original_message.replace(u"ä", u"x") works well. – Shocking 24/2, 2013 at 15:38

You are dealing with encoded strings, not with unicode text.

The .lower() method of byte strings can only deal with ASCII values. Decode your string to Unicode or use a unicode literal (u''), then lowercase:

>>> print u"\xc4AOU".lower()
äaou
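The second error in the question comes from the same mix-up: the message is unicode after decoding, but "ä" passed to replace() is still a byte string, so Python 2 tries to decode it as ASCII. A rough sketch of the decode-first approach, assuming the input is UTF-8 (the helper name lower_and_replace is made up for illustration):

```python
# -*- coding: utf-8 -*-
# Sketch only: lower_and_replace is a hypothetical helper, UTF-8 is assumed.
def lower_and_replace(raw, old, new, encoding="utf-8"):
    """Decode a byte string, lowercase it as text, then replace text with text."""
    text = raw.decode(encoding)
    return text.lower().replace(old, new)

# Stands in for bytes read from a file or from raw_input() in Python 2.
raw_message = u"ÄÜ".encode("utf-8")

# Both replace() arguments are unicode literals, matching the decoded text.
result = lower_and_replace(raw_message, u"ä", u"x")
assert result == u"xü"
```

The point is to decode once at the boundary and keep everything unicode afterwards, including the arguments to replace().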

Heinous answered 24/2, 2013 at 14:48 Comment(1)

@user2104634: you need to read the Python Unicode HOWTO; you decode the variable to a unicode value (variable.decode(encoding)). – Heinous 24/2, 2013 at 15:0

If you're using Python 2 but don't want to prefix u"" on all your strings, put this at the beginning of your program:

from __future__ import unicode_literals
olle = "ÅÄÖABC"
print(olle.lower())

This will now print:

åäöabc

The encoding declaration tells Python how to decode the characters in the source file read from disk, while the from __future__ import statement tells it how to interpret string literals within the program itself. You will probably need both.

Jargonize answered 24/2, 2013 at 16:0 Comment(1)

today, my suggestion would be -- use Python 3. unicode_literals doesn't work in enough places to be worth it. – Jargonize 8/11, 2018 at 18:14
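For comparison, a small sketch of the question's snippet as it might look on Python 3, where str is already Unicode text and source files default to UTF-8. This is an illustration of the suggestion above, not the original poster's code:

```python
# Python 3 only: str is Unicode text, so no decode() or u"" prefix is needed.
original_message = "ÄÜ"
original_message = original_message.lower()
original_message = original_message.replace("ä", "x")

assert original_message == "xü"
```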
