python: lower() german umlauts

I have a problem with converting uppercase letters with umlauts to lowercase ones.

print("ÄÖÜAOU".lower())

The A, O and U get converted properly, but the Ä, Ö and Ü stay uppercase. Any ideas?

The first problem is fixed with .decode('utf-8'), but I still have a second one:

# -*- coding: utf-8 -*-
original_message="ÄÜ".decode('utf-8')
original_message=original_message.lower()
original_message=original_message.replace("ä", "x")
print(original_message)

Traceback (most recent call last):
  File "Untitled.py", line 4, in <module>
    original_message=original_message.replace("ä", "x")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Widthwise answered 24/2, 2013 at 14:43 Comment(3)
Are you using Python 2 or 3? - Heinous
Python 2.7.2, the one shipped with OS X. - Widthwise
@Widthwise There's your problem. - Graphemics

You'll need to mark it as a unicode string unless you're working with plain ASCII:

> print(u"ÄÖÜAOU".lower())

äöüaou

It works the same way with variables; it all depends on the type of the value assigned to the variable to begin with.

> olle = "ÅÄÖABC"
> print(olle.lower())
ÅÄÖabc

> olle = u"ÅÄÖABC"
> print(olle.lower())
åäöabc
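
For readers who are on Python 3 (an assumption; the examples above were run on Python 2.7.2), the same "type it came in as" rule can be sketched by normalizing everything to text at the boundary before calling lower() or replace(). The helper name to_text below is hypothetical and not part of any library; it is only meant to illustrate the idea.

```python
# Minimal sketch, assuming Python 3, where str is Unicode text and
# bytes plays the role of Python 2's plain (encoded) str.

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as text, decoding byte strings first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Simulate encoded input, e.g. bytes read from a file or a socket.
raw = "ÄÜ".encode("utf-8")

# Decode once, then keep every later operation on text only.
message = to_text(raw).lower()
message = message.replace("ä", "x")  # both arguments are text as well
print(message)
```

The point of the design is that decoding happens in exactly one place, so lower() and replace() never see a mix of encoded and decoded values.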
Shocking answered 24/2, 2013 at 14:47 Comment(12)
I have # -*- coding: utf-8 -*- in the first line; looks like it's the Python version, as BlaXpirit suggested. - Widthwise
@Widthwise The above example was run on standard Python 2.7.2 on Mac OS X. Without the u marker, it only converts ASCII characters to lowercase; with the u marker, it gives the correct output. - Shocking
So the tag at the beginning is not enough? - Widthwise
The tag just tells Python the encoding of the file. - Sharondasharos
@Widthwise Just as Matthias says, the coding metadata only helps Python correctly detect the encoding of the file; it has nothing to do with ASCII versus unicode strings at runtime. - Shocking
@Widthwise If original_message contains a unicode string, yes, it will work just fine. Added an example to the answer. - Shocking
The problem is that the variable comes from a raw_input. - Widthwise
It does work until the script hits a point where it should replace characters. - Widthwise
@Widthwise If you're doing raw_input from stdin, you can get it as a unicode string using olle=raw_input().decode(sys.stdin.encoding) instead of just olle=raw_input(). - Shocking
As said, if I do that I get an error in the replace part of the script: File "KORO.py", line 46, in replace c("ä", "335") File "KORO.py", line 200, in c original_message=original_message.replace(letter, number) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) - Widthwise
Added an example to the question. - Widthwise
@Widthwise Regarding your addition, you need to replace using unicode strings too: original_message=original_message.replace(u"ä", u"x") works well. - Shocking

You are dealing with encoded strings, not with unicode text.

The .lower() method of byte strings can only deal with ASCII values. Decode your string to Unicode or use a unicode literal (u''), then lowercase:

>>> print u"\xc4AOU".lower()
äaou
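
The ASCII-only behaviour of byte strings can still be observed on Python 3, where the bytes type stands in for Python 2's encoded str. The following is a small sketch under that assumption, not the code from the answer; the variable names are made up for illustration.

```python
# Minimal sketch, assuming Python 3: bytes.lower() only touches ASCII
# letters, while str.lower() applies Unicode case mapping.

raw = "ÄÖÜAOU".encode("utf-8")            # encoded byte string

lowered_bytes = raw.lower()                # umlaut bytes are left unchanged
lowered_text = raw.decode("utf-8").lower() # decode first, then lowercase

print(lowered_bytes.decode("utf-8"))  # ÄÖÜaou
print(lowered_text)                   # äöüaou
```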
Heinous answered 24/2, 2013 at 14:48 Comment(1)
@user2104634: You need to read the Python Unicode HOWTO; you decode the variable to a unicode value (variable.decode(encoding)). - Heinous

If you're using Python 2 but don't want to prefix all your strings with u"", put this at the beginning of your program:

from __future__ import unicode_literals
olle = "ÅÄÖABC"
print(olle.lower())

This will now print:

åäöabc

The encoding declaration specifies how to interpret the bytes of the source file read from disk, while the from __future__ import statement tells Python how to treat string literals within the program itself. You will probably need both.

Jargonize answered 24/2, 2013 at 16:0 Comment(1)
Today, my suggestion would be: use Python 3. unicode_literals doesn't work in enough places to be worth it. - Jargonize
