Python, Windows, Ansi - encoding, again
Asked Answered
D

3

16

Hello there,

even if i really tried... im stuck and somewhat desperate when it comes to Python, Windows, Ansi and character encoding. I need help, seriously... searching the web for the last few hours wasn't any help, it just drives me crazy.

I'm new to Python, so i have almost no clue what's going on. I'm about to learn the language, so my first program, which ist almost done, should automatically generate music-playlists from a given folder containing mp3s. That works just fine, besides one single problem...

...i can't write Umlaute (äöü) to the playlist-file.

After i found a solution for "wrong-encoded" Data in the sys.argv i was able to deal with that. When reading Metadata from the MP3s, i'm using some sort of simple character substitution to get rid of all those international special chars, like french accents or this crazy skandinavian "o" with a slash in it (i don't even know how to type it...). All fine.

But i'd like to write at least the mentioned Umlaute to the playlist-file, those characters are really common here in Germany. And unlike the Metadata, where i don't care about some missing characters or miss-spelled words, this is relevant - because now i'm writing the paths to the files.

I've tried so many various encoding and decoding methods, i can't list them all here.. heck, i'm not even able to tell which settings i tried half an hour ago. I found code online, here, and elsewhere, that seemed to work for some purposes. Not for mine.

I think the tricky part is this: it seems like the Problem is the Ansi called format of the files i need to write. Correct - i actually need this Ansi-stuff. About two hours ago i actually managed to write whatever i'd like to an UFT-8 file. Works like charm... until i realized that my Player (Winamp, old Version) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the Path, even if it looks right in my editor.

If i change the file format back to Ansi, Paths containing special chars get corrupted. I'm just guessing, but if Winamp reads this UTF-8 files as Ansi, that would cause the Problem i'm experiencing right now.

So...

  1. I DO have to write äöü in a path, or it will not work
  2. It DOES have to be an ANSI-"encoded" file, or it will not work
  3. Things like line.write(str.decode('utf-8')) break the funktion of the file
  4. A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)
  5. Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...

Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If i need 500 lines of Code for another functions or classes, i'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.

Thank you for reading, thanks for any comment,

greets!

Doone answered 29/12, 2012 at 6:36 Comment(8)
You need to show the code you are using, along with sample data, or there's no way to tell what you're doing wrong. You also need to explain your problem more clearly. You say you're having a lot of problems but you don't say specifically what errors you get, what behavior you get that you're trying to change, etc.Nashbar
Reduce the length of the question otherwise not many will read through all of that.Diplocardiac
Which version of Windows are you on? If memory serves, not all versions of Windows support non-ASCII characters in paths. Additionally, as you've found, some software won't play nice with non-ASCII paths, even if the OS is compatible.Serranid
@sr2222: Windows has always supported 8-bit path names (which covers codepages); full Unicode pathnames are supported from XP onwards.Joshia
@BrenBarn: Any string containing an Umlaut will do. I have only one problem. I can't post neither every sample of code i tried nor the corresponding error messages - i'm not willing to write a book here.Doone
@AshRj: ...if i could somehow represent the whole problem in a single question, i would have done this. Sorry. Just asking "how to write umlaut to file" would sure give me plenty of answers - but none of the usual approaches works out for me... just one example, i can't use an UTF-8 file...Doone
It was just a suggestion. The likelihood of people reading your question is more if its short and concise with a code sample included.Diplocardiac
@AshRj: ...that's what i usually tell people, and of course you 're right. Therefore i hate it even more to get in some trouble without being able to express myself the way i'd like to. Oh, the irony...Doone
C
30

As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings, see if you can apply those to your specific case!

Unicode and Encoding

Here's a small primer about encoding. Basically, there are two ways to represent text in Python:

  • unicode. You can consider that unicode is the ultimate encoding, you should strive to use it everywhere. In Python 2.x source files, unicode strings look like u'some unicode'.
  • str. This is encoded text - to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.

This changed in Python 3 (unicode is now str and str is now bytes).

How does that play out?

Usually, it's pretty straightforward to ensure that you code uses unicode for its execution, and uses str for I/O:

  • Everything you receive is encoded, so you do input_string.decode('encoding') to convert it to unicode.
  • Everything you need to output is unicode but needs to be encoded, so you do output_string.encode('encoding').

The most common encodings are cp-1252 on Windows (on US or EU systems), and utf-8 on Linux.

Applying this to your case

I DO have to write äöü in a path, or it will not work

Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.

It DOES have to be an ANSI-"encoded" file, or it will not work

When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).

Things like line.write(str.decode('utf-8')) break the funktion of the file

By now you probably realized that:

  • If str as indeed an str instance, Python will try to convert it to unicode using the utf-8 encoding, but then try to encode it again (likely in ascii) to write it to the file
  • If str is actually an unicode instance, Python will first encode it (likely in ascii, and that will probably crash) to then be able to decode it.

Bottom line is, you need to know if str is unicode, you should encode it. If it's already encoded, don't touch it (or decode it then encode it if the encoding is not the one you want!).

A magical comment at the beginning of the script like # -- coding: iso-8859-1 -- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)

Not a surprise, this only tells Python what encoding should be used to read your source file so that non-ascii characters are properly recognized.

Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...

Python 3 probably is a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!

Will that solve your issue?

You can't be sure, it's possible that the problem lies in the player you're using, not in your code.

Once you output it, you should make sure that your script's output is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.

Cowden answered 29/12, 2012 at 11:58 Comment(2)
THANK YOU! I guess i'm going to print that and use it as reference, this is the best explanation i found until now. AND: it works! :-) ...the output.encode('cp1252') was exactly what i needed. Works perfect! I don't really understand why i got trouble using iso-8859... but on my Win7, cp1252 does the trick. If i can avoid it, i'm not going to dive deeper in this encoding-stuff. Also, it could be that a newer version of my player could handle this differences better, but i don't like bloatware and i'm somehow stuck with this player... it's great to got this working. Thanks a lot!Doone
...btw: all native english speaking nations could really be thankful for not having such a mess of characters in their alphabet. That's some advantage worth mentioning.Doone
J
6

On Windows there is special encoding available called mbcs, it converts between current default ANSI codepage and UNICODE. For example on a Spanish Language PC:

u'ñ'.encode('mbcs') -> '\xf1'
'\xf1'.decode('mbcs') -> u'ñ'

On Windows ANSI means current default multi-byte code page. For western European languages Windows ISO-8859-1, for eastern European languages windows ISO-8859-2) encoded byte string and other encodings for other languages as appropriate.

More info available at:

https://docs.python.org/2.4/lib/standard-encodings.html

See also:

https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding

Jea answered 11/5, 2016 at 16:34 Comment(1)
it worked for Turkish chars in a file encoded with ANSI. Thanks.Trever
H
0

# -*- coding comments declare the character encoding of the source code (and therefore of byte-string literals like 'abc').

Assuming that by "playlist" you mean m3u files, then based on this specification you may be at the mercy of the mp3 player software you are using. This spec says only that the files contain text, no mention of what character encoding.

I have personally observed that various mp3 encoding software will use different encodings for mp3 metadata. Some use UTF-8, others ISO-8859-1. So you may have to allow encoding to be specified in configuration and leave it at that.

Hockett answered 29/12, 2012 at 7:11 Comment(1)
He's asking about writing paths and file names, not metadata.Serranid

© 2022 - 2024 — McMap. All rights reserved.