Test whether a string is Unicode, which UTF encoding it uses, and get its length in bytes?

I need to test whether a string is Unicode, and then whether it is UTF-8. After that, I need the string's length in bytes, including the BOM if it uses one. How can this be done in Python?

Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.

Later edit: pprint does that pretty well.

Resound answered 21/8, 2012 at 10:37 Comment(7)
What encodings are you expecting the string to be in? - Nunhood
I need them to be UTF-8 and ASCII! - Resound
If the string is ASCII, then it is also in UTF-8. What are you actually trying to do here? - Nunhood
I get the relative path of a file in a zip with the zipfile library and I need to check whether it conforms to this standard: w3.org/TR/widgets/#zip-relative-paths - Resound
In that case, all you need is to test whether it is UTF-8. - Nunhood
@Nunhood and if it has only ASCII characters? How can I know that? - Resound
You can write string.decode('ascii'), but there's not much point, as ASCII is valid UTF-8. - Nunhood
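try:
    # decode() raises UnicodeError if the bytes are not valid UTF-8
    string.decode('utf-8')
    # in Python 2, len() of a str counts bytes
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"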

In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
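
As a rough illustration of the same round trip in Python 3 syntax (an assumption, since the question is tagged python-2.5), the sketch below encodes the character from the example above and lists its bytes, which is one way to see what a UTF-8 string looks like as a byte list. The variable names are made up for illustration.

```python
# Python 3 sketch: str holds characters, bytes holds the encoded form.
text = "\u00e9"                      # the single character U+00E9
encoded = text.encode("utf-8")       # bytes object: b'\xc3\xa9'

byte_list = list(encoded)            # one integer per byte
hex_list = [hex(b) for b in encoded]

print(len(text))      # 1 character
print(len(encoded))   # 2 bytes
print(byte_list)      # [195, 169]
print(hex_list)       # ['0xc3', '0xa9']

# Decoding the bytes gives back the original character string.
assert encoded.decode("utf-8") == text
```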

Nunhood answered 21/8, 2012 at 10:44 Comment(5)
I also want to see whether the string is ASCII or Unicode. Doesn't your code ignore the possibility that the string is in another UTF encoding? - Resound
@EduardFlorinescu for other encodings, pass another encoding to string.decode. - Nunhood
I get this error on string.decode('utf-8'): UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128) - Resound
@EduardFlorinescu in that case string is already a unicode object, so it's a sequence of characters, not bytes. You can check how many bytes its UTF-8 representation uses with len(string.encode('utf-8')). - Nunhood
It seems that a zipfile library ZipInfo object has a hidden field, orig_filename, in addition to filename (which is already unicode), and it contains the filename in its original encoding, in my case UTF-8. - Resound

To Check if Unicode

>>> a = u'F'
>>> isinstance(a, unicode)
True

To Check if it is UTF-8 or ASCII

>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
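
If installing chardet is not an option, a strict decode attempt can answer the narrower question from the comments (ASCII, UTF-8, or neither). This is a minimal Python 3 sketch, not the method used above; `classify_bytes` is a hypothetical helper name, not part of any library, and it only validates the bytes rather than guessing among other encodings.

```python
def classify_bytes(data):
    """Return 'ascii', 'utf-8', or None for a bytes object.

    Hypothetical helper: it validates, it does not detect other encodings.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None


print(classify_bytes(b"AA"))          # ascii
print(classify_bytes(b"\xc3\xa9"))    # utf-8
print(classify_bytes(b"\xff\xfe"))    # None
```

Because every ASCII byte string is also valid UTF-8, the ASCII check has to run first for the distinction to mean anything.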
Tobytobye answered 21/8, 2012 at 11:10 Comment(1)
With isinstance I get a lot of this: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. With the second, if I replace ('AA'), I get IndexError: tuple index out of range - Resound

I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.

For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:

print len(my_unicode_string.encode('utf-8'))

Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.
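
For comparison, here is roughly how the same measurement could look in Python 3, including the BOM case the question asks about. It is a sketch under assumptions: `utf8_byte_length` and `starts_with_utf8_bom` are made-up helper names, and whether a BOM should be counted depends on how the data is actually stored.

```python
import codecs


def utf8_byte_length(text, with_bom=False):
    """Byte length of text encoded as UTF-8, optionally counting a BOM."""
    # The 'utf-8-sig' codec prepends the 3-byte BOM (EF BB BF) when encoding.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    return len(text.encode(encoding))


def starts_with_utf8_bom(data):
    """True if a bytes object begins with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


print(utf8_byte_length("\u00e9"))                  # 2
print(utf8_byte_length("\u00e9", with_bom=True))   # 5
print(starts_with_utf8_bom(b"\xef\xbb\xbfabc"))    # True
```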

Basidiomycete answered 21/8, 2012 at 10:44 Comment(0)
