Test whether a string is Unicode, which UTF encoding it uses, and get its length in bytes?

Asked 21/8, 2012 at 10:37 Answered 21/8, 2012 at 11:10

Solved python string unicode utf-8 python-2.5

I need to test whether a string is Unicode, and then whether it is UTF-8. After that, I need the string's length in bytes, including the BOM if it has one. How can this be done in Python?

Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.

Later edit: pprint does that pretty well.

Resound answered 21/8, 2012 at 10:37 Comment(7)

What encodings are you expecting the string to be in? – Nunhood 21/8, 2012 at 10:53

I need them to be UTF8 and ASCII! – Resound 21/8, 2012 at 11:1

If the string is ASCII, then it is also in UTF-8. What are you actually trying to do here? – Nunhood 21/8, 2012 at 11:2

I get the relative path of a file in a zip with the zipfile library, and I need to check whether it conforms to this standard: w3.org/TR/widgets/#zip-relative-paths – Resound 21/8, 2012 at 11:11

In that case, all you need is to test whether it is UTF-8. – Nunhood 21/8, 2012 at 11:15

@Nunhood And if it has only ASCII characters? How can I tell? – Resound 21/8, 2012 at 11:29

You can write string.decode('ascii'), but there's not much point, as ASCII is valid UTF-8. – Nunhood 21/8, 2012 at 11:31

try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"

In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
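For comparison, here is a minimal sketch of the same check on Python 3, where the input is a bytes object. The helper name describe_bytes and the keys of the dictionary it returns are made up for illustration; only bytes.decode, bytes.startswith and codecs.BOM_UTF8 come from the standard library. It also shows what the "byte list" of a UTF-8 string looks like.

```python
import codecs


def describe_bytes(data):
    """Hypothetical helper: report whether `data` (bytes) is valid UTF-8,
    whether it is pure ASCII, whether it starts with a UTF-8 BOM, and its
    length in bytes (BOM included, since the BOM is part of the data)."""
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return {
        "utf8": is_utf8,
        # Every ASCII byte is below 0x80, and ASCII is a subset of UTF-8.
        "ascii": is_utf8 and all(b < 0x80 for b in data),
        "bom": data.startswith(codecs.BOM_UTF8),
        "length": len(data),
    }


encoded = "\u00e9".encode("utf-8")   # the character U+00E9
print(list(encoded))                 # [195, 169], i.e. 0xC3 0xA9
print(describe_bytes(encoded))
print(describe_bytes(b"abc"))
print(describe_bytes(b"\xff\xfe"))   # 0xFF can never appear in UTF-8
```

A successful decode only proves that the bytes are valid UTF-8, not that UTF-8 was the encoding the author intended; many short byte strings are valid in several encodings at once.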

Nunhood answered 21/8, 2012 at 10:44 Comment(5)

I also want to see whether the string is ASCII or Unicode. Doesn't your code ignore the possibility that the string is in another UTF encoding? – Resound 21/8, 2012 at 10:48

@EduardFlorinescu for other encodings, pass another encoding to string.decode. – Nunhood 21/8, 2012 at 10:52

I get this error on string.decode('utf-8'): UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128) – Resound 21/8, 2012 at 11:34

@EduardFlorinescu in that case string is already a unicode object, so it's a sequence of characters, not bytes. You can check how many bytes its UTF-8 representation uses with len(string.encode('utf-8')). – Nunhood 21/8, 2012 at 11:40

It seems that a zipfile library ZipInfo object has a hidden field, orig_filename, in addition to filename (which is already unicode); it holds the filename in its original encoding, in my case UTF-8. – Resound 21/8, 2012 at 11:57

To Check if Unicode

>>> a = u'F'
>>> isinstance(a, unicode)
True

To Check if it is UTF-8 or ASCII

>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'

Tobytobye answered 21/8, 2012 at 11:10 Comment(1)

With isinstance I get a lot of these:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

and with the second if I put in place of ('AA') I get IndexError: tuple index out of range – Resound 21/8, 2012 at 11:28

I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.

For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:

print len(my_unicode_string.encode('utf-8'))
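If the length should include a BOM, one option is the standard 'utf-8-sig' codec, which prepends the three-byte UTF-8 BOM when encoding. The sketch below is only an illustration: the sample text is made up, and the print calls are written with parentheses so the snippet should behave the same on Python 2 and Python 3.

```python
import codecs

text = u"caf\xe9"                      # 4 characters; the last is U+00E9

plain = text.encode("utf-8")           # no BOM
with_bom = text.encode("utf-8-sig")    # same bytes, preceded by the BOM

print(len(text))                       # 4 characters
print(len(plain))                      # 5 bytes: U+00E9 takes two bytes
print(len(with_bom))                   # 8 bytes: 3 for the BOM plus 5
print(with_bom.startswith(codecs.BOM_UTF8))
```

Plain 'utf-8' never adds a BOM on encoding, so a BOM only counts toward the length if the original bytes already contained one or if 'utf-8-sig' is used explicitly.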

Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.

Basidiomycete answered 21/8, 2012 at 10:44 Comment(0)
