Warning raised by inserting 4-byte unicode to mysql
C

3

8

Look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

The string '\xF0\x9F\x91\x8A' is actually a 4-byte Unicode character: u'\U0001f62a'. MySQL's character set is utf8, but when a 4-byte character is inserted the string gets truncated. I searched for the problem and found that MySQL versions before 5.5.3 don't support 4-byte Unicode characters, and unfortunately mine is 5.5.224. I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte characters in Python. I tried to use a regular expression but failed. Any help?

Coffin answered 29/5, 2012 at 11:53 Comment(4)
That's the FISTED HAND SIGN color emoji: 👊... – Weihs
@MartijnPieters - unicodedata.name("\U0001f62a") says 'SLEEPY FACE' (which would be b'\xf0\x9f\x98\xaa' in utf-8), so something is not right here... – Laky
Actually, it is a sleepy face. I'm scraping pages from Sina Weibo (the Twitter of China), and I scraped such a SLEEPY FACE. – Coffin
Yup, and '\xF0\x9F\x91\x8A'.decode('utf8') is u'\U0001f44a', which is 'FISTED HAND SIGN' :-) – Weihs
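The mismatch discussed in these comments can be checked directly. A minimal sketch, assuming Python 3 (where bytes and text are separate types); the variable names are only illustrative:

```python
import unicodedata

raw = b'\xf0\x9f\x91\x8a'            # the bytes quoted in the MySQL warning
char = raw.decode('utf-8')           # a single character outside the BMP
codepoint = ord(char)
name = unicodedata.name(char)
utf8_length = len(char.encode('utf-8'))

# The character named in the question encodes to different bytes.
sleepy_bytes = u'\U0001f62a'.encode('utf-8')
```

Here `codepoint` is 0x1F44A and `utf8_length` is 4, which is why a 3-byte utf8 column rejects it.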
W
10

If MySQL cannot handle 4-byte UTF-8 sequences then you'll have to filter out all Unicode characters at or above codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
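For reuse, for instance inside an item pipeline, the substitution above can be wrapped in a small helper. This is a minimal sketch, assuming Python 3 or a wide (UCS-4) Python 2 build; the name strip_non_bmp and its replacement parameter are hypothetical and not part of the original answer.

```python
import re

# Matches every code point outside the Basic Multilingual Plane,
# i.e. everything that UTF-8 encodes in 4 bytes.
_NON_BMP = re.compile(u'[\U00010000-\U0010ffff]')


def strip_non_bmp(text, replacement=u''):
    """Return text with every code point at or above U+10000 replaced."""
    return _NON_BMP.sub(replacement, text)


cleaned = strip_non_bmp(u'Some example text with a sleepy face: \U0001f62a')
```

Passing a visible replacement such as u'?' instead of the empty string keeps a marker where a character was dropped.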

Alternatively, you could use the .translate() function with a mapping table that only contains None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.
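If you still prefer .translate(), the memory cost can be avoided with a table that answers lookups on demand instead of storing about a million entries. A minimal sketch, assuming Python 3, where str.translate() accepts any object with __getitem__ and treats a LookupError as "leave this character alone"; the class name DropNonBMP is hypothetical.

```python
class DropNonBMP(object):
    """Lazy translation table: nothing is precomputed or stored."""

    def __getitem__(self, codepoint):
        if codepoint >= 0x10000:
            return None               # None tells translate() to delete the character
        raise LookupError(codepoint)  # LookupError leaves the character unchanged


example = 'Some example text with a sleepy face: \U0001f62a'
result = example.translate(DropNonBMP())
```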

This all presumes you are using a UCS-4 (wide) build of Python. On a UCS-2 (narrow) build you can only use codepoints up to '\U0000ffff' in regular expressions, so the character class above will not compile; characters beyond that range are stored as UTF-16 surrogate pairs on such a build, and you would have to match them as pairs instead.
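Which kind of build you have can be read from sys.maxunicode, and the pattern chosen accordingly. The following is a sketch under that assumption, not a tested recipe for every interpreter; the names non_bmp and remove_non_bmp are illustrative. On Python 3.3 and later only the first branch is ever taken.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (and every Python from 3.3 on): one code point per character.
    non_bmp = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # Narrow build: a non-BMP character is stored as a high surrogate
    # followed by a low surrogate, so match the pair.
    non_bmp = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')


def remove_non_bmp(text):
    """Drop every character that UTF-8 would encode in 4 bytes."""
    return non_bmp.sub(u'', text)
```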

I note that as of MySQL 5.5.3 the newly-added utf8mb4 character set does support the full Unicode range.

Weihs answered 29/5, 2012 at 12:46 Comment(12)
I tried your code, but it doesn't work. It is \U(uppercase u). However, your thought is really enlightening, thanks! – Coffin
You are quite right; corrected to use proper 8-digit \U escape sequences. I had some trouble at first due to using a UCS2 compiled python :-P – Weihs
But in my case, it really matters whether it is lowercase or not. What I use is highpoints = re.compile(u'[\U00010000-\U0001ffff]') and it works. (Could it be a matter of the Python version? Mine is Python 2.7.) It seems that on my computer the uppercase \U supports a much bigger range. – Coffin
No, you are quite correct. I've just lowered the lower bound as well, realizing I misread the UTF-8 table. – Weihs
Any idea why I'm getting an error with: re.compile(u'[\U00010000-\U0010ffff]') "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 244, in _compile raise error, v # invalid expression sre_constants.error: bad character range – Fugacity
@MichaelWaterfall: You have a UCS2 (2-byte unicode) compiled Python; it'll only support unicode values up to \uffff. – Weihs
Ah okay, thanks! Out of interest then, why does print(u'\U0001f3b6') display the correct emoji character (which is 4-bytes)? – Fugacity
@MichaelWaterfall: I am not certain how this is all handled; a UCS2 build can print a 4-byte unicode string, but the regular expression module cannot handle it properly because the internal representation cannot handle it. – Weihs
@MichaelWaterfall: On a UCS-4 build you can compile the regular expression, on a UCS-2 build you cannot. Python 3.3 does away with the distinction, btw, so there is hope for the future. :-) – Weihs
@MichaelWaterfall: Do import sys; print sys.maxunicode. If you get 65535 it's a UCS-2 build, 1114111 for a wide UCS-4 build. – Weihs
@MichaelWaterfall: On a UCS-2 build, that character is actually stored as 2 code units, a UTF-16 surrogate pair; len(u'\U0001f3b6') == 2 on such a build. On a UCS-4 build, it's len(u'\U0001f3b6') == 1. – Weihs
@MartijnPieters Thank you for the great info. I'd actually just noticed that len(u'\U0001f3b6') == 2 and figured that was caused by the UCS-2 version of Python. Time to update! Thanks again. – Fugacity
D
2

I think you should use the utf8mb4 character set instead of utf8 and run

SET NAMES UTF8MB4

after connecting to the DB.
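An alternative to issuing the statement by hand is to request the character set when the connection is opened, so that the client library knows about it as well. A minimal sketch, assuming the MySQLdb driver (mysqlclient on Python 3) and a server at 5.5.3 or later; the function name and the credential parameters are placeholders, not anything from the original post.

```python
def open_utf8mb4_connection(host, user, password, database):
    """Open a MySQL connection that uses utf8mb4 from the start."""
    # Imported lazily so this sketch can be loaded without the driver installed.
    import MySQLdb

    return MySQLdb.connect(
        host=host,
        user=user,
        passwd=password,
        db=database,
        charset='utf8mb4',   # set at connection time, so the client library knows it too
        use_unicode=True,    # return text columns as unicode strings
    )
```

The target column presumably has to use utf8mb4 as well; a utf8mb4 connection alone does not change how an existing utf8 column stores data.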

Drawplate answered 2/7, 2014 at 7:15 Comment(1)
Setting the connection to use utf8mb4 is the best approach, but you shouldn't do it with SET NAMES. This command changes the connection setting at the server end without letting the client library know about the change, which means that anything in the client library using the C mysql_real_escape_string API can get bad results. This can result in SQL injection security holes if an East Asian multibyte encoding is one or both of the charsets involved. Character sets should be set at connection time; in python-mysql this would be done with the charset argument to connect(). – Inimitable
J
0

A simple normalization of the string without regex or translate:

def normalize_unicode(s):
    # Replace every code point at or above U+10000 with U+FFFD (REPLACEMENT CHARACTER).
    return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)
Joh answered 15/2, 2016 at 11:48 Comment(0)
