Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually a 4-byte Unicode character: u'\U0001f62a'. MySQL's character set is utf-8, but when a 4-byte Unicode character is inserted, the inserted string gets truncated.
I googled the problem and found that MySQL versions below 5.5.3 don't support 4-byte Unicode, and unfortunately mine is 5.5.224.
I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte Unicode characters in Python. I tried to use a regular expression but failed.
So, any help?
... – Weihs
unicodedata.name("\U0001f62a") says 'SLEEPY FACE' (which would be b'\xf0\x9f\x98\xaa' in utf-8), so something is not right here... – Laky
Sina Weibo (twitter in China), and I scraped such a SLEEPY FACE. – Coffin
'\xF0\x9F\x91\x8A'.decode('utf8') is u'\U0001f44a', which is 'FISTED HAND SIGN' :-) – Weihs
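For reference, here is a minimal sketch of both points raised in this thread: checking which character the bytes in the warning really are, and filtering out anything that needs 4 bytes in UTF-8. It assumes Python 3, where a string is a sequence of full code points. The question appears to use Python 2, where a narrow build stores such characters as surrogate pairs, so the same pattern may not behave the same there. The names NON_BMP and strip_non_bmp are made up for illustration and are not part of any library.

```python
import re
import unicodedata

# Bytes copied from the MySQL warning in the question.
raw = b'\xF0\x9F\x91\x8A'
char = raw.decode('utf-8')

print(hex(ord(char)))          # code point of the decoded character
print(unicodedata.name(char))  # its Unicode character name

# Any code point above U+FFFF lies outside the Basic Multilingual Plane
# and takes 4 bytes in UTF-8, which is what the old MySQL utf8 charset rejects.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def strip_non_bmp(text, replacement=''):
    """Hypothetical helper: remove (or replace) characters above U+FFFF."""
    return NON_BMP.sub(replacement, text)


cleaned = strip_non_bmp('hello \U0001F44A\U0001F62A world')
print(repr(cleaned))
```

In the pipeline from the question, a helper like this could be applied to item['content'] before calling cursor.execute. Passing a replacement such as u'\ufffd' instead of the empty string keeps a visible marker where a character was dropped.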