Python & MySql: Unicode and Encoding

I am parsing JSON data and trying to store some of it in a MySQL database. I am currently getting the Unicode error shown below. My question is how I should handle this.

  • Should I handle it on the database side, and if so, how can I modify my table to do so?
  • Should I handle it on the Python side?

Here is my table structure

CREATE TABLE yahoo_questions (
   question_id varchar(40) NOT NULL, 
   question_subj varbinary(255), 
   question_content varbinary(255),
   question_userId varchar(40) NOT NULL,
   question_timestamp varchar(40),
   category_id varbinary(20) NOT NULL,
   category_name varchar(40) NOT NULL,
   choosen_answer varbinary(255),
   choosen_userId varchar(40),
   choosen_usernick varchar(40),
   choosen_ans_timestamp varchar(40),
   UNIQUE (question_id)
);

Error while inserting via Python code:

Traceback (most recent call last):
  File "YahooQueryData.py", line 78, in <module>
    +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/cursors.py", line 159, in execute
    query = query % db.literal(args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 264, in literal
    return self.escape(o, self.encoders)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 202, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 204-230: ordinal not in range(256)

Python Code segment:

    # Push the user id into the URL to get the full JSON stack
    urlobject = urllib.urlopen(base_url.format(row[2]))
    qnadatajson = urlobject.read()
    data = json.loads(qnadatajson)
    cur.execute("INSERT INTO yahoo_questions (question_id, question_subj, question_content, question_userId, question_timestamp,"
                +"category_id, category_name, choosen_answer, choosen_userId, choosen_usernick, choosen_ans_timestamp)"
                +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))

JSON structure:

questions: [
{
Id: "20111201185322AA5HTDc",
Subject: "what are the new pokemon call?",
Content: "I used to know them I stop at dialga and palkia version and I heard there's new ones what's it call",
Date: "2011-12-01 18:53:22",
Timestamp: "1322794402",

Prior to running the query, I also executed the following on MySQL: SET character_set_client = utf8

And this is how the MySQL variables look:

mysql> SHOW variables LIKE '%character_set%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | latin1                                                 |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | latin1                                                 |
| character_set_system     | utf8                                                   |
| character_sets_dir       | /usr/local/mysql-5.5.10-osx10.6-x86_64/share/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)
Champignon answered 3/12, 2011 at 5:7 Comment(2)
The usual "get MySQL to accept utf8" incantation is to make sure that SET NAMES 'utf8' is your very first query when opening the connection. Have you tried that instead of character_set_client? (This is just a comment because I don't think that's the problem here, based on your variable output. I think your MySQLdb Python library is trying to encode to latin-1 and doesn't know it should be encoding to utf-8. Check the documentation and see if there is some way to inform it of this fact.) - Thaumaturge
Your problem is with the %s: you're trying to convert unicode data into a string. I have the same problem, just not sure how to fix it... - Sooty

I think that your MySQLdb Python library doesn't know it's supposed to encode to utf8, and is instead encoding with the connection's default charset, latin1 (the 'latin-1' codec named in your traceback).

When you connect() to your database, pass the charset='utf8' parameter. This should also make a manual SET NAMES or SET character_set_client unnecessary.
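
To see why the charset matters, here is a minimal sketch in plain Python 3 with no database involved. The sample string and the try_encode helper are made up for illustration. Characters above U+00FF cannot be encoded with the latin-1 codec, which is the step that fails inside MySQLdb in the traceback, while utf-8 can represent them.

```python
# Hypothetical sample text; U+012C is outside Latin-1's 256 code points.
text = "Pok\u00e9mon \u012c"


def try_encode(value, codec):
    """Return the encoded bytes, or None if the codec cannot represent value."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


latin1_result = try_encode(text, "latin-1")  # fails because of U+012C
utf8_result = try_encode(text, "utf-8")      # succeeds

print(latin1_result)
print(utf8_result)
```

Text that stays within Latin-1 (for example "Pok\u00e9mon" alone) encodes fine either way, which may be why an insert loop only fails on some rows.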

Thaumaturge answered 3/12, 2011 at 5:34 Comment(9)
con = mdb.connect('localhost', 'XXXX', 'XXXX', 'XXXX', unix_socket='/tmp/mysql.sock', charset='utf8') Is this what you meant? - Champignon
That is what I meant. However, I'm now not sure from the documentation whether you also need SET NAMES utf8. I would try both. - Thaumaturge
Now I am getting a strange JSON error after some 300 records, but this was a great help. - Champignon
Which one? charset by itself, or did you need SET NAMES too? Let me know so I can update the answer. (And accept/check the answer if it worked.) - Thaumaturge
What I have is exactly what I put in the first comment: charset='utf8', so charset worked. - Champignon
But did you remove your SET NAMES utf8 and/or SET character_set_client=utf8 query as well, or not? - Thaumaturge
Good question. No, I didn't remove SET character_set_client=utf8, and I didn't run SET NAMES utf8. - Champignon
After 4 hours of debugging/googling, you saved me! In my case, I was missing charset='utf8mb4' when connecting to the db. - Pliam
mysql_connection = MySQLdb.connect(host, username, password, mysql_database_name, port, charset='utf8') Thank you @FrancisAvila, this solved my problem! - Protonema

First, make sure you are assigning the charset and use_unicode parameters when making your MySQL connection:

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)

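Secondly, use parameterized queries (often loosely called prepared statements) when actually querying the database, passing the values separately instead of formatting them into the SQL string yourself. Below is an example INSERT query of a string containing a unicode character.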

cursor.execute('INSERT INTO mytable VALUES (null, %s)',                  
               ('Some string that contains unicode: ' + unichr(300),))
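
For a version that can be run without a MySQL server, here is a rough sketch of the same parameterized insert in Python 3, using the standard library's sqlite3 module as a stand-in. This is an assumption for illustration only: sqlite3 uses ? placeholders where MySQLdb uses %s, the two-column mytable layout is invented, and chr(300) is Python 3's equivalent of unichr(300).

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, body TEXT)")

value = "Some string that contains unicode: " + chr(300)

# The value travels separately from the SQL text; the driver handles quoting.
cur.execute("INSERT INTO mytable VALUES (null, ?)", (value,))
conn.commit()

cur.execute("SELECT body FROM mytable")
stored = cur.fetchone()[0]
print(stored == value)
```

The point carried over to MySQL is the shape of the call: SQL with placeholders as the first argument and a tuple of values as the second.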
Dube answered 16/1, 2012 at 1:32 Comment(1)
None of the other answers mention the prepared statement issue; that was the trick for me. Related: to escape a percent sign in a query that also uses %s placeholders (for example, queries with LIKE or date formats), use a double percent sign, %%. - Blynn
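
The %% rule follows from the line `query = query % db.literal(args)` visible in the question's traceback: the query goes through ordinary Python % formatting. A small sketch of that step in plain Python, where the already-quoted argument and the 'Video Games' value are hypothetical stand-ins for what the driver would produce:

```python
# Stand-in for the driver's escaped, quoted parameters (hypothetical value).
escaped_args = ("'Video Games'",)

# %% survives formatting as a literal %, while %s is filled from the arguments.
query = ("SELECT question_id FROM yahoo_questions "
         "WHERE question_subj LIKE 'what%%' AND category_name = %s")

rendered = query % escaped_args
print(rendered)
```

This only illustrates the string formatting; in real code the values should still be passed to cursor.execute as parameters rather than formatted by hand.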

If you are still facing the same issue, try downgrading your mysql-connector-python version; this worked for me.

Change mysql-connector-python==8.0.30 to mysql-connector-python==8.0.28.

Run these commands:

pip uninstall mysql-connector-python==8.0.30

pip install mysql-connector-python==8.0.28
Fictionalize answered 4/8, 2022 at 14:49 Comment(0)