how to use `charset` and `encoding` in `create_engine` of SQLAlchemy (to create pandas dataframe)?
Asked Answered
H

6

29

I am very confused with the way charset and encoding work in SQLAlchemy. I understand (and have read) the difference between charsets and encodings, and I have a good picture of the history of encodings.

I have a table in MySQL in latin1_swedish_ci (Why? Possible because of this). I need to create a pandas dataframe in which I get the proper characters (and not weird symbols). Initially, this was in the code:

connect_engine = create_engine('mysql://user:[email protected]/db')
sql_query = "select * from table1"
df = pandas.read_sql(sql_query, connect_engine)

We started having troubles with the Š character (corresponding to the u'\u0160' unicode, but instead we get '\x8a'). I expected this to work:

connect_engine = create_engine('mysql://user:[email protected]/db', encoding='utf8') 

but, I continue getting '\x8a', which, I realized, makes sense given that the default of the encoding parameter is utf8. So, then, I tried encoding='latin1' to tackle the problem:

connect_engine = create_engine('mysql://user:[email protected]/db', encoding='latin1')

but, I still get the same '\x8a'. To be clear, in both cases (encoding='utf8' and encoding='latin1'), I can do mystring.decode('latin1') but not mystring.decode('utf8').

And then, I rediscovered the charset parameter in the connection string, i.e. 'mysql://user:[email protected]/db?charset=latin1'. And after trying all possible combinations of charset and encoding, I found that this one work:

connect_engine = create_engine('mysql://user:[email protected]/db?charset=utf8')

I would appreciate if somebody can explain me how to correctly use the charset in the connection string, and the encoding parameter in the create_engine?

Horehound answered 24/7, 2017 at 11:48 Comment(0)
A
11

encoding is the codec used for encoding/decoding within SQLAlchemy. From the documentation:

For those scenarios where the DBAPI is detected as not supporting a Python unicode object, this encoding is used to determine the source/destination encoding. It is not used for those cases where the DBAPI handles unicode directly.

[...]

To properly configure a system to accommodate Python unicode objects, the DBAPI should be configured to handle unicode to the greatest degree as is appropriate [...]

mysql-python handles unicode directly, so there's no need to use this setting.

charset is a setting specific to the mysql-python driver. From the documentation:

This charset is the client character set for the connection.

This setting controls three variables on the server, specifically character_set_results, which is what you are interested in. When set, strings are returned as unicode objects.

Note that this applies only if you have latin1 encoded data in the database. If you've stored utf-8 bytes as latin1, you may have better luck using encoding instead.

Albrecht answered 24/7, 2017 at 18:14 Comment(2)
thanks @univerio, I still get a bit lost because there is several names and I am not were to place them: database -> python driver -> mysql-python interface (DBAPI?) -> SQLALchemy. Is the driver the same as the interface and the DBAPI?Horehound
@Horehound The driver is the same thing as the DBAPI. (Well technically DBAPI is the interface defined by PEP 249, and the driver is the library that implements this interface, but there's no need to be this pedantic.) So it goes database -> driver/DBAPI -> SQLAlchemy.Albrecht
S
28

encoding parameter does not work correctly.

So, as @doru said in this link, you should add ?charset=utf8mb4 at the end of the connection string. like this:

connect_string = "mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8mb4".format(
    DB_USER, DB_PASS, DB_HOST, DB_PORT, DATABASE
)
Stucker answered 18/1, 2018 at 7:38 Comment(2)
This saved me in an unexpected way. Is there some other know issue where the driver/DBAPI picks latin1 and ignores server-side settings? I have this set on the server, not sure why I need to set it on the connector too to fix things.Pushball
thanks, this helped in resolving this issue issue - 'latin-1' codec can't encode character '\U0001f600' in position 70: ordinal not in range(256)Gaye
L
12

I had the same problem. I just added ?charset=utf8mb4 at the end of the url.

Here is mine:

Before

SQL_ENGINE = sqlalchemy.create_engine(
    "mysql+pymysql://"
    + MySQL.USER
    + ":"
    + MySQL.PASSWORD
    + "@"
    + MySQL.HOST
    + ":"
    + str(MySQL.PORT)
    + "/"
    + MySQL.DB_NAME
)

After

sqlalchemy.create_engine(
    "mysql+pymysql://"
    + MySQL.USER
    + ":"
    + MySQL.PASSWORD
    + "@"
    + MySQL.HOST
    + ":"
    + str(MySQL.PORT)
    + "/"
    + MySQL.DB_NAME
    + "?charset=utf8mb4"  # This line was added
)
Lusatia answered 22/3, 2018 at 10:51 Comment(1)
Answer seems right but I needed a backslash before the ?charset, i.e. "/?charset=utf8mb4"Tuber
A
11

encoding is the codec used for encoding/decoding within SQLAlchemy. From the documentation:

For those scenarios where the DBAPI is detected as not supporting a Python unicode object, this encoding is used to determine the source/destination encoding. It is not used for those cases where the DBAPI handles unicode directly.

[...]

To properly configure a system to accommodate Python unicode objects, the DBAPI should be configured to handle unicode to the greatest degree as is appropriate [...]

mysql-python handles unicode directly, so there's no need to use this setting.

charset is a setting specific to the mysql-python driver. From the documentation:

This charset is the client character set for the connection.

This setting controls three variables on the server, specifically character_set_results, which is what you are interested in. When set, strings are returned as unicode objects.

Note that this applies only if you have latin1 encoded data in the database. If you've stored utf-8 bytes as latin1, you may have better luck using encoding instead.

Albrecht answered 24/7, 2017 at 18:14 Comment(2)
thanks @univerio, I still get a bit lost because there is several names and I am not were to place them: database -> python driver -> mysql-python interface (DBAPI?) -> SQLALchemy. Is the driver the same as the interface and the DBAPI?Horehound
@Horehound The driver is the same thing as the DBAPI. (Well technically DBAPI is the interface defined by PEP 249, and the driver is the library that implements this interface, but there's no need to be this pedantic.) So it goes database -> driver/DBAPI -> SQLAlchemy.Albrecht
C
7

This works for me .

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

db_url = {
    'database': "dbname",
    'drivername': 'mysql',
    'username': 'myname',
    'password': 'mypassword',
    'host': '127.0.0.1',
    'query': {'charset': 'utf8'},  # the key-point setting
}

engine = create_engine(URL(**db_url), encoding="utf8")
Coccyx answered 8/3, 2018 at 9:21 Comment(0)
A
0

To save someone else the 3 days it took a colleague and me to get to the root of this problem, the accepted answer helped me when debugging a very tricky error during migration of new/changed tables during boot to a MySQL5.7 database that presented as a misleading error to an Alembic error while it was SQLAlchemy.

All tables in the database are already in utf-8

The error manifested as:

Traceback (most recent call last):
File "./migration_manager.py", line 71, in <module>
diff = command.get_diff(alembic_cfg, message=str(timestamp), autogenerate=True, sql=False)
File "/home/alembic/command.py", line 86, in get_diff
script.run_env()
File "/home/alembic/script.py", line 208, in run_env
util.load_python_file(self.dir, 'env.py')
File "/home/alembic/util.py", line 230, in load_python_file
module = load_module_py(module_id, path)
File "/home/alembic/compat.py", line 63, in load_module_py
mod = imp.load_source(module_id, path, fp)
File "storage_repo/env.py", line 66, in <module>
run_migrations_online()
File "storage_repo/env.py", line 50, in run_migrations_online
connection = engine.connect()
File "/home/sqlalchemy/engine/base.py", line 1779, in connect
return self._connection_cls(self, **kwargs)
File "/home/sqlalchemy/engine/base.py", line 60, in __init__
self.__connection = connection or engine.raw_connection()
File "/home/sqlalchemy/engine/base.py", line 1848, in raw_connection
return self.pool.unique_connection()
File "/home/sqlalchemy/pool.py", line 280, in unique_connection
return _ConnectionFairy._checkout(self)
File "/home/sqlalchemy/pool.py", line 641, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/sqlalchemy/pool.py", line 440, in checkout
rec = pool._do_get()
File "/home/sqlalchemy/pool.py", line 1055, in _do_get
return self._create_connection()
File "/home/sqlalchemy/pool.py", line 285, in _create_connection
return _ConnectionRecord(self)
File "/home/sqlalchemy/pool.py", line 416, in __init__
exec_once(self.connection, self)
File "/home/sqlalchemy/event/attr.py", line 250, in exec_once
self(*args, **kw)
File "/home/sqlalchemy/event/attr.py", line 260, in __call__
fn(*args, **kw)
File "/home/sqlalchemy/util/langhelpers.py", line 1219, in go
return once_fn(*arg, **kw)
File "/home/sqlalchemy/engine/strategies.py", line 165, in first_connect
dialect.initialize(c)
File "/home/sqlalchemy/dialects/mysql/base.py", line 2362, in initialize
self._detect_ansiquotes(connection)
File "/home/sqlalchemy/dialects/mysql/base.py", line 2633, in _detect_ansiquotes
mode = row[1] or ''
File "/home/sqlalchemy/dialects/mysql/base.py", line 3135, in __getitem__
return item.decode(self.charset)
LookupError: unknown encoding: utf8mb4

and the solution was to modify the MySQL connection string when defining the engine uri, using the solution in the accepted answer:

return "mysql+mysqldb://" + user + ":" + password + "@" + host + port + "/" + database + '?charset=utf8mb4'

(I added this answer because there is nothing that I could find handling this specific & imho misleading error, in the hopes of saving another poor soul some time)

Adventitia answered 5/12, 2023 at 5:33 Comment(0)
E
0

For me, it worked to append this at the end of the connection string:

"?encoding=UTF-8&nencoding=UTF-8"

So this is my final string:

connection_string = f"oracle+cx_oracle://{usr}:{pwd}@{dsn_tns}?encoding=UTF-8&nencoding=UTF-8"
Eupheemia answered 29/4, 2024 at 9:0 Comment(3)
(Pro tip: update to use the replacement for cx_Oracle, see Using SQLAlchemy 2.0 with python-oracledb for Oracle Database).Southsoutheast
@ChristopherJones Thank you for this tip and the tutorial, but what are the advantages?Eupheemia
Generally faster. And you don't need to install Oracle Instant Client.Southsoutheast

© 2022 - 2025 — McMap. All rights reserved.