I am very confused with the way charset and encoding work in SQLAlchemy. I understand (and have read) the difference between charsets and encodings, and I have a good picture of the history of encodings.
I have a table in MySQL in latin1_swedish_ci (Why? Possible because of this). I need to create a pandas dataframe in which I get the proper characters (and not weird symbols). Initially, this was in the code:
connect_engine = create_engine('mysql://user:[email protected]/db')
sql_query = "select * from table1"
df = pandas.read_sql(sql_query, connect_engine)
We started having troubles with the Š
character (corresponding to the u'\u0160'
unicode, but instead we get '\x8a'). I expected this to work:
connect_engine = create_engine('mysql://user:[email protected]/db', encoding='utf8')
but, I continue getting '\x8a'
, which, I realized, makes sense given that the default of the encoding parameter is utf8
. So, then, I tried encoding='latin1'
to tackle the problem:
connect_engine = create_engine('mysql://user:[email protected]/db', encoding='latin1')
but, I still get the same '\x8a'. To be clear, in both cases (encoding='utf8'
and encoding='latin1'
), I can do mystring.decode('latin1')
but not mystring.decode('utf8')
.
And then, I rediscovered the charset
parameter in the connection string, i.e. 'mysql://user:[email protected]/db?charset=latin1'
. And after trying all possible combinations of charset and encoding, I found that this one work:
connect_engine = create_engine('mysql://user:[email protected]/db?charset=utf8')
I would appreciate if somebody can explain me how to correctly use the charset
in the connection string, and the encoding
parameter in the create_engine
?