When I use pandas to read a MySQL table, some columns (see 'to_nlc' in the image below) that used to have an integer type come back as floating-point numbers (a .0 is automatically appended).
Can anyone figure out why this happens? Or at least have a guess?
The problem is that your data contains NaN values, so int is automatically cast to float.
I think you can check the docs on NA type promotions:
When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice. Some explanation for the motivation is in the next section.
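The promotion is easy to reproduce without any database involved; a minimal sketch:
import pandas as pd

s = pd.Series([1, 2, 3])   # dtype: int64
s = s.reindex(range(4))    # reindexing introduces a NaN at index 3
print(s.dtype)             # float64 -- the ints were promoted to store the NaN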
Can you convert NaN to 0? Like df['colname'] = df['colname'].fillna(0).astype(int) or df = df.fillna(0).astype(int)? – Spitball
See also the docs on integer NA: astype(int) raises an error while NAs are still present, rather than converting, hence the fillna first. – Spitball
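A short sketch of the fillna-then-cast approach from the comment above (the column name is taken from the question):
import pandas as pd

df = pd.DataFrame({'to_nlc': [100, None, 300]})    # NULL arrives as NaN, dtype float64
# df['to_nlc'].astype(int)                         # would raise: cannot convert non-finite values
df['to_nlc'] = df['to_nlc'].fillna(0).astype(int)  # NaN -> 0, dtype back to int64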
As already said, the problem is that pandas' integer dtype cannot handle NULL/NA values.
You can replace read_sql_table with read_sql and convert NULL to some integer value (for example 0 or -1, whatever plays the role of NULL in your setting):
df = pandas.read_sql("SELECT col1, col2, IFNULL(col3, 0) AS col3 FROM table", engine)
Here col3 can be NULL in MySQL; IFNULL returns 0 when it is NULL, and the value of col3 otherwise.
Or the same thing with a little helper function:
def read_sql_table_with_nullcast(table_name, engine, null_cast=None):
    """
    table_name - table name
    engine     - SQLAlchemy engine
    null_cast  - dictionary of columns in which to replace NULL:
                 column name as key, value to replace with as value.
                 For example {'col3': 0} will set all NULLs in col3 to 0.
    """
    import pandas
    null_cast = null_cast or {}  # avoid a mutable default argument
    # List the table's columns, then wrap the ones named in null_cast in IFNULL().
    cols = pandas.read_sql("SHOW COLUMNS FROM " + table_name, engine)
    cols_call = [c if c not in null_cast
                 else "IFNULL(%s,%d) AS %s" % (c, null_cast[c], c)
                 for c in cols['Field']]
    sel = ",".join(cols_call)
    return pandas.read_sql("SELECT " + sel + " FROM " + table_name, engine)

read_sql_table_with_nullcast("table", engine, {'col3': 0})
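To illustrate, with null_cast={'col3': 0} and assuming the table has columns col1, col2 and col3, the helper would issue:
SELECT col1,col2,IFNULL(col3,0) AS col3 FROM table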
As another answer states:
The problem is that your data contains NaN values, so int is automatically cast to float.
The solution, in my case, was to use the pd.read_sql option dtype_backend='numpy_nullable'. This option was new in version 2.0, and the value that worked for me, 'numpy_nullable', became the default in pandas 2.1.0. I'm on pandas 2.0.3, and had to use this option.
With this option, my ints remained ints despite nulls in some of the columns.
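A minimal sketch of that call (requires pandas >= 2.0; the query and engine are placeholders):
import pandas as pd

df = pd.read_sql("SELECT * FROM table", engine, dtype_backend="numpy_nullable")
print(df.dtypes)  # integer columns come back as nullable Int64, with NULL stored as <NA>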
You can use the parameter coerce_float=False:
df = pd.read_sql(sql, con=conn, coerce_float=False)
From the docs (https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html):
coerce_float : bool, default True
    Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
Another possibility is to exclude NULL values in the WHERE clause of your SQL query, if you're not expecting them and they correspond to unusable rows. This won't be suitable in all circumstances, but it is a clean and simple option when it does apply.
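For example, reusing the question's to_nlc column (table name is a placeholder):
df = pd.read_sql("SELECT * FROM table WHERE to_nlc IS NOT NULL", engine)
# to_nlc contains no NULLs now, so pandas keeps it as int64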