Specifying dtypes for read_sql in pandas

I would like to specify the dtypes returned when doing pandas.read_sql. In particular I am interested in saving memory and having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32) but that doesn't solve the problem of the large memory requirements in the initial query. In my actual code, I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows for specifying dtypes as a dict, but I see no way to do that with read_sql.

I am using MySQLdb and Python 2.7.

As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.
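
One partial workaround is to stream the result with read_sql's chunksize argument and downcast each chunk as it arrives; a sketch below (the chunk size is arbitrary, and with MySQLdb's default cursor the driver may still buffer the full result client-side):

import numpy as np
import pandas as pd

# Fetch the result in 1-million-row chunks; each chunk is downcast to
# float32 as it arrives, so only one chunk at a time is held as float64.
chunks = pd.read_sql('select ARP, ACP from train', connection, chunksize=1000000)
df = pd.concat(
    (chunk.astype(np.float32) for chunk in chunks),
    ignore_index=True,
)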

In [70]: df=pd.read_sql('select ARP, ACP from train where seq < 5', connection)

In [71]: df
Out[71]: 
   ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935


In [72]: df.dtypes
Out[72]: 
ARP    float64
ACP    float64
dtype: object
Edenedens answered 17/8, 2016 at 15:16 Comment(1)
That would also help avoid the conversion of integer columns to floats whenever you have NaNs.Slop

You can use pandas.read_sql_query, which allows you to specify the returned dtypes through its dtype argument (supported since pandas 1.3).

import numpy as np
import pandas as pd

df = pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                       dtype={'ARP': np.float32, 'ACP': np.float32})
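
You can then confirm the downcast and the memory saving (memory_usage is a standard DataFrame method; float32 halves the 8 bytes per value of float64):

df.dtypes                     # ARP and ACP now report float32
df.memory_usage(deep=True)    # about half the float64 footprint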

Holograph answered 29/9, 2021 at 19:45 Comment(0)

Regarding the aside in the question: "read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage."

Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust, aims to improve pandas.read_sql in both speed and memory usage, and provides a similar interface. To switch to it, you only need to:

import connectorx as cx
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"
df = cx.read_sql(conn_url, query)

The reason pandas.read_sql uses a lot of memory while running is its large intermediate Python objects; ConnectorX tackles this problem with Rust and stream processing.
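
For very large results like the 84 million rows in the question, ConnectorX can also split the query across parallel partitions via its partition_on/partition_num options; a sketch (this assumes seq is a numeric column suitable for partitioning):

import connectorx as cx

conn_url = "mysql://username:password@server:port/database"
query = "select seq, ARP, ACP from train"

# Partition on the numeric seq column and fetch four partitions in parallel.
df = cx.read_sql(conn_url, query, partition_on="seq", partition_num=4)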

Here are some benchmark results (the original post shows memory-usage charts):

  • PostgreSQL: [postgres memory chart]

  • MySQL: [mysql memory chart]

Tobacconist answered 9/11, 2021 at 22:9 Comment(2)
Looks like a nice tool!Edenedens
This tool is great. Querying ~5GB from my SQLite db took 23 seconds with ConnectorX compared to 91 seconds with pd.read_sql.Encyclopedist

What about CAST() and CONVERT()?

'SELECT CAST(ARP AS FLOAT), CAST(ACP AS FLOAT) FROM train WHERE seq < 5'

or something similar.

http://www.smallsql.de/doc/sql-functions/system/convert.html

Anglicist answered 6/11, 2017 at 12:24 Comment(1)
That is a very good suggestion, but I found 2 problems with it: 1. cast does not allow FLOAT (dev.mysql.com/doc/refman/5.6/en/cast-functions.html) 2. Even when trying DECIMAL(10,2), for example, the read_sql still returns float64 values. How annoying is that?Edenedens

Take a look at this GitHub issue; it looks like they are inclined to add the option.

Endophyte answered 30/4, 2019 at 15:23 Comment(0)

Regarding @Ohad Bruker's answer: since pandas 2.0.0 it is also possible to pass dtype directly to read_sql.
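
A minimal sketch (pandas >= 2.0, reusing the query and connection from the question):

import numpy as np
import pandas as pd

# As of pandas 2.0.0, read_sql itself accepts a dtype mapping.
df = pd.read_sql('select ARP, ACP from train where seq < 5', connection,
                 dtype={'ARP': np.float32, 'ACP': np.float32})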

This could have been a comment on that answer, but I am not authorized to comment yet.

Witless answered 31/10, 2023 at 15:9 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Malkamalkah
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From ReviewTrevino

CAST your columns to the desired types in your SQL. pandas will generally adopt the types the driver reports back; you can confirm this with df.info().
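
A sketch of that check (the CAST targets here are illustrative; valid target types depend on your database):

import pandas as pd

query = 'SELECT CAST(seq AS SIGNED) AS seq, ARP, ACP FROM train WHERE seq < 5'
df = pd.read_sql(query, connection)

# info() lists the dtype pandas ended up with for each column.
df.info()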

Zebapda answered 26/1, 2023 at 23:31 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Malkamalkah
