Create sql table from dask dataframe using map_partitions and pd.df.to_sql

Asked 24/1, 2019 at 10:45 Answered 8/10, 2020 at 13:16

Solved python postgresql pandas dask pandas-to-sql

Dask doesn't have a df.to_sql() like pandas and so I am trying to replicate the functionality and create an sql table using the map_partitions method to do so. Here is my code:

import dask.dataframe as dd
import pandas as pd
import sqlalchemy_utils as sqla_utils

db_url = 'my_db_url_connection'
conn = sqla.create_engine(db_url)

ddf = dd.read_csv('data/prod.csv')
meta=dict(ddf.dtypes)
ddf.map_partitions(lambda df: df.to_sql('table_name', db_url, if_exists='append',index=True), ddf, meta=meta)

This returns my dask dataframe object, but when I go look into my psql server there's no new table... what is going wrong here?

UPDATE Still can't get it to work, but due to independent issue. Follow-up question: duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Zeigler answered 24/1, 2019 at 10:45 Comment(0)

Simply, you have created a dataframe which is a prescription of the work to be done, but you have not executed it. To execute, you need to call .compute() on the result.

Note that the output here is not really a dataframe, each partition evaluates to None (because to_sql has no output), so it might be cleaner to express this with df.to_delayed, something like

dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'table_name', db_url, if_exists='append', index=True)
       for d in ddf.to_delayed()]
dask.compute(*out)

Also note, that whether you get good parallelism will depend on the database driver and the data system itself.

Bolster answered 24/1, 2019 at 14:57 Comment(2)

Thanks for this! Your last phrase is a good point. Do you know whether postgresql would support parallelism? Is there any point in doing this at all or better just to call ddf.compute().to_sql(...)? I was hoping dask would speed things up as converting to sql is very slow, but now I see it may not be possible. – Zeigler 24/1, 2019 at 16:34

follow-up question #54352283 – Zeigler 24/1, 2019 at 16:57

UPDATE : Dask to_sql() is now available https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_sql

Piercy answered 8/10, 2020 at 13:16 Comment(1)

Your answer doesn't provide any new solutions to this question. You should consider adding your answer as a comment to the accepted answer instead. – Pluvious 8/10, 2020 at 14:42

Recommended topics

Hot tags