Pandas: How to insert dataframe into Clickhouse

Asked 16/10, 2019 at 21:49 Answered 29/3, 2021 at 11:51

I am trying to insert a Pandas dataframe into Clickhouse.

This is my code

import pandas
import sqlalchemy as sa

uri = 'clickhouse://default:@localhost/default'
ch_db = sa.create_engine(uri)

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

pdf.to_sql('test_humans', ch_db, if_exists='append', index=False)

And this is the error that I am receiving. Does this have to do with some missing extra arguments regarding the ENGINE? How can I fix this?

Exception: Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 65 (line 7, col 2): FORMAT TabSeparatedWithNamesAndTypes. Expected one of: ENGINE, storage definition (version 19.15.2.2 (official build))

Logging

INFO:sqlalchemy.engine.base.Engine:EXISTS TABLE test_humans INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine: CREATE TABLE test_humans ( first_name TEXT, year BIGINT )

INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine:ROLLBACK

Koziarz answered 16/10, 2019 at 21:49 Comment(4)

Are you using clickhouse-dialect for sqlalchemy? Could you provide the schema description for test_humans-table? – Cogency 17/10, 2019 at 0:31

Yes I am. When using to_sql(), pandas will usually create a schema automatically. I guess it has to do with exactly that. The create schema sql statement that the dialect is creating misses some proprietary command regarding "ENGINE". – Koziarz 17/10, 2019 at 5:23

See my update.. some log information about the issued queries. – Koziarz 17/10, 2019 at 5:25

which driver you use exacly github.com/mymarilyn/clickhouse-driver or github.com/cloudflare/sqlalchemy-clickhouse ? – Putter 18/10, 2019 at 6:21

Starting with version 0.2.0 clickhouse_driver implements method insert_dataframe. See: https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.Client.insert_dataframe

Salaam answered 29/3, 2021 at 11:51 Comment(3)

This works very well. It is very easy, and is more efficient than using client.execute("INSERT INTO your_table VALUES", df.to_dict('records')) because it will transpose the DataFrame and send the data in columnar format. This doesn't do automatic table generation, but I wouldn't trust that anyway. Clickhouse has too many important decisions of schema design - especially partitioning - to leave this to autogen. – Ordinary 28/7, 2021 at 9:23

Somehow I needed use_numpy=True. Client('localhost', settings={"use_numpy":True}) else – Nerte 26/7, 2022 at 1:59

Would be helpful to include the actual example of usage as it is missing from official documentation: client.insert_dataframe('INSERT INTO your_table VALUES', df) – Nippers 9/1 at 15:18

You can do it without sqlalchemy.

pip install clickhouse-driver

from clickhouse_driver import Client


client = Client('localhost')
df = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

client.insert_dataframe(
    'INSERT INTO "your_table" (year, first_name) VALUES',
    df,
    settings=dict(use_numpy=True),
)

# or 
client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))

Aerator answered 8/1, 2020 at 15:14 Comment(0)

You can also use https://github.com/kszucs/pandahouse to insert your dataframe without extra conversions.

pip install pandahouse

import pandahouse as ph

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

connection = dict(database='default',
                  host='localhost',
                  user='default',
                  password='')

ph.to_clickhouse(pdf, 'test_humans', index=False, chunksize=100000, connection=connection)

Wofford answered 22/1, 2020 at 18:14 Comment(0)

sqlalchemy-clickhouse cannot create table automatically (at least until version 0.1.5.post0 inclusively) and interprets any sql-query of table creation as SELECT-query that always tail by FORMAT TabSeparatedWithNamesAndTypes-clause.

To work around this issue need to create a table manually using the infi.clickhouse_orm-api (this module delivered with sqlalchemy-clickhouse):

import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine


# define the ClickHouse table schema
class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()
    engine = Memory()


engine = create_engine('clickhouse://default:@localhost/test')

# create table manually
with engine.connect() as conn:
    conn.connection.create_table(Test_Humans) # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)

Take into account that sqlalchemy-clickhouse ignores the last item so add fake one (see source code and related issue 10).

Let's check this table in DB:

SELECT *
FROM test.test_humans

┌─year─┬─first_name─┐
│ 1994 │ Vova       │
│ 1995 │ Anja       │
│ 1996 │ Vasja      │
│ 1997 │ Petja      │
└──────┴────────────┘

4 rows in set. Elapsed: 0.003 sec.
*/

See the answer https://mcmap.net/q/1769668/-how-to-insert-a-df-into-a-clickhouse-table-if-data-type-is-uuid-and-list-of-datetime too.

Cogency answered 17/4, 2020 at 11:46 Comment(0)

Recommended topics

Hot tags