Pandas: How to insert dataframe into Clickhouse
Asked Answered
K

4

6

I am trying to insert a Pandas dataframe into Clickhouse.

This is my code

import pandas
import sqlalchemy as sa

uri = 'clickhouse://default:@localhost/default'
ch_db = sa.create_engine(uri)

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

pdf.to_sql('test_humans', ch_db, if_exists='append', index=False)

And this is the error that I am receiving. Does this have to do with some missing extra arguments regarding the ENGINE? How can I fix this?

Exception: Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 65 (line 7, col 2): FORMAT TabSeparatedWithNamesAndTypes. Expected one of: ENGINE, storage definition (version 19.15.2.2 (official build))

Logging

INFO:sqlalchemy.engine.base.Engine:EXISTS TABLE test_humans INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine: CREATE TABLE test_humans ( first_name TEXT, year BIGINT )

INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine:ROLLBACK

Koziarz answered 16/10, 2019 at 21:49 Comment(4)
Are you using clickhouse-dialect for sqlalchemy? Could you provide the schema description for test_humans-table?Cogency
Yes I am. When using to_sql(), pandas will usually create a schema automatically. I guess it has to do with exactly that. The create schema sql statement that the dialect is creating misses some proprietary command regarding "ENGINE".Koziarz
See my update.. some log information about the issued queries.Koziarz
which driver you use exacly github.com/mymarilyn/clickhouse-driver or github.com/cloudflare/sqlalchemy-clickhouse ?Putter
S
6

Starting with version 0.2.0 clickhouse_driver implements method insert_dataframe. See: https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.Client.insert_dataframe

Salaam answered 29/3, 2021 at 11:51 Comment(3)
This works very well. It is very easy, and is more efficient than using client.execute("INSERT INTO your_table VALUES", df.to_dict('records')) because it will transpose the DataFrame and send the data in columnar format. This doesn't do automatic table generation, but I wouldn't trust that anyway. Clickhouse has too many important decisions of schema design - especially partitioning - to leave this to autogen.Ordinary
Somehow I needed use_numpy=True. Client('localhost', settings={"use_numpy":True}) elseNerte
Would be helpful to include the actual example of usage as it is missing from official documentation: client.insert_dataframe('INSERT INTO your_table VALUES', df)Nippers
A
4

You can do it without sqlalchemy.

pip install clickhouse-driver

from clickhouse_driver import Client


client = Client('localhost')
df = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

client.insert_dataframe(
    'INSERT INTO "your_table" (year, first_name) VALUES',
    df,
    settings=dict(use_numpy=True),
)

# or 
client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))
Aerator answered 8/1, 2020 at 15:14 Comment(0)
W
3

You can also use https://github.com/kszucs/pandahouse to insert your dataframe without extra conversions.

pip install pandahouse

import pandahouse as ph

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

connection = dict(database='default',
                  host='localhost',
                  user='default',
                  password='')

ph.to_clickhouse(pdf, 'test_humans', index=False, chunksize=100000, connection=connection)
Wofford answered 22/1, 2020 at 18:14 Comment(0)
C
1

sqlalchemy-clickhouse cannot create table automatically (at least until version 0.1.5.post0 inclusively) and interprets any sql-query of table creation as SELECT-query that always tail by FORMAT TabSeparatedWithNamesAndTypes-clause.

To work around this issue need to create a table manually using the infi.clickhouse_orm-api (this module delivered with sqlalchemy-clickhouse):

import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine


# define the ClickHouse table schema
class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()
    engine = Memory()


engine = create_engine('clickhouse://default:@localhost/test')

# create table manually
with engine.connect() as conn:
    conn.connection.create_table(Test_Humans) # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)

Take into account that sqlalchemy-clickhouse ignores the last item so add fake one (see source code and related issue 10).

Let's check this table in DB:

SELECT *
FROM test.test_humans

┌─year─┬─first_name─┐
│ 1994 │ Vova       │
│ 1995 │ Anja       │
│ 1996 │ Vasja      │
│ 1997 │ Petja      │
└──────┴────────────┘

4 rows in set. Elapsed: 0.003 sec.
*/

See the answer https://mcmap.net/q/1769668/-how-to-insert-a-df-into-a-clickhouse-table-if-data-type-is-uuid-and-list-of-datetime too.

Cogency answered 17/4, 2020 at 11:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.