How to insert a DF into a clickhouse table, if data type is uuid and list of datetime?
Asked Answered
A

2

1

I have a df as a result of a sql query, rows are as follows:

UserID:  <class 'uuid.UUID'> 6f6e526c-acd8-496f-88e0-6bfdd594e2c2
ObjectID:  <class 'int'> 6095016
ObjectClass:  <class 'str'> cottages
Views:  <class 'list'> [datetime.datetime(1970, 1, 1, 0, 0)]
RecDate:  <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2021-07-13 15:50:32
Events:  <class 'list'> ['']

And a clickhouse table:

CREATE TABLE default.rec_eval_data
(
    `UserID` UUID,
    `ObjectID` Int32,
    `ObjectClass` String,
    `Views` Array(DateTime),
    `RecDate` DateTime,
    `Events` Array(String)
)
ENGINE = ReplacingMergeTree
ORDER BY (UserID, ObjectID)
SETTINGS index_granularity = 8192

I'm trying different ways to insert the DF into the table, but i keep getting various errors.

I'm using clickhouse_driver library to perform stuff.

I have read This thread and used all the methods, but to no avail

What I have tried already:

  1. Native:

    client.execute("INSERT INTO rec_eval_data VALUES", data.to_dict(orient='records'), types_check=True)

Error: clickhouse_driver.errors.TypeMismatchError: Code: 53. Type mismatch in VALUES section. Repeat query with types_check=True for detailed info. Column Views: argument out of range

  1. Pandahouse:

    connection = dict(database='default', host='http://localhost:8123', user='default', schema='default', password='')

    ph.to_clickhouse(data, 'rec_eval_data', index=False, chunksize=100000, connection=connection)

Error: It uses http and GET method, which automatically acts in readonly mode, so I could not proceed. Maybe there's a way to change method to POST?

  1. clickhouse_driver insert_dataframe:

    client.insert_dataframe('INSERT INTO rec_eval_data VALUES ', data)

Error: TypeError: Unsupported column type: <class 'numpy.ndarray'>. list or tuple is expected.

  1. iteration:

    for date, row in data.T.iteritems():

    client.execute("INSERT INTO rec_eval_data " "(UserID, "
    "ObjectID, "
    "ObjectClass, "
    "Views, "
    "RecDate, "
    "Events)"
    " VALUES " "({UserID}, " "{ObjectID}, " "{ObjectClass}, " "{Views}, " "{RecDate}, " "{Events}) " .format( UserID=UserID, ObjectID=row['ObjectID'], ObjectClass=row['ObjectClass'], Views=row['UserID'], RecDate=row['RecDate'], Events=row['Events']))

Error: It tries to split UserID into pieces. Can't find how to avoid it: DB::Exception: Missing columns: '6bfdd594e2c2' '496f' 'acd8' '6f6e526c' while processing query: '(((6f6e526c - acd8) - 496f) - 88.) - 6bfdd594e2c2', required columns: '6f6e526c' 'acd8' '496f' '6bfdd594e2c2' '6f6e526c' 'acd8' '496f' '6bfdd594e2c2'.

Please, help, I can't fix it( I'm new both to CH and pandas(

Stack trace:

Antigen answered 18/8, 2021 at 10:26 Comment(0)
C
1

Try this:

! this code works only when for column Events Array(String) passed the empty array otherwise got the error "AttributeError: 'list' object has no attribute 'tolist'" (it looks like it is the bug in clickhouse_driver).

from datetime import datetime
from uuid import UUID
from clickhouse_driver import Client
import pandas as pd

client = Client(host='localhost', settings={'use_numpy': True})


def get_inserted_data():
    return [
            {
                'UserID': UUID('417ddc5d-e556-4d27-95dd-a34d84e40003'),
                'ObjectID': 1003,
                'ObjectClass': 'Class3',
                'Views': [datetime.now(), datetime.now()],
                'RecDate': datetime.now(),
                #'Events': ['aa', 'bb'] # got error "AttributeError: 'list' object has no attribute 'tolist'"
                'Events': []
            }
        ]

data = []

for item in get_inserted_data():
    data.append([
        item['UserID'],
        item['ObjectID'],
        item['ObjectClass'],
        item['Views'],
        item['RecDate'],
        item['Events']
    ])

client.insert_dataframe(
    'INSERT INTO test.rec_eval_data VALUES',
    pd.DataFrame(data, columns=['UserID', 'ObjectID', 'ObjectClass', 'Views', 'RecDate', 'Events'])
)
from clickhouse_driver import Client
from iso8601 import iso8601

client = Client(host='localhost')

client.execute(
    'INSERT INTO test.rec_eval_data (UserID, ObjectID, ObjectClass, Views, RecDate, Events) VALUES',
    [{
        'UserID': '417ddc5d-e556-4d27-95dd-a34d84e40002',
        'ObjectID': 1002,
        'ObjectClass': 'Class2',
        'Views': [iso8601.parse_date('2021-08-02 01:00:00'), iso8601.parse_date('2021-08-03 01:00:00')],
        'RecDate': iso8601.parse_date('2021-08-02 01:00:00'),
        'Events': ['03', '04']
    }])
import requests

CH_USER = 'default'
CH_PASSWORD = ''
SSL_VERIFY = False

host = 'http://localhost:8123'
db = 'test'
table = 'rec_eval_data'

content = 'UserID\tObjectID\tObjectClass\tViews\tRecDate\tEvents' \
          '\n417ddc5d-e556-4d27-95dd-a34d84e46a50\t1001\tClass1\t[\'2021-08-01 00:00:00\',\'2021-08-02 00:00:00\']\t2021-08-01 00:00:00\t[\'01\',\'02\']'

content = content.encode('utf-8')
query_dict = {
    'query': 'INSERT INTO ' + db + '.' + table + ' FORMAT TabSeparatedWithNames '
}

r = requests.post(host, data=content, params=query_dict, auth=(CH_USER, CH_PASSWORD), verify=SSL_VERIFY)

print(r.text)
Chronology answered 19/8, 2021 at 7:29 Comment(0)
W
0
  1. Native
    client.execute('INSERT INTO rec_eval_data VALUES', data.to_dict('records'), settings={'max_partitions_per_insert_block':0})
Wail answered 8/12, 2021 at 13:40 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Tuning

© 2022 - 2024 — McMap. All rights reserved.