We are facing some performance issues while querying BigTable.
The monitoring dashboard shows about 500 rows/sec when we query for about 1,400 rows, and the following snippet takes about 1.6 seconds on a machine in the same region as the BigTable instance:
```python
import datetime
import struct

# instance, table_name, row_filter, row_set and column_family are set up elsewhere
partial_rows = instance.table(table_name).read_rows(filter_=row_filter, row_set=row_set)

ids = {}
for row in partial_rows:
    key = row.row_key.decode()
    category = key[0:5]
    id_, year_month = key[5:].split('_')
    dates = ids.get(id_, {})
    for k, v in row.cells[column_family].items():
        sub_key = k.decode()
        day = sub_key[0:2]
        measurement = f'{category}_{sub_key[3:]}'
        date = datetime.date(int(year_month[0:4]), int(year_month[4:6]), int(day))
        value = struct.unpack('>i', v[0].value)[0]  # cell values are big-endian int32
        measurements = dates.get(date, {})
        measurements[measurement] = value
        dates[date] = measurements
    ids[id_] = dates
```
The table schema we are using is:

- Row key: `{category}{id}_{year}{month}`
- Column qualifier: `{day}_{measurement_name}`

This schema follows BigTable's schema design guidelines.
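For illustration, a single cell under this schema is addressed as shown below (hypothetical values; judging by the output further down, the 5-character category prefix is the string "value"):

```python
import struct

# Hypothetical cell, assuming a 5-character category prefix of "value":
row_key = b"value4326_201901"       # {category}{id}_{year}{month}
column = b"01_a"                    # {day}_{measurement_name}
cell_value = struct.pack('>i', 49)  # values are stored as big-endian int32

key = row_key.decode()
print(key[0:5])                                    # 'value'             -> category
print(key[5:].split('_'))                          # ['4326', '201901']  -> id, year+month
print(column.decode()[0:2], column.decode()[3:])   # '01' 'a'            -> day, measurement name
print(struct.unpack('>i', cell_value)[0])          # 49
```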
The snippet is straightforward: we perform some operations on the row key and the column qualifier to build an `ids` dictionary that looks like this:
```
{
    "4326": {
        "2019-01-01": {
            "value_a": 49
        },
        "2019-01-02": {
            "value_a": 53
        },
        ...
    },
    "3857": {
        "2019-01-01": {
            "value_a": 56
        },
        "2019-01-02": {
            "value_a": 59
        },
        ...
    },
    ...
}
```
Do you have any idea why we are getting such low read throughput? Is there any way to obtain all the data at once rather than consuming the `read_rows` streamed generator? Or are there other solutions, with the same Python client or a different one, such as something asynchronous?
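To illustrate the kind of asynchronous approach we have in mind (purely a hypothetical sketch we have not benchmarked, assuming the row keys can be split into disjoint row sets), something like fanning the read out over a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud.bigtable.row_set import RowSet

# instance, table_name and row_filter as in the snippet above;
# all_keys would hold the ~1400 row keys we query (built elsewhere).
all_keys = []

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def read_chunk(keys):
    row_set = RowSet()
    for key in keys:
        row_set.add_row_key(key)
    # Each worker opens its own stream and drains it into a list.
    table = instance.table(table_name)
    return list(table.read_rows(filter_=row_filter, row_set=row_set))

with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = pool.map(read_chunk, chunked(all_keys, 200))

rows = [row for chunk in chunks for row in chunk]
```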
Update (from a follow-up comment): We filter columns with a qualifier regex of the form `\d\d_{column_name}`. In our case we have about 300 columns per row. When requesting only 30 columns versus reading the whole row with no filter, fetching 10 times more data does not take 10 times longer, so the filter only helps so much; either way, the resulting times are not usable for us. Replacing the `\d\d` with the specific digits we need brings some improvement, but only a few dozen milliseconds. About the table size: yes, we are dealing with millions of rows, so that is not the problem. We still think that consuming the data through the Python generator is the bottleneck here. – Hara
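For reference, the kind of column filter described in that comment can be built along these lines with the same Python client (a sketch; the measurement names below are placeholders):

```python
from google.cloud.bigtable import row_filters

# Keep only qualifiers of the form "\d\d_{measurement_name}" for the
# ~30 measurements actually needed (names here are placeholders).
wanted = ('a', 'b', 'c')
pattern = '\\d\\d_(' + '|'.join(wanted) + ')'
row_filter = row_filters.ColumnQualifierRegexFilter(pattern.encode())
```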