BigTable: Improve BigTable query performance using the Python client

We are facing some performance issues while querying BigTable.

We are getting about 500 rows/sec in the monitoring dashboard when querying about 1,400 rows, and the query takes about 1.6 seconds when running the following snippet on a machine in the same region as the BigTable instance:

import datetime
import struct

# `instance`, `table_name`, `column_family`, `row_filter`, and `row_set` are defined elsewhere.
partial_rows = instance.table(table_name).read_rows(filter_=row_filter, row_set=row_set)

ids = {}
for row in partial_rows:
    key = row.row_key.decode()
    category = key[0:5]                    # row key: {category}{id}_{year}{month}
    id_, year_month = key[5:].split('_')
    dates = ids.get(id_, {})

    for k, v in row.cells[column_family].items():
        sub_key = k.decode()
        day = sub_key[0:2]                 # column qualifier: {day}_{measurement_name}
        measurement = f'{category}_{sub_key[3:]}'

        date = datetime.date(int(year_month[0:4]), int(year_month[4:6]), int(day))
        value = struct.unpack('>i', v[0].value)[0]  # cells hold big-endian 32-bit ints

        measurements = dates.get(date, {})
        measurements[measurement] = value
        dates[date] = measurements

    ids[id_] = dates
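
For reference, row_filter and row_set are built elsewhere in our code; a minimal sketch of how they could be constructed with the same client follows (the project, instance, table, and column regex are placeholder assumptions, and the key range covers just one id for brevity):

from google.cloud import bigtable
from google.cloud.bigtable import row_filters
from google.cloud.bigtable.row_set import RowSet

# Placeholder ids; the real project/instance/table names live in our configuration.
client = bigtable.Client(project='my-project')
instance = client.instance('my-instance')
table_name = 'measurements'
column_family = 'cf'

# Keep only the columns for the measurement we need; qualifiers look like '{day}_{name}'.
row_filter = row_filters.ColumnQualifierRegexFilter(rb'\d\d_a')

# A contiguous range of monthly rows; the real row_set covers roughly 1,400 rows.
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=b'value4326_201901',
    end_key=b'value4326_201912',
    end_inclusive=True,
)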

The table schema we are using is:

  • Row key: {category}{id}_{year}{month}
  • Column qualifier: {day}_{measurement_name}

In our case, this schema completely follows BigTable's guidelines.
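
To make the schema concrete, here is a minimal write-side sketch; the category 'value', id '4326', measurement name 'a', and value 49 are hypothetical values chosen only to match the example dictionary further below, and the table handles are the same placeholders as above.

import struct

# Hypothetical example: category 'value' (5 characters), id '4326', January 2019,
# day '01', measurement name 'a', value 49.
row_key = b'value' + b'4326' + b'_' + b'201901'   # {category}{id}_{year}{month}
column = b'01' + b'_' + b'a'                      # {day}_{measurement_name}

row = instance.table(table_name).direct_row(row_key)
row.set_cell(column_family, column, struct.pack('>i', 49))  # big-endian 32-bit int
row.commit()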

The snippet is very simple: we do some string operations on the row key and the column qualifier to build an ids dictionary like the one below; a short worked example follows it. (Dates are shown here as strings for readability; the code actually keys by datetime.date objects.)

{
  "4326": {
    "2019-01-01": {
      "value_a": 49
    },
    "2019-01-02": {
      "value_a": 53
    },
    ...
  },
  "3857": {
    "2019-01-01": {
      "value_a": 56
    },
    "2019-01-02": {
      "value_a": 59
    },
    ...
  },
  ...
}
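
As a sanity check, the hypothetical cell from the write sketch above maps into that dictionary as follows:

import datetime
import struct

# The hypothetical cell written above: row key 'value4326_201901', column '01_a', value 49.
key = 'value4326_201901'
sub_key = '01_a'
raw = struct.pack('>i', 49)

category = key[0:5]                                # 'value'
id_, year_month = key[5:].split('_')               # '4326', '201901'
measurement = f'{category}_{sub_key[3:]}'          # 'value_a'
date = datetime.date(int(year_month[0:4]), int(year_month[4:6]), int(sub_key[0:2]))
assert (id_, date, measurement, struct.unpack('>i', raw)[0]) == \
    ('4326', datetime.date(2019, 1, 1), 'value_a', 49)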

Do you have any idea why we are getting such low read throughput?

Is there any way to obtain all the data at once instead of consuming the read_rows streamed generator? Or are there other options, either with the same Python client or a different one, such as an asynchronous or parallel approach?
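
By "something asynchronous" we mean, for example, sharding the row set across several workers that each issue their own read_rows call, roughly like the unbenchmarked sketch below (the split points, worker count, and key ranges are placeholders):

from concurrent.futures import ThreadPoolExecutor

from google.cloud.bigtable.row_set import RowSet


def read_shard(start_key, end_key):
    # Each worker reads one key range; this assumes the shared client/instance
    # objects are safe to use across threads for reads.
    shard = RowSet()
    shard.add_row_range_from_keys(start_key=start_key, end_key=end_key,
                                  end_inclusive=True)
    rows = instance.table(table_name).read_rows(filter_=row_filter, row_set=shard)
    return [(row.row_key, row.cells) for row in rows]


# Placeholder split points; in practice they would follow the real key distribution.
shards = [
    (b'value0000_201901', b'value4999_201912'),
    (b'value5000_201901', b'value9999_201912'),
]

with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    results = list(pool.map(lambda s: read_shard(*s), shards))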

Hara asked 12/11/2019 at 16:04. Comments (3):
Garonne: Could you post more information about the row filter you are using? A complex row filter can hurt query performance. Also, are you only reading small sections of the table? There is some startup cost when the client connection is first established; it is negligible when reading millions of rows, but for several hundred it could account for a noticeable part of your overall query time.

Hara: Hi @BillyJacobson, our row filter is pretty simple: \d\d_{column_name}. We have about 300 columns per row. When requesting only 30 columns versus reading the whole row with no filter, we do see that fetching 10 times more data does not take 10 times longer. Still, those times are too slow for us. [1/2]

Hara: Also, with a simpler filter that replaces the \d\d regex with the exact digits we need, we see some improvement, but only a few dozen milliseconds. About the table size: yes, we are dealing with millions of rows, so that is not the problem. We still think that consuming the data through the Python generator is the bottleneck here. [2/2]
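
For reference, the "simpler filter" described in the last comment could look roughly like this; the day list and the measurement suffix are placeholders:

from google.cloud.bigtable import row_filters

# Match only the specific days we need instead of the \d\d wildcard.
days = ['01', '02', '03']          # placeholder day list
measurement_name = 'a'             # placeholder measurement suffix
pattern = f"({'|'.join(days)})_{measurement_name}".encode()
simpler_filter = row_filters.ColumnQualifierRegexFilter(pattern)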
