Using HBase to store time series data

Asked 8/11, 2010 at 17:21 Answered 16/6, 2013 at 22:59

We are trying to use HBase to store time-series data. The model we have currently stores the time-series as versions within a cell. This implies that the cell could end up storing millions of versions, and the queries on this time-series would retrieve a range of versions using the setTimeRange method available on the Get class in HBase.

e.g.

{
    "row1" : {
        "columnFamily1" : {
            "column1" : {
                1 : "1",
                2 : "2"
            },
            "column2" : {
                1 : "1"
            }
        }
    }
}

Is this a reasonable model to store time-series data in HBase?

Or would an alternative model be more suitable, storing the data across multiple columns (is it possible to query across columns?) or across multiple rows?

Monti answered 8/11, 2010 at 17:21 Comment(0)

I don't think you should use versioning to store the time series here. Not because it won't work, but because it's not designed for that particular use case and there are other ways.

I suggest you put the time step in the column qualifier and store the data itself as the value. Something like:

{
    "row1" : {
        "columnFamily1" : {
            "col1-000001" : "1",
            "col1-000002" : "2",
            "col1-000003" : "91",
            "col2-000001" : "31"
        }
    }
}

One nice thing here is that HBase stores the column qualifiers in sorted order, so when reading the time series back you should see the items in order.
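
The ordering is byte-wise lexicographic, which is why the qualifiers above are zero-padded. Here is a minimal Python sketch of that effect, using a plain sorted list as a stand-in for HBase's qualifier ordering. The `qualifier` helper and the six-digit width are assumptions for illustration, not part of any HBase API:

```python
def qualifier(column: str, step: int, width: int = 6) -> bytes:
    """Build a qualifier such as b'col1-000002' (hypothetical helper)."""
    return f"{column}-{step:0{width}d}".encode("ascii")

steps = [1, 2, 10, 100]

# Zero-padded qualifiers: byte order matches numeric order.
padded = sorted(qualifier("col1", s) for s in steps)

# Unpadded qualifiers: byte order puts step 10 and 100 before step 2.
unpadded = sorted(f"col1-{s}".encode("ascii") for s in steps)

print(padded)
print(unpadded)
```

Without the padding, step 10 would be read back before step 2, so pick a width large enough for the longest series you expect.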

Another realistic option would be to use the series identifier as the first part of the rowkey and append the time step to it. Something like:

{
    "fooseries-00001" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "fooseries-00002" : {
        "columnFamily1" : {
            "val" : "2"
        }
    }
}

This has the nice feature that range scans within a particular series become easy. For example, pulling out steps 104 to 199 of fooseries is trivial to implement and efficient, because those rows are stored next to each other.
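
As a rough illustration of why this works, the sketch below simulates a scan over sorted row keys in plain Python. It assumes the usual scan convention of an inclusive start row and an exclusive stop row, so the stop key is built from step 200. `row_key` and `scan` are hypothetical helpers, not HBase client calls:

```python
from bisect import bisect_left

def row_key(series: str, step: int, width: int = 5) -> bytes:
    """Build a row key such as b'fooseries-00104' (hypothetical helper)."""
    return f"{series}-{step:0{width}d}".encode("ascii")

def scan(sorted_keys, start_row: bytes, stop_row: bytes):
    """Return keys in [start_row, stop_row), mimicking a range scan."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Two series interleaved in one table, kept in sorted key order.
table = sorted(
    [row_key("fooseries", s) for s in range(1, 301)]
    + [row_key("barseries", s) for s in range(1, 301)]
)

hits = scan(table, row_key("fooseries", 104), row_key("fooseries", 200))
print(len(hits), hits[0], hits[-1])  # 96 b'fooseries-00104' b'fooseries-00199'
```

The scan touches only the contiguous slice of keys for that series and step range, which is what makes this layout efficient for range reads.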

The downside to this one is that deleting an entire series requires a bit more management and synchronization, since the series is spread over many rows. Another downside is that MapReduce jobs will have a harder time analyzing this data. With the column-qualifier approach, the entire time series is passed to a single map() call, while here map() is called once for each time step.
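
To make the difference concrete, here is a small Python sketch (not MapReduce code) of the extra grouping step the rowkey layout needs before any whole-series logic can run. The sample rows and the `series_of` helper are made up for illustration:

```python
from itertools import groupby

# Rowkey layout: one row per (series, step), already sorted by key.
# The values are hypothetical sample data.
tall_rows = [
    (b"barseries-00001", 7),
    (b"fooseries-00001", 1),
    (b"fooseries-00002", 2),
]

def series_of(row_key: bytes) -> bytes:
    """Everything before the last '-' is the series identifier."""
    return row_key.rsplit(b"-", 1)[0]

# Each map() call would see one row only, so the series has to be
# reassembled by a grouping step (the role a reducer would play).
series = {
    name: [value for _, value in rows]
    for name, rows in groupby(tall_rows, key=lambda kv: series_of(kv[0]))
}
print(series)  # {b'barseries': [7], b'fooseries': [1, 2]}
```

With the column-qualifier layout this regrouping is unnecessary, because each row already holds the full series.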

Pornography answered 26/4, 2012 at 3:19 Comment(1)

Whether you put the time step in the rowkey or column qualifier is probably best determined by your data access pattern. If you typically tend to get all the columns in your scans and deletion isn't a primary scenario, then the rowkey design makes a ton of sense. – Breakneck 5/12, 2013 at 0:46

+1 for OpenTSDB. It uses many tricks to simplify time-based rollup queries.

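As for the original question, you can have as many cell versions as you want (there is no hard limit, though the column family's maximum-versions setting has to be raised so that they are retained). There is no performance penalty: a Get is implemented as a Scan anyway in HBase, and setTimeRange is quite an effective filter.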
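
For reference, the time range is half-open: it includes the minimum timestamp and excludes the maximum, and versions come back newest first. Below is a minimal Python model of that behaviour on the question's `column1` cell, extended with two extra made-up versions. `get_time_range` is a hypothetical stand-in, not the HBase client method, and a real Get would also need its own max-versions setting raised to return more than the newest version:

```python
def get_time_range(versions: dict, min_stamp: int, max_stamp: int):
    """Return (timestamp, value) pairs with min_stamp <= ts < max_stamp,
    newest first, as a model of a time-range read over cell versions."""
    return sorted(
        ((ts, v) for ts, v in versions.items() if min_stamp <= ts < max_stamp),
        reverse=True,
    )

# One cell holding several versions, keyed by timestamp.
column1 = {1: "1", 2: "2", 3: "3", 4: "4"}

print(get_time_range(column1, 2, 4))  # [(3, '3'), (2, '2')]
```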

Obloquy answered 16/6, 2013 at 22:59 Comment(0)
