I am facing a problem: a database for process plants. There are up to 50,000 sensors, each sampled every 50 ms. All measured values must be kept for at least 3 years, and the database must support real-time queries (i.e. users can see historical data with a delay of less than 1 second). I recently read an article about time-series databases, and there are many options: OpenTSDB, KairosDB, InfluxDB, ...
I am confused about which one would be suitable for this purpose. If anyone has experience with this, please help!
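To give a sense of the scale, here is a rough back-of-the-envelope calculation of the ingest rate and raw point count these requirements imply (assuming every one of the 50,000 sensors actually reports a value every 50 ms):

sensors = 50000             # number of sensors
interval_s = 0.050          # 50 ms sampling period

points_per_second = sensors / interval_s                   # 1,000,000 points/s
seconds_in_3_years = 3 * 365 * 24 * 3600
total_points = points_per_second * seconds_in_3_years

print("ingest rate : %.0f points/s" % points_per_second)   # ~1e6
print("3-year total: %.2e points" % total_points)          # ~9.5e13

So any candidate has to sustain roughly one million points per second and hold on the order of 10^14 raw points over the retention period.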
UPDATE 15.06.25
Today I ran a test with OpenTSDB. I used VirtualBox to create a cluster of 3 CentOS x64 VMs (1 master, 2 slaves). The host has 8 GB RAM and a Core i5. The master VM has 3 GB RAM, and each slave has 1.5 GB RAM. I wrote a Python program to send data to OpenTSDB via its telnet-style put interface, as below:
import socket
import time

# Connect to OpenTSDB's telnet-style "put" interface on the master VM
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.10.55", 4242))

start_time = time.time()
start_epoch = 1434192418

for x in range(0, 1000000):
    curr_epoch = start_epoch + x
    # One "put <metric> <timestamp> <value> <tag=value>" line per tag,
    # 10 tags for each timestamp, sent as a single batch
    msg = ""
    for i in range(1, 11):
        msg += "put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch)
    s.sendall(msg.encode())

print("--- %s seconds ---" % (time.time() - start_time))
I ran the Python program on the host, and it finished after ~220 seconds. Each loop iteration sends 10 records, so that is 10 million records in ~220 seconds, i.e. an average speed of ~45,000 records per second.
UPDATE 15.06.29
This time I used only 1 VM (5 GB RAM, 3 cores, CentOS x64, pseudo-distributed Hadoop). I ran 2 Python processes on the Windows 7 host, each sending one half of the data to OpenTSDB. The average put rate was ~100,000 records per second.
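For reference, a minimal sketch of how the two sender processes split the timestamp range (the half index passed as a command-line argument and the file name sender.py are just for illustration, not the exact script I ran):

import socket
import sys
import time

# Which half of the data this process sends: 0 or 1 (hypothetical CLI argument)
half = int(sys.argv[1])
total = 1000000
start, end = (0, total // 2) if half == 0 else (total // 2, total)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.10.55", 4242))

start_epoch = 1434192418
t0 = time.time()
for x in range(start, end):
    curr_epoch = start_epoch + x
    msg = ""
    for i in range(1, 11):
        msg += "put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch)
    s.sendall(msg.encode())

print("half %d: --- %s seconds ---" % (half, time.time() - t0))

The two halves were started at the same time, e.g. "python sender.py 0" and "python sender.py 1" in separate consoles.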