storing telemetry data from 10000s of nodes
Asked Answered
C

2

8

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.

I'm having a bit of trouble deciding what the best storage solution would be..

Each node has a unique ID, and there will be a timestamp for each packet of variables. (probably will need to be generated by the server).

The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key. The size of each telemetry packet is 64 bytes, including the device ID and timestamp. So around 100Gb+ per year.

I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.

Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards this. If I were to go for MySQL would it make sense to have a separate table for each device ID? - Would this make queries much faster or would having 10000s of tables be a problem?

I don't think querying the variables from all devices in one go is going to be needed but it might be. Or should I just stick it all in a single table and use MySQL cluster if it gets really big?

Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.

I'd also like it to be easy to replicate the data and back it up.

Any ideas or if anyone has done anything similar you input would be greatly appreciated!

Colchester answered 9/5, 2012 at 21:34 Comment(0)
D
6

Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.

I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.

HBase time-series databases:

Dichy answered 28/7, 2015 at 18:7 Comment(0)
F
3

If you want to go with MySQL, keep in mind that although it will keep on going when you throw something like a 100GB per year at it easily on modern hardware, do be advised that you cannot execute schema changes afterwards (on a live system). This means you'll have to have a good, complete database schema to begin with.

I don't know if this telemetry data might grow more features, but if they do, you don't want to have to lock your database for hours if you need to add a column or index.

However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays which make such changes somewhat easier. No performance problems to be expected here, as long as you stay with InnoDB.

Another option might be to go with PostgreSQL, which allows you to change schemas online, and sometimes is somewhat smarter about the use of indexes. (For example, http://kb.askmonty.org/en/index-condition-pushdown is a new trick for MySQL/MariaDB, and allows you to combine two indices at query time. PostgreSQL has been doing this for a long time.)

Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.

If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.

Faught answered 9/5, 2012 at 22:29 Comment(2)
Thanks for your input. I think that if schema changes are required then I'll be creating a new table that accepts data for that schema, and each device would therefore know what schema it is using, so there will not need to be schema changes on the table (which I'd definitely want to avoid).Colchester
This isn't really the answer I was looking for, but in the absence of any other answers I'll set this as accepted.Colchester

© 2022 - 2024 — McMap. All rights reserved.