Storing time-series data, relational or non?
Asked Answered
P

8

197

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.

I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:

Which is better with regard to performance when querying data for graphing: a relational database (such as MySQL or PostgreSQL) or a non-relational/NoSQL database (such as MongoDB or Redis)?

Relational

Given a relational database, I would use a single data_instances table, which would store every instance of data captured for every metric being measured on every device, with the following fields:

Fields: id fk_to_device fk_to_metric metric_value timestamp
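
A minimal DDL sketch of such a table (PostgreSQL syntax; the column types, and any parent devices/metrics tables, are assumptions, since only the field names are given above):

    CREATE TABLE data_instances (
        id            BIGSERIAL PRIMARY KEY,      -- surrogate key, as described above
        fk_to_device  INTEGER          NOT NULL,  -- would reference devices(id), an assumed parent table
        fk_to_metric  INTEGER          NOT NULL,  -- would reference metrics(id), an assumed parent table
        metric_value  DOUBLE PRECISION NOT NULL,  -- assuming numeric samples
        timestamp     TIMESTAMPTZ      NOT NULL   -- time the sample was captured
    );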

When I want to draw a graph for a particular metric on a particular device, I must query this single table, filtering out the other devices and the other metrics being analysed for this device:

SELECT metric_value, timestamp FROM data_instances
    WHERE fk_to_device=1 AND fk_to_metric=2

The number of rows in this table would be:

d * m_d * f * t

where d is the number of devices, m_d is the number of metrics being recorded per device (on average), f is the frequency at which data is polled and t is the total amount of time the system has been collecting data.

For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.

Indexes

Without indexes on fk_to_device and fk_to_metric, scanning this continuously expanding table would take too much time. So indexing those fields, and also timestamp (for creating graphs over localised periods), is a requirement.
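
A sketch of that indexing, together with a range query over a localised period (the composite index and its name are one reasonable choice, not the only one):

    CREATE INDEX idx_device_metric_time
        ON data_instances (fk_to_device, fk_to_metric, timestamp);

    -- graph one metric of one device over one week
    SELECT metric_value, timestamp FROM data_instances
        WHERE fk_to_device = 1 AND fk_to_metric = 2
          AND timestamp >= '2011-01-20' AND timestamp < '2011-01-27';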

Non-Relational (NoSQL)

MongoDB has the concept of a collection; unlike tables, collections can be created programmatically, without prior setup. With these I could partition the storage of data per device, or even per metric recorded for each device.

I have no experience with NoSQL and do not know whether such databases offer any query-performance features such as indexing; however, the previous paragraph proposes doing most of the traditional relational query work through the structure in which the data is stored under NoSQL.

Undecided

Would a relational solution with correct indexing slow to a crawl within a year? Or does the collection-based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Pontine answered 27/1, 2011 at 8:31 Comment(2)
Very valid question. I myself have pondered whether a relational DB is the right way to store a data structure which is actually hierarchical (the SNMP structure). Sometimes when I write a query to fetch even trivial data, the query is over-complicated; I felt the data had to be mangled into a form which is not its own. For example, matching ifNames and their indexes is supposedly a trivial task, both being children of the same parent OID. But the way it is stored in a relational DB doesn't relate to its original structure, and I feel it would be more efficient to store it in a hierarchical fashion.Piezoelectricity
"For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records." Isn't 10 * 3 * 365 * 24 * 12 approximately equal to 3 million which is not just under 5 million?Stationery
O
161

Definitely Relational. Unlimited flexibility and expansion.

Two corrections, both in concept and application, followed by an elevation.

Correction

  1. It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).

  2. Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp; that means something else, but that is a minor issue.) The uniqueness of the row is identified by:

       (Device, Metric, DateTime)
    
    • The Id column does nothing; it is totally and completely redundant.

      • An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
      • The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.

      • You can get rid of it. Please.
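
To make the correction concrete, a DDL sketch of the corrected table with the composite Primary Key (the column names follow the answer; the data types and lengths are assumptions):

    CREATE TABLE data_instance (
        device        VARCHAR(30)      NOT NULL,  -- FK to the Device table (parent key type assumed)
        metric        VARCHAR(30)      NOT NULL,  -- FK to the Metric table (parent key type assumed)
        date_time     TIMESTAMP        NOT NULL,  -- DateTime, not "TimeStamp"
        metric_value  DOUBLE PRECISION NOT NULL,
        PRIMARY KEY (device, metric, date_time)   -- the single index; no Id column
    );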

Elevation

  1. Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.

    • (I have one index only, not three; on the Non-SQLs you may need three indices).

    • I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.

      (Server, Device, Metric, DateTime)

    The table can be used to Pivot the data (i.e. Devices across the top and Metrics down the side, or the other way around) using exactly the same SQL code (yes, switch the cells); a sketch of such a pivot appears after this list. I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.

    • Monitor Statistics Data Model.
      (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)

    • It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)

    • Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
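
To illustrate the pivot described above, a sketch using portable CASE aggregation against the corrected table (the device and metric names are invented for the example; the answer's actual code is not shown here):

    -- Metrics across the top, one row per DateTime, for a single device
    SELECT  date_time,
            MAX(CASE WHEN metric = 'cpu_busy' THEN metric_value END) AS cpu_busy,
            MAX(CASE WHEN metric = 'disk_io'  THEN metric_value END) AS disk_io
        FROM  data_instance
        WHERE device = 'server_01'
        GROUP BY date_time
        ORDER BY date_time;

Swapping device and metric in the CASE expressions and the WHERE clause pivots the other way (devices across the top for a single metric).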

One More Thing

Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

Ophelia answered 3/2, 2011 at 9:33 Comment(13)
Which relational database were you using for the graph generation and all? Does this all apply to all types of RDBMS, or are there only a few DBs which support this kind of graphing?Clem
@Ophelia would you use the suggested schema for a setup that has to handle ~3 million measures at a 1-minute frequency? How would you order the PK for such a table? Wouldn't (Device, Metric, DateTime) create fragmentation and force the RDBMS into lots of page splits? Instead, putting DateTime first would reduce fragmentation (I'm assuming time-ordered inserts) but make reads worse.Initiative
@Buchi. I use Sybase ASE. But this is not a platform issue (sure, the high platforms provide performance that is orders of magnitude better than the low end; three orders of magnitude better than Oracle, but that is not the point); erecting the chart from the table "works" on any platform. Use the right tool for the job. The RDBMS is a database tool, not a graphing tool. gnuplot, Apple Numbers (or if you like paying ten times as much, for half as much, MS Excel) are charting tools, not database tools. These days we use layers of tools to produce a result; the monolith is a dinosaur.Ophelia
@Buchi. My "six keystrokes" are reqd to connect the tools. My platforms are Unix or Mac based, my customers are MS based. I give them Unix/Sybase_ASE->Windows/App for Transactions and reports, and Unix/Sybase_ASE->Windows/ODBC->Excel for charts. These particular charts & graphs are Apple Numbers, I use Unix/gnuplot for serious graphs (fragmentation patterns, etc).Ophelia
@marcob. Your question is a good one, but it cannot be answered properly in the comments. If you open a new question, and email me (go to profile), I will answer it. For the quick answer here. (1) ~3 million Metrics. Great, the more the merrier; it spreads the INSERT points beautifully, whereas yours would guarantee conflicts on the last page. The server is multi-threaded, yes ? Partition the table. Use FILLFACTOR and leave space for inserts, and thus avoid page splits. (2) ~3 Mill indicates that the Metrics are not Normalised, if you correct that, it will be faster still.Ophelia
@marcob. (3) I use the given index precisely to spread the inserts under load, which ensures no conflicts. (4) Therefore, my method obtains both inserts with no conflicts and high-performance on SELECTs.Ophelia
@Ophelia really good answer. I hope you still follow this 5-year-old thread! Just 2 questions: i) would you normalize the timestamp by separating out the year? Only 1 year of timestamps would exist, adding the year as a key to have the complete timestamp. ii) how would you model "empty" datetime/device/metric values? Do you store null/0, or not store them at all? I would prefer not to store them, but am wary of complications in querying/graphing.Fideicommissary
This thread is quite old, and by now great solutions exist in the form of time-series databases. Those solutions are more efficient, scale better and show unbeatable performance. None of them support SQL, which is indeed a standard, but it is not as good for querying time-series data as a TSDB.Likeness
@DuarteCarreira. Please pardon the delay. Thanks. (i) Separating the year is not a normalisation-or-not issue. It is breaking 1NF: all data [columns] must be Atomic. All commercial platforms support DATE and DATETIME correctly, and you can query for the component (eg. month or day) that you need to inspect or display, quite easily.Ophelia
(ii.a) Any empty column should be just that: empty. If it is a key column, that means there will not be a row for it. (It would be insane, in a database that stores facts, to store non-facts.) As an absolute rule, I do not allow null to be stored in any of my databases. The code is much, much cleaner than when nulls are stored. Performance is also better. (ii.b) Now if you need to store the fact that xx/yy/zz has a value of zero, then it is a fact (identified by the key), and stored. Again, never null. It has nothing to do with the key having empty components xx//zz.Ophelia
@Loic. Why on earth would anyone, who has an investment (data; code) in an SQL platform, which handles time series data easily and with very high performance (as detailed in the answer), migrate to a TSDB with no SQL; unknown speed for anything except time series data ? Why would anyone who has a requirement that exceeds time-series-data-only, not use an SQL platform ? The mind boggles. TSDB is faster than Relational only in the sad instance when the data is stored in a db but not normalised Relationally. Eg. when Id columns are used, as "keys". As advised by the "theoreticians".Ophelia
@Ophelia Could you elaborate a bit more on the reason you included a server column in the schema of each log?Parvenu
@AlexJ. The data (Device, Metric, DateTime) is related to a single server. Each customer has multiple servers, therefore (Server, Device, Metric, DateTime). I support multiple customers; multiple servers remotely, using a single database. It allows me to (a) load (b) erect the graphs for (c) and therefore compare, the metrics of more than one Server.Ophelia
S
21

I found the above answers very interesting. I'm trying to add a couple more considerations here.

1) Data aging

Time-series management usually needs aging policies. A typical scenario (e.g. monitoring server CPU) requires storing:

  • 1-sec raw samples for a short period (e.g. for 24 hours)

  • 5-min detail aggregate samples for a medium period (e.g. 1 week)

  • 1-hour detail over that (e.g. up to 1 year)

Although relational models can certainly manage this appropriately (my company implemented massive centralized databases for some large customers with tens of thousands of data series; a roll-up sketch for the relational route follows the list below), the new breed of data stores adds interesting functionalities to be explored, like:

  • automated data purging (see Redis' EXPIRE command)

  • multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
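
For the relational route mentioned above, aging raw samples into 5-minute aggregates can be done with a periodic roll-up; a sketch in PostgreSQL syntax, reusing the question's schema (the data_instances_5min aggregate table and its columns are assumptions):

    BEGIN;  -- one transaction, so both statements use the same now() cutoff

    -- roll samples older than 24 hours up into 5-minute averages
    INSERT INTO data_instances_5min (fk_to_device, fk_to_metric, bucket_start, avg_value)
    SELECT  fk_to_device,
            fk_to_metric,
            -- truncate each timestamp to the start of its 5-minute bucket
            to_timestamp(floor(extract(epoch FROM "timestamp") / 300) * 300),
            AVG(metric_value)
        FROM  data_instances
        WHERE "timestamp" < now() - interval '24 hours'
        GROUP BY 1, 2, 3;

    -- then purge the raw rows that have been aggregated
    DELETE FROM data_instances WHERE "timestamp" < now() - interval '24 hours';

    COMMIT;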

2) Real-time collection

Even more importantly, some non-relational data stores are inherently distributed and allow for much more efficient real-time (or near-real-time) data collection, which can be a problem for an RDBMS because of the creation of hotspots (managing indexing while inserting into a single table). In the RDBMS space this problem is typically solved by reverting to batch import procedures (we managed it this way in the past), while NoSQL technologies have succeeded at massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
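
As a concrete illustration of the batch-import route just mentioned, the usual RDBMS approach is to bulk-load a file of collected samples instead of inserting row by row; a PostgreSQL sketch (the CSV file path is invented for the example):

    -- load a whole batch of polled samples in one pass instead of per-row INSERTs
    COPY data_instances (fk_to_device, fk_to_metric, metric_value, "timestamp")
        FROM '/var/spool/poller/batch.csv' WITH (FORMAT csv);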

Searchlight answered 20/3, 2011 at 13:18 Comment(0)
I
7

Your design keeps all the data in a single table, so relational vs non-relational is not really the question. Basically you need to read a lot of sequential data. If you have enough RAM to store a year's worth of data, then there is nothing like using Redis/MongoDB etc.

Mostly, NoSQL databases will store your data in the same location on disk and in compressed form to avoid multiple disk accesses.

NoSQL does the same thing as creating an index on device id and metric id, but in its own way. With a relational database, even if you do this, the index and the data may be in different places and there would be a lot of disk I/O.

Tools like Splunk use NoSQL backends to store time-series data and then use map-reduce to create aggregates (which might be what you want later). So in my opinion NoSQL is an option, as people have already tried it for similar use cases. But will a million rows bring the database to a crawl? Maybe not, with decent hardware and proper configuration.

Infectious answered 27/1, 2011 at 12:53 Comment(2)
Could you explain how the table is "de-normalised" ? Marcus does have an error in the table, but it is not a normalisation error.Ophelia
I will correct myself: the tables are normalised in the traditional sense. I meant de-normalised in the sense that the use case has all the data in one table here.Infectious
E
4

Create a file and name it 1_2.data (device 1, metric 2). Weird idea? Here is what you get:

  • You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
  • You save even more space because you don't need any indices.
  • Save pairs of (timestamp, metric_value) to the file by appending the data, so you get ordering by timestamp for free (assuming that your sources don't send out-of-order data for a device).

=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.

If you want to optimise even further, start thinking about splitting your files like this:

  • 1_2_january2014.data
  • 1_2_february2014.data
  • 1_2_march2014.data

Or use kdb+ from http://kx.com, because they do all this for you :) Column-oriented storage is what may help you.

There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

Emplacement answered 26/9, 2014 at 12:59 Comment(1)
I wrote a blog post about the topic; with Google Translate you may find it helpful: blog.michaelwittig.info/die-spaltenorientierte-datenbank-kdbEmplacement
Z
4

You should look into a time-series database. They were created for exactly this purpose.

A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).

A popular example of a time-series database is InfluxDB.

Zahn answered 14/7, 2017 at 19:14 Comment(2)
add timescaledb to this list nowNonconcurrence
QuestDB and GridDB are also great options for time-series data.Mustache
R
2

I think that the answer to this kind of question should mainly revolve around the way your database utilizes storage. Some database servers use RAM and disk, some use RAM only (optionally with disk for persistence), etc. The most common SQL database solutions use memory+disk storage and write the data in a row-based layout (every inserted row is written in the same physical location). For time-series stores, in most cases the workload looks like massive numbers of inserts arriving at a relatively short interval, while reads are column-based (in most cases you want to read a range of data from a specific column, representing a metric).

I have found that columnar databases (Google it; you'll find MonetDB, InfoBright, ParAccel, etc.) do a terrific job for time series.

As for your question (which I personally think is somewhat invalid, as are all discussions using the faulty term NoSQL, IMO): you can use a database server that can talk SQL on the one hand, making your life very easy since everyone has known SQL for many years and the language has been perfected over and over again for data queries, but that still utilizes RAM, CPU cache and disk in a column-oriented way, making your solution a best fit for time series.

Romain answered 16/8, 2013 at 19:46 Comment(0)
P
2

5 million rows is nothing for today's torrential data. Expect the data to be in the TB or PB range in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSQL databases. Performance is achieved through the columnar partitioning used to store the data: a more-columns, fewer-rows kind of concept to boost performance. Leverage the OpenTSDB work done on top of HBase or MapR-DB, etc.

Pavkovic answered 30/5, 2015 at 13:17 Comment(1)
"RDBMS do not scale to the task" - why wouldn't they? code.facebook.com/posts/190251048047090/…Ostracon
M
1

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

Mummy answered 27/1, 2011 at 11:11 Comment(2)
Yes, Zabbix is nice and already integrates with SNMP monitoring. Zabbix can use MySQL or PostgreSQL and works more or less out of the box on Ubuntu.Ghirlandaio
Thanks, I have knowledge of Zabbix and a lot of other SNMP tools. However, I am developing this project as an educational exercise, both in the topic discussed here and in many other aspects. A good point though!Pontine
