How to efficiently store and query a billion rows of sensor data

Situation: I've started a new job and been assigned the task of figuring out what to do with their sensor data table. It has 1.3 billion rows of sensor data. The data is pretty simple: basically just a sensor ID, a date and the sensor value at that point in time (double).

Currently, the data is stored in a table in an MS SQL Server database.

By the end of this year, I expect the number of rows to have increased to 2-3 billion.

I'm looking for a better way to store and query this data (by date), and since there are a lot of "big data" products out there, and I have no real experience managing such big data sets, I'm asking here for any pointers.

It's not a large company, and our resources are not unlimited ;)

Some more details about our use case:

  • The data is plotted in graphs and shows sensor values over time.
  • We're planning to create an API to let our customers fetch sensor data for any time period that they are interested in (... data from 2 years back is just as relevant as the last month's data).

My research so far has led me to consider the following solutions:

  1. Keep the data in SQL Server

    but partition the table (it's not partitioned right now). That will require the Enterprise edition of SQL Server, which costs a lot.

  2. Move the data to Azure SQL Server.

    There we'll get the partitioning feature for a lot less money, but once our DB grows above 250 GB it costs a lot more (and way too much above 500 GB).

  4. Use several databases

    We could use one DB per customer. Several smaller DBs will be cheaper than one huge DB, but we've got a lot of customers and plans for more, so I don't really like the idea of managing all these databases.

  4. Azure Storage Tables

    This is the option I like best so far. We can partition the data by company/sensor/year/month, use the date for row key and store the sensor value.

    I haven't yet had time to test query performance, but from what I've read it should be good. There is one major downside though: the limit of 1000 items returned per HTTP request. If we need to fetch all sensor data for a week, we'd need to make a lot of HTTP requests. I'm not sure yet how big an issue this is for our use case (see the key-design sketch after this list).

  5. Azure HDInsight (Hadoop in Azure)

    As mentioned, I have no experience with big data, and I currently don't understand Hadoop well enough to know whether it fits our case (exposing sensor data, for a given timespan, through an API). Should I dig deeper and learn, or is my time better spent pursuing another alternative?
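
To make option 4 a bit more concrete, here is a minimal sketch of the key design I have in mind, written against the azure-data-tables Python package. The table name, entity layout and key format are just illustrations, not something we've settled on:

```python
# Hypothetical key design for Azure Table Storage (option 4), using the
# azure-data-tables Python SDK. Table name, connection string and field
# names are placeholders.
from datetime import datetime
from azure.data.tables import TableClient

conn_str = "<storage-account-connection-string>"  # placeholder
client = TableClient.from_connection_string(conn_str, table_name="SensorData")

def make_keys(customer_id: str, sensor_id: str, ts: datetime):
    # One partition per customer/sensor/month keeps partitions small,
    # while the row key (sortable timestamp string) orders rows by time.
    partition_key = f"{customer_id}-{sensor_id}-{ts:%Y%m}"
    row_key = ts.strftime("%Y%m%dT%H%M%S")
    return partition_key, row_key

def write_sample(customer_id, sensor_id, ts, value):
    pk, rk = make_keys(customer_id, sensor_id, ts)
    client.create_entity({"PartitionKey": pk, "RowKey": rk, "Value": value})

def read_range(customer_id, sensor_id, start: datetime, end: datetime):
    # Range query within a single month's partition (assumes start and
    # end fall in the same month); the SDK pages through the
    # 1000-items-per-request limit for us.
    pk, _ = make_keys(customer_id, sensor_id, start)
    flt = (f"PartitionKey eq '{pk}' and RowKey ge '{start:%Y%m%dT%H%M%S}'"
           f" and RowKey lt '{end:%Y%m%dT%H%M%S}'")
    return list(client.query_entities(flt))
```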

Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution could be preferred over a very complex one, even if the complex one is a few seconds faster.

UPDATE 1: To answer some of the questions in comments below.

  • There are roughly 12,000 sensors, each of which can potentially report a value every 15 seconds. That translates to ~70 million values a day. In reality, not all of these sensors have "reporting" turned on, so we're not getting that much data each day, but since we naturally want to expand with more customers and sensors, I really need a solution that can scale up to many millions of sensor values a day.
  • Partitioning is a solution, and using several databases and/or several tables is something I have thought of, yes, but I see this as a fallback if/when I've exhausted other solutions.
  • I've read some more about HBase, http://opentsdb.net/ and google's https://cloud.google.com/bigtable/ and it seems like Hadoop could be a real alternative at least.

UPDATE 2: Today I experimented a bit with both Azure Table Storage and HDInsight (HDI). We don't require much query "flexibility", so I think Azure Table Storage looks really promising. It's a bit slow to pull out data due to the 1000-item limit per request, as I mentioned, but in my tests it's fast enough for our use cases.

I've also stumbled upon OpenTSDB, which is what led me to try out HDI in the first place. Following a tutorial on Azure (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/), I was able to store a million records fairly quickly and test some queries. It was a lot faster to query than Azure Table Storage. I could even pull down 300,000 records in one HTTP request (it took 30 seconds though).
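
For reference, this is roughly what a time-range query against OpenTSDB's HTTP API (the /api/query endpoint in OpenTSDB 2.x) looks like. This is only a sketch; the host, metric name and tag name are made up:

```python
# Hedged sketch of a time-range read from OpenTSDB over HTTP.
import requests

resp = requests.get(
    "http://my-opentsdb-host:4242/api/query",   # host is a placeholder
    params={
        "start": "2016/01/01-00:00:00",
        "end": "2016/01/08-00:00:00",
        "m": "avg:sensor.value{sensor_id=1234}",  # aggregator:metric{tags}
    },
    timeout=60,
)
resp.raise_for_status()
for series in resp.json():
    # "dps" maps Unix timestamps (as strings) to values
    print(series["metric"], series["tags"], len(series["dps"]), "points")
```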

But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a more fine-grained partition key and running requests in parallel; see the sketch below). So right now I'm leaning towards Azure Table Storage due to its simplicity, price and "good enough" performance.
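
As a rough illustration of the "finer-grained partition keys plus parallel requests" idea, here is a sketch under the same assumptions as the earlier one; it reuses that sketch's hypothetical read_range helper:

```python
# Hedged sketch: split a long time span into per-month sub-ranges
# (one Table Storage partition each) and fetch them concurrently.
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

def month_slices(start: datetime, end: datetime):
    """Yield (slice_start, slice_end) pairs, one per calendar month."""
    cur = start
    while cur < end:
        if cur.month == 12:
            nxt = cur.replace(year=cur.year + 1, month=1, day=1,
                              hour=0, minute=0, second=0, microsecond=0)
        else:
            nxt = cur.replace(month=cur.month + 1, day=1,
                              hour=0, minute=0, second=0, microsecond=0)
        yield cur, min(nxt, end)
        cur = nxt

def read_range_parallel(customer_id, sensor_id, start, end, workers=8):
    # Fan out one request per month partition, then flatten the results.
    slices = list(month_slices(start, end))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda s: read_range(customer_id, sensor_id, s[0], s[1]), slices)
    return [entity for chunk in results for entity in chunk]
```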

I'm going to present my findings to an external consultant soon, so I'm excited to learn his view on things as well.

Greathearted answered 10/1, 2016 at 18:31 Comment(7)
Before attempting anything else, I would read about partitioned tables in SQL Server. Or, if you intend to store the data across multiple servers, read about partitioned views (see the section "Partitioned Views").Farmann
You mention customers... If you have the sensor data in one big table without a customer ID, how is the customer bound to it? Is there a mapping with the sensor? Why I'm asking: I guess that your queries will not go across all customers but always against the data of one specific customer - right? If yes: how many rows are there for each customer? You might think of one table per customer, all with the same structure, indexes, constraints... This would need one TVF with dynamic SQL; the rest could stay pretty much the same...Peppard
Also, if you regularly require a standard set of aggregates to be reported, research Indexed Views which will entirely manage the process of caching, in a separate index, various predefined aggregates.Mentality
Is there any kind of logical grouping of the data? Are sensors in groups? are time periods in groups? You might be able to store all of this in a cube. You need to understand more thoroughly how the data will be reported before you can apply any smarts.Chukar
Some grouping can be done yes: customers have several locations and each location has several sensors. Each sensor has a unique ID, which is mapped to a location and the location to a customer. But there's no further grouping than that. Also, as mentioned, we intend to provide this data through an API for our customers, so that they can decide what they'd like to use the data for. So we really don't know what sort of timespan they want to query for ahead of time (although we'll probably implement some sort of limit on the timespan per request). I'll read up on cubes. ThanksGreathearted
If they want a very low level of detail (i.e. every second for a day) then cubes are probably not the answer. If it is definitely split by customer then that is something you can work with by partitioning the data on customer and probably clustering on the time period. This is a relational-DB-centric approach.Chukar
github.com/hhblaze/DBreezeBelga

So you're going to have 3 billion records by the end of this year (which has only just begun). Each record is 4 bytes for the ID + 4 bytes for the datetime + 8 bytes for the double value, which totals 3*10^9 * (4+4+8) bytes = 48 GB.

You can easily store and process those 48 GB in an in-memory database like Redis, Couchbase, Tarantool or Aerospike. All of them are open source, so you don't need to pay a license fee.

There might be some additional memory-consumption overhead of 10-30%, so the 48 GB can grow to 64 GB or slightly more. You should feed those databases with your real data to pick the most economical one for your case.

A single physical machine should be enough for the whole workload, because in-memory databases are able to handle 100K-1M queries/updates per second per node (the real number depends on your specific workload pattern). For the sake of better availability I would set up two servers: a master and a slave.

The price of a physical server with 64 GB on board is, in my experience, $2-3K. Note that you don't even need an SSD. A spinning disk should be fine, because all the reads hit RAM and all the writes only append to the transaction log. This is how in-memory databases work. I can elaborate on this if you have any questions.
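
To illustrate, here is a minimal sketch of how the time series could be laid out in Redis using one sorted set per sensor (via redis-py); the key naming is just an example:

```python
# Hedged sketch of the in-memory approach with Redis sorted sets.
# Score = Unix timestamp, so time-range reads become ZRANGEBYSCORE.
import redis

r = redis.Redis(host="localhost", port=6379)  # host is a placeholder

def store_sample(sensor_id: int, ts: int, value: float):
    # Encode the value into the member so equal values at different
    # times stay distinct; the score is the timestamp for range queries.
    r.zadd(f"sensor:{sensor_id}", {f"{ts}:{value}": ts})

def read_samples(sensor_id: int, ts_from: int, ts_to: int):
    members = r.zrangebyscore(f"sensor:{sensor_id}", ts_from, ts_to)
    out = []
    for m in members:
        ts_str, val_str = m.decode().split(":", 1)
        out.append((int(ts_str), float(val_str)))
    return out
```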

Himyarite answered 17/1, 2016 at 17:29 Comment(4)
Thanks, I'll look into this a bit more as I haven't considered in-memory DBs yet. Although keeping data for several years and being able to query the historic data is part of the business model, so the data will just keep on growing in size.Greathearted
You're welcome :) The data will keep on growing in size but the price of memory will keep on falling in dollars.Himyarite
Could one not put an in-memory database in front of a standard database / table store?Bendy
Aerospike's free tier is limited to 4 billion records now, I think - I forget, I last checked maybe 2 years ago? aerospike.com/products/product-matrix/…Interne

So, I have used all the technologies you listed in some manner or another. What kind of queries do you need to perform? Depending on that, you can rule out some of the solutions. If you don't need to query in a lot of different ways, Table Storage could work out well for you. It's going to scale really well if you follow the guidelines, and it's cheap. But if you can't just do a point query for the data you need, then it might not work so well, or it may be too complicated to be a good option.

OpenTSDB is great if you want a time series database, but it will limit you to time-series-type queries. There are a lot of time series DBs out there, and a lot of applications built on top of them, such as Bosun and Grafana, to list two that I use.

For the last option, HDI, I would store the data in Parquet format (or some other columnar format), create a Hive table on top of the data and query it with Spark SQL. Really, you don't need to use Spark; you could use Hive as well. But what you should stay away from is traditional MapReduce: that paradigm is basically dead nowadays, and you should not write new code in it. On top of that, if you don't know it, there is a steep learning curve around it.

I use all of these technologies; we use them for different parts of our system, and the choice really depends on the read and write requirements of the application. I would look at using Spark and Parquet if I were you, but it's a lot of new tooling that might not be needed.
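
Here is a rough sketch of the Parquet + Spark SQL idea in PySpark; the paths, column names and storage URIs are assumptions, not a prescription:

```python
# Hedged sketch: land sensor readings as partitioned Parquet, then
# query them with Spark SQL. Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-query").getOrCreate()

# Write: store the raw readings as Parquet, partitioned by sensor and
# month so a time-range query only touches the relevant files.
readings = spark.read.csv("wasb:///raw/sensor_readings.csv",
                          header=True, inferSchema=True)
(readings
 .write
 .partitionBy("sensor_id", "year_month")   # assumed column names
 .mode("append")
 .parquet("wasb:///warehouse/sensor_readings"))

# Read: expose the Parquet data as a table and query it with Spark SQL.
spark.read.parquet("wasb:///warehouse/sensor_readings") \
     .createOrReplaceTempView("sensor_readings")

df = spark.sql("""
    SELECT ts, value
    FROM sensor_readings
    WHERE sensor_id = 1234
      AND ts BETWEEN '2016-01-01' AND '2016-01-08'
    ORDER BY ts
""")
df.show()
```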

Dolorous answered 13/1, 2016 at 1:15 Comment(0)

3 billion data points per year is a pretty low number for modern time series databases such as VictoriaMetrics. It can persist this amount of data points in less than 3 minutes, at an ingestion speed of 19 million samples per second, on a computer with 64 vCPUs. See this article for details.

There are VictoriaMetrics production setups with up to 10 trillion data points per single node. And it scales to multiple nodes.
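
As a hedged sketch, this is roughly how samples could be pushed into a single-node VictoriaMetrics instance via its InfluxDB line-protocol endpoint and read back through its Prometheus-compatible query API; the host, metric and label names are made up:

```python
# Hedged sketch of writing/reading sensor samples with VictoriaMetrics.
import requests

VM = "http://localhost:8428"  # single-node VictoriaMetrics, default port

def write_samples(samples):
    # samples: iterable of (sensor_id, unix_ts_seconds, value).
    # Influx line protocol timestamps are in nanoseconds by default.
    lines = "\n".join(
        f"sensor_value,sensor_id={sid} value={val} {ts * 10**9}"
        for sid, ts, val in samples)
    requests.post(f"{VM}/write", data=lines, timeout=30).raise_for_status()

def read_range(sensor_id, start_ts, end_ts, step="15s"):
    # By default the metric name becomes measurement_field, hence
    # "sensor_value_value" here (an assumption about the default mapping).
    resp = requests.get(f"{VM}/api/v1/query_range", params={
        "query": f'sensor_value_value{{sensor_id="{sensor_id}"}}',
        "start": start_ts, "end": end_ts, "step": step,
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"]["result"]
```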

Haemic answered 12/10, 2019 at 21:40 Comment(0)
