Situation: I've started a new job and been assigned the task of figuring out what to do with their sensor data table. It has 1.3 billion rows of sensor data. The data is pretty simple: basically just a sensor ID, a date and the sensor value at that point in time (double).
Currently, the data is stored in a table in an MS SQL Server database.
By the end of this year, I expect the number of rows to have increased to 2-3 billion.
I'm looking for a better way to store and query this data (by date), and since there are a lot of "big data" products out there, and I have no real experience managing such big data sets, I'm asking here for any pointers.
It's not a large company, and our resources are not unlimited ;)
Some more details about our use case:
- The data is plotted in graphs and shows sensor values over time.
- We're planning to create an API to let our customers fetch sensor data for any time period that they are interested in (... data from 2 years back is just as relevant as the last month's data).
My research so far has led me to consider the following solutions:
Keep the data in SQL Server
but partition the table (it's not partitioned right now). That will require the Enterprise edition of SQL Server, which costs a lot.
Move the data to Azure SQL Database.
There we'll get the partitioning feature for a lot less money, but once our DB grows above 250 GB it costs a lot more (and way too much above 500 GB).
Use several Databases
We could use 1 DB per customer. Several smaller DBs will be cheaper than 1 huge DB, but we've got a lot of customers and plans for more, so I don't really like the thought of managing all these databases.
Azure Storage Tables
This is the option I like best so far. We can partition the data by company/sensor/year/month, use the date for row key and store the sensor value.
I haven't yet had time to test query performance, but from what I read it should be good. There is one major downside, though: the limit of 1000 items returned per HTTP request. If we need to fetch all sensor data for a week, we need to make a lot of HTTP requests. I'm not sure right now how big an issue this is for our use case.
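To make the partition/row key idea concrete, here's a minimal sketch of how the keys could be built. The naming scheme (underscore-joined company/sensor/year-month, ISO-style timestamp row key) is just an illustration of the approach, not a fixed schema:

```python
from datetime import datetime, timezone

def make_keys(company: str, sensor_id: str, ts: datetime):
    """Build illustrative Azure Table Storage keys for one sensor reading.

    Partitioning by company/sensor/year/month keeps each partition small,
    and a zero-padded timestamp row key sorts lexicographically in
    chronological order, so range queries by date become key-range scans.
    """
    partition_key = f"{company}_{sensor_id}_{ts:%Y%m}"
    row_key = ts.strftime("%Y%m%dT%H%M%S")
    return partition_key, row_key

pk, rk = make_keys("acme", "s0042",
                   datetime(2015, 6, 1, 12, 30, 15, tzinfo=timezone.utc))
# pk == "acme_s0042_201506", rk == "20150601T123015"
```

The sensor value itself would then just be a regular property on the entity.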
Azure HDInsight (Hadoop in Azure)
As mentioned, I have no experience with big data, and I don't yet understand Hadoop well enough to know whether it fits our case (exposing sensor data, for a given timespan, through an API). Should I dig deeper and learn, or is my time better spent pursuing another alternative?
Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution could be preferable to a very complex one even if the complex one is a few seconds faster.
UPDATE 1: To answer some of the questions in comments below.
- There are roughly 12 000 sensors, each of which can potentially report a value every 15 seconds. That translates to ~70 million values a day. In reality, not all of these sensors have "reporting" turned on, so we're not getting that much data each day, but since we naturally want to expand with more customers and sensors, I really need a solution that can scale up to many millions of sensor values a day.
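For reference, the back-of-the-envelope math behind the ~70 million/day figure:

```python
sensors = 12_000
interval_s = 15                                   # one reading every 15 seconds
per_sensor_per_day = 24 * 60 * 60 // interval_s   # 5,760 readings per sensor
per_day = sensors * per_sensor_per_day            # 69,120,000 -> "~70 million"
print(per_day)
```

So the 2-3 billion row projection for year-end corresponds to only a fraction of the sensors actually reporting at full rate.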
- Partitioning is a solution, and using several databases and/or several tables is something I have thought of, yes, but I see this as a fallback if/when I've exhausted other options.
- I've read some more about HBase, http://opentsdb.net/ and google's https://cloud.google.com/bigtable/ and it seems like Hadoop could be a real alternative at least.
UPDATE 2: Today I experimented a bit with both Azure Table Storage and HDInsight (HDI). We don't require much query "flexibility", so I think Azure Table Storage looks really promising. It's a bit slow to pull out data due to the 1000 item limit per request, as I mentioned, but in my tests I think it's fast enough for our use cases.
I've also stumbled upon OpenTSDB, which is what led me to try out HDI in the first place. Following a tutorial on Azure (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/) I was able to store a million records and test some queries pretty quickly. It was a lot faster to query than Azure Table Storage. I could even pull down 300 000 records in one HTTP request (it took 30 seconds, though).
But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a more fine-grained partition key and running requests in parallel). So right now I'm leaning towards Azure Table Storage due to its simplicity, price and "good enough" performance.
I'm going to present my findings to an external consultant soon, so I'm excited to learn his view on things as well.