What is the best open source solution for storing time series data? [closed]

I am interested in monitoring some objects. I expect to get about 10,000 data points every 15 minutes. (Maybe not at first, but this is the general ballpark.) I would also like to be able to get daily, weekly, monthly and yearly statistics. It is not critical to keep the data at the highest resolution (15 minutes) for more than two months.

I am considering various ways to store this data, and have been looking at a classic relational database, or at a schemaless database (such as SimpleDB).

My question is: what is the best way to go about doing this? I would very much prefer an open-source (and free) solution to a proprietary, costly one.

Small note: I am writing this application in Python.

Domitiladomonic answered 26/8, 2009 at 13:47 Comment(1)
You are probably looking for some sort of binning solution. You may find the discussion in this related question helpful: #1249315Alfreda

HDF5, which can be accessed through h5py or PyTables, is designed for dealing with very large data sets. Both interfaces work well. For example, both h5py and PyTables offer automatic compression and support NumPy.
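For example, a minimal sketch with h5py might look like the following (the file name, dataset name and flat layout are just placeholders for illustration):

import h5py
import numpy as np

# Open (or create) a file and append one 15-minute batch of samples to a
# compressed, resizable dataset. Gzip compression is built into HDF5.
with h5py.File("monitoring.h5", "a") as f:
    batch = np.random.random(10000)  # stand-in for one batch of readings
    if "readings" not in f:
        dset = f.create_dataset("readings", shape=(0,), maxshape=(None,),
                                dtype="f8", compression="gzip")
    else:
        dset = f["readings"]
    dset.resize(dset.shape[0] + batch.shape[0], axis=0)
    dset[-batch.shape[0]:] = batch  # write the new batch at the end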

Spirelet answered 26/8, 2009 at 14:16 Comment(1)
This seems very interesting, I'll check it out.Domitiladomonic

RRDTool by Tobi Oetiker, definitely! It's open source, and it was designed for exactly this kind of use case.

EDIT:

To provide a few highlights: RRDTool stores time-series data in a round-robin database. It keeps raw data for a given period of time, then condenses it in a configurable way, so you can have fine-grained data for a month, data averaged over a week for the last 6 months, and data averaged over a month for the last 2 years. As a side effect, your database stays the same size all the time (so there's no sweating over whether your disk may run full). That was the storage side.

On the retrieval side, RRDTool offers data queries that are immediately turned into graphs (e.g. PNG) that you can readily include in documents and web pages. It's a rock-solid, proven solution, and a much more generalized form of its predecessor, MRTG (some might have heard of that). And once you've got into it, you will find yourself re-using it over and over again.
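As a rough sketch, creating an RRD for the retention scheme described in the question via the Python bindings might look like this (one RRD per monitored object; the data-source name, gauge type and heartbeat here are assumptions, not prescriptions):

import rrdtool

# 15-minute step; raw data kept for ~2 months, then daily averages for ~2 years.
rrdtool.create(
    "object.rrd",
    "--step", "900",
    "DS:value:GAUGE:1800:U:U",   # one gauge, heartbeat = 2x the step
    "RRA:AVERAGE:0.5:1:5760",    # 15-min resolution for ~60 days
    "RRA:AVERAGE:0.5:96:730",    # daily averages (96 steps) for ~2 years
)

rrdtool.update("object.rrd", "N:42.0")  # store one sample timestamped "now"
data = rrdtool.fetch("object.rrd", "AVERAGE", "--start", "-1d")  # last day's stats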

For a quick overview of RRDTool and who uses it, see also here. If you want to see which kinds of graphs you can produce, make sure you have a look at the gallery.

Goodlooking answered 26/8, 2009 at 14:29 Comment(5)
I was aware of RRDTool, it's good to have another "vote" to it. I will look into it more deeply. As an aside, do you know if you can interface with it in Python?Domitiladomonic
@Domitiladomonic I haven't tried it myself, but the docs explicitly list Python bindings (oss.oetiker.ch/rrdtool/prog/rrdpython.en.html)Goodlooking
It has Python bindings, but last time I looked (long ago) they didn't work great. I ended up just wrapping the CLI with subprocess calls, like this class does: code.google.com/p/perfmetrics/source/browse/trunk/lib/rrd.pyUnrig
@Corey Right, that's how I've used RRDtool, and it's quite natural to do so.Goodlooking
You might look at PyRRD - it's not 100% amazing, but does a good job with the basics.Telles

Plain text files? It's not clear what your 10k data points per 15 minutes translates to in terms of bytes, but in any case text files are easier to store/archive/transfer/manipulate, and you can inspect them directly, just by looking at them. They're fairly easy to work with from Python, too.
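A minimal sketch of the idea (the one-file-per-day naming and tab-separated layout are just one possible choice):

import time

# One file per day: old days are trivial to archive, compress, or delete
# once the two-month high-resolution window has passed.
def store_batch(samples):
    """samples: iterable of (object_id, value) pairs from one 15-minute run."""
    now = int(time.time())
    day = time.strftime("%Y-%m-%d", time.localtime(now))
    with open(f"data-{day}.txt", "a") as f:
        for object_id, value in samples:
            f.write(f"{now}\t{object_id}\t{value}\n")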

Ebonyeboracum answered 26/8, 2009 at 14:31 Comment(0)

This is pretty standard data-warehousing stuff.

Lots of "facts", organized by a number of dimensions, one of which is time. Lots of aggregation.

In many cases, simple flat files that you process with simple aggregation algorithms based on defaultdict will work wonders -- fast and simple.
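For instance, a sketch assuming flat files of tab-separated "timestamp, object, value" lines (the file name and layout are illustrative assumptions):

from collections import defaultdict
import time

# Roll 15-minute samples up into per-object daily averages.
sums = defaultdict(float)
counts = defaultdict(int)
with open("facts.txt") as f:
    for line in f:
        ts, object_id, value = line.rstrip("\n").split("\t")
        day = time.strftime("%Y-%m-%d", time.localtime(int(ts)))
        sums[object_id, day] += float(value)
        counts[object_id, day] += 1

for key in sorted(sums):
    print(key, sums[key] / counts[key])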

Look at Efficiently storing 7.300.000.000 rows

Database choice for large data volume?

Kearney answered 26/8, 2009 at 16:52 Comment(0)

There is an open source time-series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.

https://code.google.com/p/timeseriesdb/

// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
   file.UniqueIndexes = true; // enforces index uniqueness
   file.InitializeNewFile(); // create file and write header
   file.AppendData(data); // append data (stream of ArraySegment<>)
}

// Read the needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, a maximum of 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
Mandrake answered 13/3, 2012 at 14:22 Comment(0)
