Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25 TB worth of log files annually; currently they are stored on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others reside in plain text format.

I am looking for alternatives to using an NFS-based system. I have looked at MongoDB and CouchDB; the fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, something I am not willing to do. I need to retain the log file content as is.

As for usage, we intend to put a small REST API on top and allow people to get a file listing, the latest files, and the ability to retrieve a file.

The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally and effectively by adding more machines.
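
For illustration only, here is a minimal sketch of the kind of read-only REST API described above, assuming Flask and a single flat directory of log files; the root path, port, and route names are hypothetical, not part of the actual setup.

```python
import os
from flask import Flask, jsonify, send_from_directory

LOG_ROOT = "/mnt/logs"   # hypothetical storage root (NFS mount, Gluster mount, ...)
app = Flask(__name__)

def _entries():
    """Yield name, mtime and size for every file directly under LOG_ROOT."""
    for name in os.listdir(LOG_ROOT):
        path = os.path.join(LOG_ROOT, name)
        if os.path.isfile(path):
            st = os.stat(path)
            yield {"name": name, "mtime": st.st_mtime, "size": st.st_size}

@app.route("/logs")
def list_logs():
    # Full listing, newest first
    return jsonify(files=sorted(_entries(), key=lambda e: e["mtime"], reverse=True))

@app.route("/logs/latest")
def latest_logs():
    # The 20 most recently modified files
    return jsonify(files=sorted(_entries(), key=lambda e: e["mtime"], reverse=True)[:20])

@app.route("/logs/<path:name>")
def get_log(name):
    # send_from_directory refuses paths that escape LOG_ROOT
    return send_from_directory(LOG_ROOT, name, as_attachment=True)

if __name__ == "__main__":
    app.run(port=8080)
```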

Ankur

Building answered 9/10, 2010 at 5:36 Comment(11)
Just to do the math: that's 500 GB/week or 100 GB each business day. – Whitman
@Whitman Thanks for the math. We already have a year's worth of data. @chaos These log files come from storage arrays installed globally. – Building
@Ankur, would a JSON format work for you if it had one object per log message, with one of the object's properties being the original log message and the others being queryable fields extracted from that log message? It increases the data storage requirements, but would allow MongoDB and CouchDB to be considered. (A small sketch of this wrapping appears after these comments.) – Corrade
@jim, what an idea! I didn't think of that, thanks. I think it does make CouchDB and MongoDB contenders. I don't want to query the log files, only store them and provide a REST API on top. – Building
Take a look at Vertica too; it seems to be quite good at this sort of thing. – Corrade
So, all you need to do is store files and retrieve them by file name? How is a file system not suited to that task? – Vanadium
@Vanadium It's currently on top of NFS, as I mentioned. I am looking for something better: faster seek time, automatic compression. Yes, I can always write code to do these; is there an off-the-shelf product for this? Like Jim mentioned above, I could also put MongoDB etc. to use. So I'm just learning what my options are. – Building
@Ankur Gupta: My thought was that if you're just storing and retrieving files (and printing the file list), a database is not the best solution. File systems are exactly what you need, so that's what I'd suggest looking into. If listing the files takes too long, break them into several folders (perhaps one per week or per month). – Vanadium
It seems to me that all that is needed is a smart folder structure with automatically generated subfolders to prevent too many files in one folder, and a little bit of code for compression and decompression. AFAIK MongoDB and CouchDB don't support compression and decompression. – Prado
MongoDB works with memory-mapped files. You can't store more data than the virtual address space you have available. Keep in mind that most 64-bit machines only support 48 bits of virtual address space, so you'll run out when you hit about 281 TB :-) – Handclasp
Have you considered Logstash? It's an open-source log collector which can store logs in a distributed ElasticSearch cluster, which should be able to scale horizontally. – Megillah
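
To illustrate the wrapping Corrade suggests above, here is a tiny sketch; the sample log line and field names are invented.

```python
import json

# Invented sample line from a storage array log
raw = "2010-10-09T05:36:12 array42 disk3 WARN latency threshold exceeded"

doc = {
    "raw": raw,                      # the original log line, kept verbatim
    "ts": "2010-10-09T05:36:12",     # extracted, queryable fields
    "host": "array42",
    "severity": "WARN",
}
print(json.dumps(doc))               # this is what would be stored in MongoDB/CouchDB
```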

Since you don't want querying features, you can use Apache Hadoop.

I believe HDFS and HBase will be a nice fit for this.

You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
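
As an illustration only, here is a rough sketch of keeping an index of log-file metadata in HBase through the happybase Thrift client, while the files themselves live in HDFS; the host, table name, column family, and paths are all hypothetical, and the table is assumed to already exist.

```python
import happybase

# Hypothetical HBase Thrift gateway and a pre-created table 'logfiles'
# with a single column family 'f'
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("logfiles")

# Row key: date + file name, so a prefix scan returns one day's files
row_key = b"20101009-array42.log.gz"
table.put(row_key, {
    b"f:hdfs_path": b"/logs/2010/10/09/array42.log.gz",
    b"f:size_bytes": b"104857600",
    b"f:format": b"gzip",
})

# List the files indexed for a given day
for key, data in table.scan(row_prefix=b"20101009-"):
    print(key, data[b"f:hdfs_path"])
```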

Incorporated answered 11/10, 2010 at 7:28 Comment(2)
Look at the Flume connector for Hadoop. Hadoop has a lot of plugins for managing large amounts of data. – Fine
@Incorporated What if you want querying features? – Drafty

Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.

Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
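
Purely as an illustration of that kind of querying, here is a sketch using the vertica_python driver; the connection details and the log_events table and its columns are hypothetical.

```python
import vertica_python

conn_info = {
    "host": "vertica.example.com",   # hypothetical cluster node
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "logs",
}

# Hypothetical table: log_events(ts TIMESTAMP, host VARCHAR, severity VARCHAR, message VARCHAR)
query = """
    SELECT host, COUNT(*) AS errors
    FROM log_events
    WHERE severity = 'ERROR' AND ts > NOW() - INTERVAL '1 hour'
    GROUP BY host
    ORDER BY errors DESC
    LIMIT 10
"""

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
cur.execute(query)
for host, errors in cur.fetchall():
    print(host, errors)
conn.close()
```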

Corrade answered 9/10, 2010 at 6:40 Comment(0)

Have you tried looking at Gluster? It is scalable, provides replication, and has many other features. It also gives you standard file operations, so there is no need to implement another API layer.

http://www.gluster.org/
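
Since a mounted Gluster volume behaves like any local filesystem, plain file I/O is the whole "API"; a small sketch, assuming the volume is mounted at a hypothetical /mnt/glusterlogs.

```python
import gzip
import os
import shutil

MOUNT = "/mnt/glusterlogs"   # hypothetical GlusterFS mount point

def store(local_path):
    """Copy a log file onto the volume, compressing it on the way in."""
    dest = os.path.join(MOUNT, "2010", "10", os.path.basename(local_path) + ".gz")
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(local_path, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest

def listing():
    """Listing is just os.walk on the mount; no extra API needed."""
    for root, _dirs, files in os.walk(MOUNT):
        for name in files:
            yield os.path.join(root, name)
```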

Coalfield answered 12/10, 2010 at 18:24 Comment(1)
Forgot to mention that it is open source as well. – Coalfield

I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.

Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
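
A rough sketch of that approach using the HdfsCLI WebHDFS client (the Python "hdfs" package); the NameNode address, user, and paths are placeholders, and the 64 MB block size and 3x replication are just the values suggested above.

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint of the NameNode
client = InsecureClient("http://namenode.example.com:50070", user="loguser")

def put_log(local_path, hdfs_path):
    # Keep the file in its current format; just ask for 64 MB blocks and 3 replicas
    with open(local_path, "rb") as f:
        client.write(hdfs_path, f, overwrite=False,
                     blocksize=64 * 1024 * 1024, replication=3)

def list_logs(hdfs_dir="/logs"):
    # Directory listing for the REST layer to expose
    return client.list(hdfs_dir)

def fetch_log(hdfs_path, local_path):
    # Stream the file back out unchanged
    client.download(hdfs_path, local_path)
```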

Imposition answered 13/10, 2010 at 17:16 Comment(0)

If you are to choose a document database:

On CouchDB you can use the _attachments API to attach the file as is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.

A similar approach is possible with Mongo's GridFS, but you would build the API yourself.

Also HDFS is a very nice choice.
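
A rough sketch of both approaches; the server URLs, database name, document ID, and metadata fields are hypothetical. The CouchDB part uses the plain HTTP attachment API via requests, and the Mongo part uses pymongo's GridFS.

```python
import requests
import gridfs
from pymongo import MongoClient

meta = {"host": "array42", "ts": "2010-10-09", "format": "gzip"}

# --- CouchDB: a metadata document plus the raw file as an attachment ---
COUCH = "http://127.0.0.1:5984/logs"   # hypothetical database URL
doc_id = "array42-2010-10-09"

rev = requests.put(f"{COUCH}/{doc_id}", json=meta).json()["rev"]
with open("array42.log.gz", "rb") as f:
    requests.put(f"{COUCH}/{doc_id}/array42.log.gz",
                 params={"rev": rev},
                 headers={"Content-Type": "application/gzip"},
                 data=f)

# --- MongoDB: the same file stored as is via GridFS ---
db = MongoClient("mongodb://127.0.0.1:27017")["logs"]
fs = gridfs.GridFS(db)
with open("array42.log.gz", "rb") as f:
    file_id = fs.put(f, filename="array42.log.gz", metadata=meta)

# Retrieval: stream the bytes back out unchanged
data = fs.get(file_id).read()
```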

Williemaewillies answered 13/10, 2010 at 19:53 Comment(0)
