Scalable, fast, text-file-backed database engine?

I am dealing with large amounts of scientific data stored in tab-separated .tsv files. The typical operations are reading several large files, keeping only certain columns/rows, joining with other sources of data, adding calculated values, and writing the result as another .tsv.

The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

Since I am mostly doing selects and joins, I realized I basically need a database engine with a .tsv-based backing store. I do not care about transactions, since my data is all write-once, read-many. I need to process the data in place, without a major conversion step and without cloning the data.

As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.

Does anyone know of a system that would provide database-like capabilities while using plain tab-separated files as the backend? It seems to me like a very generic problem that virtually all scientists have to deal with in one way or another.

Repetition answered 29/7, 2010 at 20:54 Comment(5)
I would hardly consider terabytes of delimited data "robust".Republicanize
How are TB of plain data self-documenting?Lighterage
I'd bet a pretty rock that joins are murder without indexes.Ballistic
@InsertNickHere: You do not need any special software to look at the data. So if I were someone who had never dealt with the data before, I could just open a file, read the header, look at the actual data and write a script to process it in some way. If I were to store the data in a format that is not human-readable, I would need software plus documentation.Repetition
Stefan Kendall: It is robust in the same sense that paper is robust. Your software can crash, your computer can fry, but as long as you have a printed version of your data, you can use it. If your dog eats one page, then you are missing just that one page; the integrity of the rest is not lost. And you do not need to be a DBA to read a piece of paper - anyone can do it.Repetition

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is only about 33%: most 16-bit integers take 5 bytes of text plus a delimiter, i.e. 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
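
To put a number on that overhead for your own files, here is a minimal Python sketch; the million-value column is made up for illustration and nothing in it comes from the question itself:

    # Rough illustration of the text-vs-binary overhead described above,
    # using a made-up column of one million 16-bit integers.
    import random

    values = [random.randint(0, 65535) for _ in range(1_000_000)]

    # As tab-separated text: the digits plus one delimiter per value.
    text_size = sum(len(str(v)) + 1 for v in values)

    # As native unsigned 16-bit integers: 2 bytes per value.
    binary_size = 2 * len(values)

    print(f"text:   {text_size:,} bytes")
    print(f"binary: {binary_size:,} bytes")
    print(f"text is {text_size / binary_size:.1f}x larger")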

I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways:

  1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format (see the sketch after this list).
  2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
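
As a minimal sketch of point 1, a gzip-compressed .tsv still streams like plain text, so existing scripts keep working with only the open call changed; the file and column names below are hypothetical:

    # Stream a gzip-compressed .tsv exactly as if it were plain text.
    # "run_2010_07.tsv.gz", "temperature" and "sample_id" are made-up names.
    import csv
    import gzip

    with gzip.open("run_2010_07.tsv.gz", "rt", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # Rows come back as dicts of strings, just as they would
            # from the uncompressed file.
            if float(row["temperature"]) > 300.0:
                print(row["sample_id"], row["temperature"])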

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly accessible database that doesn't need to worry about losing the source data, since you can always re-read it into the database from the archive.

This should reduce your storage footprint; you should not need twice as much storage space, as you state.

Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.

Donough answered 29/7, 2010 at 21:40 Comment(1)
I usually have floating-point numbers of varying precision, and strings. The problem is that the data is currently being accessed with many different scripts, plus legacy and third-party software. The database would provide additional functionality, allowing users to run large queries, but simple access to the plain-text data is important. I would probably need a FUSE filesystem that provides an uncompressed view of the data.Repetition

One of these NoSQL DBs might work. I highly doubt any of them are configurable to sit on top of flat, delimited files. You might look at one of the open-source projects and write your own database layer.

Republicanize answered 29/7, 2010 at 21:00 Comment(3)
I guess it would take some major hacking, but I will look around. I am curious about Drizzle and its microkernel architecture - it might be a way.Repetition
I found luciddb.org - it seems it might do the trick! It supports flat-file connectivity - you can query a .csv file as if it were a table, using common SQL commands.Repetition
Oh, awesome! You might want to post that as an answer in case someone comes looking later. I'd still wonder about "scalable", but I guess it's easy enough to test if that meets your business requirements, and given the fact that you posted this, it probably does? :PRepublicanize

Scalability begins at a point beyond tab-separated ASCII.

Just be practical - don't academicise it - convention frees your fingers as well as your mind.

Raila answered 29/7, 2010 at 21:30 Comment(1)
I wish I could keep a convention, but the requirements are changing, and the data is often coming from hardware and software that changes schema over time. Some of the queries are completely ad-hoc (e.g. somebody comes with a random translation table in Excel and wants to use it as a part of their query).Repetition

You can do this with LINQ to Objects if you are in a .NET environment: streaming/deferred execution, a functional programming model and all of the SQL operators. The joins work in a streaming model, but one side of the join gets pulled into memory, so you want a large table joined to a smaller table.
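
(The asker notes in the comments that they are on Linux; the same streaming select/where/join idea, with the small table pulled into memory and the large one streamed, looks roughly like this in Python. File and column names are hypothetical.)

    # Streaming join sketch: small table in a dict, large table streamed.
    # "annotations.tsv", "measurements.tsv" and the column names are made up.
    import csv

    def rows(path):
        with open(path, newline="") as fh:
            yield from csv.DictReader(fh, delimiter="\t")

    # Small side of the join: loaded fully, keyed by sample_id.
    lookup = {r["sample_id"]: r for r in rows("annotations.tsv")}

    with open("joined.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["sample_id", "value", "label"])
        # Large side: streamed, filtered, joined, written back out as .tsv.
        for r in rows("measurements.tsv"):
            match = lookup.get(r["sample_id"])
            if match is not None and float(r["value"]) > 0.0:
                writer.writerow([r["sample_id"], r["value"], match["label"]])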

The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.

LINQ against a delimited text file is a common demonstration of LINQ; you just need to provide a way to feed LINQ a tabular model. Google "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).

Expect a learning curve, but it's a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.

I've done work like this before with large mailing lists that needed to be cleansed, deduped and appended. You are invariably I/O bound. Try solid-state drives, particularly Intel's "E" series, which has very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but had to adjust the algorithms to multi-pass approaches that would reduce the data.

Note that I would agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case you're basically doing ETL, which is a well-understood problem in the warehousing community. If the data is ad hoc, however (scientists just drop their results in a directory), you need "agile/just-in-time" transformations, and if most transformations are a single-pass select ... where ... join, then you're approaching it the right way.

Recrudesce answered 29/7, 2010 at 21:08 Comment(1)
Thank you, that definitely sounds like a possibility. Sadly, this is running on Linux, but I could possibly borrow the idea from LINQ, while providing my own implementation.Repetition

I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store the data in a different format, like the database Jason was suggesting, you pay the parsing cost on every operation instead of just once, when you initially process it.
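
One way to pay that cost only once while leaving the .tsv files untouched is to cache a parsed binary copy next to the text; a sketch using NumPy, with a hypothetical path and column index:

    # Parse the text once, cache the column as .npy, memory-map it later.
    # "measurements.tsv" and column index 2 are made-up examples.
    import os
    import numpy as np

    TSV = "measurements.tsv"
    CACHE = TSV + ".col2.npy"

    if not os.path.exists(CACHE) or os.path.getmtime(CACHE) < os.path.getmtime(TSV):
        # Slow path: runs only when the .tsv is newer than the cache.
        column = np.loadtxt(TSV, delimiter="\t", skiprows=1, usecols=2)
        np.save(CACHE, column)

    # Fast path on every later query: no text parsing at all.
    column = np.load(CACHE, mmap_mode="r")
    print(column.mean())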

Skite answered 29/7, 2010 at 21:47 Comment(1)
I think I could maybe come up with a hybrid solution - compress only the most space-consuming file types, and store only the most time-critical part of the data in the database. It is all about economics - where it pays to trade ease of access and management for speed.Repetition

You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file; it takes 18 minutes to read in, persist as objects (one per Wikipedia topic) and store in compact databases. The download includes many samples showing how to read in tab-separated text files.

Turku answered 12/2, 2012 at 6:53 Comment(0)

The question's already been answered, and I agree with the bulk of the statements.

At our centre, we have a standard talk we give, "so you have 40 TB of data", as scientists are finding themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:

  • Plan your I/O
    • Binary files
    • As much as possible, large files
    • File formats that can be read in parallel, with subregions extracted (see the sketch after this list)
    • Avoid zillions of files
    • Especially avoid zillions of files in single directory
  • Data Management must scale:
    • Include metadata for provenance
      • Reduce the need to redo work
    • Sensible data management
      • Hierarchy of data directories only if that will always work
    • Databases and formats that allow metadata
  • Use scalable, automatable tools:
    • For large data sets, parallel tools: ParaView, VisIt, etc.
    • Scriptable tools: gnuplot, Python, R, ParaView/VisIt...
    • Scripts provide reproducibility!
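
As a small illustration of the "binary, subregion-extractable, metadata-carrying format" points above, here is a sketch using HDF5 via h5py; all file, dataset and attribute names are made up:

    # Write one large dataset plus provenance metadata into a single HDF5 file,
    # then read back only a subregion. All names and values are made up.
    import numpy as np
    import h5py

    with h5py.File("experiment.h5", "w") as f:
        dset = f.create_dataset("measurements",
                                data=np.random.rand(1_000_000, 8),
                                chunks=True, compression="gzip")
        dset.attrs["instrument"] = "spectrometer-3"
        dset.attrs["acquired"] = "2010-07-29"
        dset.attrs["script"] = "process_run.py, revision 1234"

    with h5py.File("experiment.h5", "r") as f:
        # Pull out rows 10,000-20,000 and the first two columns only;
        # the rest of the file is never read.
        block = f["measurements"][10_000:20_000, :2]
        print(f["measurements"].attrs["instrument"], block.shape)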

We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.

Butterfingers answered 12/2, 2012 at 15:18 Comment(0)
