database vs. flat files
Asked Answered
J

11

90

The company I work for is trying to switch a product that uses flat file format to a database format. We're handling pretty big files of data (ie: 25GB/file) and they get updated really quick. We need to run queries that randomly access the data, as well as in a contiguous way. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant to this. So I was wondering if you guys can help me out here with some reasons or links to posts of why we should use databases, or at least clarify why flat files are better (if they are).

Joyous answered 1/3, 2010 at 15:28 Comment(11)
You should mention what kind of data structure you're talking about here. If each of those 25 GB files translates into 25 rows of 1 GB each, you're probably better off with your flat files.Sontich
I'm actually more curious to here why your colleagues do not want to use a Relational Database as your datastore? GeezusAciculum
it all depends on all sorts of variables. Impossible to say one is better than the other.Godman
@JD: job security probably, not sure whyJoyous
@Josh Davis: just a tab delimited structure that holds various kinds of info necessary for our businessJoyous
How big is the average chunk of data?Delaryd
@Dean J: You mean how big is the average chunk of data that needs to be retrieved from the file? If so, that's small, but it needs random access to the "index" of the file, which btw is missing :)Joyous
Duplicate of https://mcmap.net/q/246083/-databases-versus-plain-textBasra
@hyperboreean: if you need random access then your colleagues are Dilbertarians.Basra
That's changing, for non-critical but pure performance, websockets, performance api, flat json files are faster, including with high loads. Handling concurrency with php is also quiet easy, and php will handle the memory part. SQL is faster only on windows servers. When writing your app, think about scalability. If well done, it will be easy to jump to SQL//REDIS later.Ogletree
Not talking about SSD that are now everywhere.Ogletree
D
108
  1. Databases can handle querying tasks, so you don't have to walk over files manually. Databases can handle very complicated queries.
  2. Databases can handle indexing tasks, so if tasks like get record with id = x can be VERY fast
  3. Databases can handle multiprocess/multithreaded access.
  4. Databases can handle access from network
  5. Databases can watch for data integrity
  6. Databases can update data easily (see 1) )
  7. Databases are reliable
  8. Databases can handle transactions and concurrent access
  9. Databases + ORMs let you manipulate data in very programmer friendly way.
Drub answered 1/3, 2010 at 15:37 Comment(1)
Databases handle writing and multiple entities/tables better. If you're read-only, a flat file is fine. 1. This is true. 2. Flat files can be indexed. 3. Flat files can be handled asynchronously too. And cached! 4. I would argue that connecting via network creates overhead. 5. True. This makes flat files more fragile. 6. Flat files are not ideal for frequent updates. 7. Flat files are likely more reliable, unless you're referring to schema, which you would have to validate yourself. 8. This was already mentioned. 9. Some ORMs allow you to use flat files (ActiveRecord/ActiveModel)Kindhearted
L
48

This is an answer I've already given some time ago:

It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.

Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.

If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.

On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.

There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 140 terabytes, or 128 tebibytes (Link to Database Size), possible more.

If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.

As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.

Limerick answered 1/3, 2010 at 15:53 Comment(0)
I
9

Don't build it if you can buy it.

I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.

Indenture answered 1/3, 2010 at 15:41 Comment(3)
Actually, the whole application are just a couple of weird bash scripts ... the whole system is a one man show moving files around. Sad, I know ...Joyous
Cool, but last I checked the best databases are free.Barrator
Alas, the converse is equally true. A better saying is "Buy good solutions that are tailored to your needs if they exist, otherwise build it"Godman
D
6

They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.

They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.

They have more features; databases can allow many users to read/write at the same time.

They're much less complex to work with, once they're setup.

Delaryd answered 1/3, 2010 at 15:49 Comment(0)
C
4

What types of files is not mentioned. If they're media files, go ahead with flat files. You probably just need a DB for tags and some way to associate the "external BLOBs" to the records in the DB. But if full text search is something you need, there's no other way to go but migrate to a full DB.

Another thing, your filesystem might provide the ceiling as far as number of physical files are concerned.

Crackling answered 1/3, 2010 at 15:47 Comment(0)
H
3

Databases all the way.

However, if you still have a need for storing files, don't have the capacity to take on a new RDBMS (like Oracle, SQLServer, etc), than look into XML.

XML is a structure file format which offers you the ability to store things as a file but give you query power over the file and data within it. XML Files are easier to read than flat files and can be easily transformed applying an XSLT for even better human-readability. XML is also a great way to transport data around if you must.

I strongly suggest a DB, but if you can't go that route, XML is an ok second.

Huckleberry answered 1/3, 2010 at 15:42 Comment(15)
But Oracle and SQL Server cost money, why pay for something when its better for free? MySQL all the way.Barrator
If they have a 25gb CSV file, this could easily double in size (if not more so) with XML tags for rows and columns. Just saying significant bloat is a consideration when moving from flat files to XML.Forty
@Scott Root: I personally tend to dislike XML because I see it as a heavy method of passing data around.Joyous
Instead of Oracle or SQL Server, you could also use PostgreSQL. Very powerfull and XML and csv are also possible as in- output. Plain XML will be very slow, way too much overhead.Postimpressionism
@Rook Interesting observation - that MySQL is better than Oracle and SQL Server. You obviously never worked with Enterprise level software.Slagle
Either that or you have no idea of how many features come out-of-box with SQL Server or Oracle. Transparent data encryption? Auditing? Automatic instance recovery? Ah, what serious companies need any of that, right? And don't even get me started on tools available to DBAs for backup and recovery and performance tuning (do such things even exist in MySQL?)Slagle
@Slagle Google and Facebook use MySQL, obviously you're enterprise pays too much for software.Barrator
@Rook Google uses primarily their own database, the beast they call Bigtable. It's a very neat thing actually - I've played a bit with it on AppEngine. I am not sure on the particulars of Facebook, but they use Cassandra+Hadoop. They claim to have the world's Hadoop cluster.Slagle
I used to have a similar view on Oracle: I thought it was a clunky old thing with lots of undocumented and unused features. But I learned to appreciate it after working with it. Sure, 95%+ of smaller companies and websites will work fine with MySQL. But the vast majority of Fortune 500 companies use Oracle... And you don't get to Fortune 500 by being stupid.Slagle
And for the record, I think pgSQL is the superior open source RDBMS.Slagle
@Slagle facebook uses mysql for sure, you can google if you don't agree. Google varies based on what application the use (i'm a google bug hunter, i know.) BigTable was developed for their search product. MySQL is beautiful and has many enterprise solutions.Barrator
@Rook Oh, I'm sure they use MySQL in some fashion - after all that's how they started out. Heck, even we use MySQL. "MySQL is beautiful and has many enterprise solutions" For which you have to pay money for :)Slagle
@Rook And of course, from a business perspective, it doesn't make sense for a computer company to rely on a competitor's product. Google would probably rather use punch cards than Oracle. For similar reasons, Oracle offers OVM for free - because believe it or not, Oracle hates VMWare.Slagle
@Slagle lol, yeah thats a good point. However Google bought orkut which was all asp/iis/msssql. Just by scanning google ip ranges you'd be surprised how many windows machines pop up. Although youtube was written in PHP, and google rewrote it to python.Barrator
@Rook Are you sure Orkut was bought by Google? I always thought it was a Google product from the start. Used to be pretty big in Brazil, but now everyone there is switching to Facebook.Slagle
S
3

What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokio Cabinet, etc? I've heard that Google, Facebook, LinkedIn are using these to store their huge datasets.

Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?

Stillmann answered 1/3, 2010 at 15:42 Comment(1)
We're looking into that as well ... first we need to make sure that we're all on the same page. Though, if you need to run some complex reports, I am not sure how nosql handles this.Joyous
T
3

Difference between database and flat files are given below:

  • Database provide more flexibility whereas flat file provide less flexibility.

  • Database system provide data consistency whereas flat file can not provide data consistency.

  • Database is more secure over flat files.
  • Database support DML and DDL whereas flat files can not support these.

  • Less data redundancy in database whereas more data redundancy in flat files.

Thieve answered 25/12, 2017 at 4:55 Comment(0)
G
2

SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.

Goiter answered 1/3, 2010 at 15:36 Comment(0)
A
2

Unless you are loading the files into memory each time you boot, use a database. Simple as that.

That is assuming that your colleges already have the program to handle queries to the files. If not, then use a database.

Anastaciaanastas answered 8/4, 2013 at 6:31 Comment(0)
B
0

Although other answers are good, I would like to emphasize a point that was not really well talked about: The developer's ease of use. databases are much simpler to work with! If you don't have any strong reason(s) for using files, use a database.

Byroad answered 30/11, 2022 at 12:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.