Is it faster to access data from files or a database server? [closed]

If I have a static database consisting of folders and files, would access and manipulation be faster than with an SQL-server-type database, considering that this would be used in a CGI script?

When working with files and folders, what are the tricks to better performance?

Craunch answered 27/1, 2010 at 15:4 Comment(1)
This question is opinion based - because you can't measure how long an operation takes to run over different implementation models. You can't even approximate it. It's all just opinion. No reality here. – Ardatharde

I'll add to the "it depends" crowd.

This is the kind of question that has no generic answer but is heavily dependent on the situation at hand. I even recently moved some data from a SQL database to a flat file system because the overhead of the DB, combined with some DB connection reliability issues, made using flat files a better choice.

Some questions I would ask myself when making the choice include:

  1. How am I consuming the data? For example, will I just be reading rows from beginning to end, in the order they were entered? Or will I be searching for rows that match multiple criteria?

  2. How often will I be accessing the data during one program execution? Will I go once to get all books with Salinger as the author, or will I go several times to get several different authors? Will I go more than once for several different criteria?

  3. How will I be adding data? Can I just append a row to the end, and is that fine for my retrieval, or will the data need to be re-sorted?

  4. How logical will the code look in six months? I emphasize this because I think it is too often forgotten in designing things (not just code; this hobby horse is actually from my days as a Navy mechanic cursing mechanical engineers). In six months, when I have to maintain your code (or you do, after working on another project), which way of storing and retrieving data will make more sense? If going from flat files to a DB results in a 1% efficiency improvement but adds a week of figuring things out when you have to update the code, have you really improved things?

Bumpkin answered 27/1, 2010 at 15:45 Comment(2)
Can you add which tool you would use, given your questions? It seems like the pattern is: if the first part of the question is yes, use a file; if it's the second, use a DB. But I'm not sure. – Firenew
@Firenew 1. Use a flat file if reading from start to finish. 2. Use a flat file if going once to get all books. 3. Use a flat file if appending is fine. – Mahau

As a general rule, databases are slower than files.

If you require indexing of your files, a hard-coded access path on customised indexing structures will always have the potential to be faster if you do it correctly.

But 'performance' is not the goal when choosing a database over a file-based solution.

You should ask yourself whether your system needs any of the benefits that a database would provide. If so, then the small performance overhead is quite acceptable.

So:

  1. Do you need to deal with multiple users and concurrent updates? (Well, you did say it's static.)
  2. Do you need flexibility in order to easily query the data from a variety of angles?
  3. Do you have multiple users, and could gain from making use of an existing security model?

Basically, the question is more about which would be easier to develop. The performance difference between the two is not worth wasting dev time over.

Copula answered 27/1, 2010 at 15:24 Comment(2)
I would add that the performance benefit only exists if you know what you're doing. Creating a good and fast indexing scheme is not easy. Databases have had years to fine-tune their algorithms, even if they are data-generic. Most people I've known who try to beat a database with flat files fail at it. But there are some who succeed, for the rare case where you need it. – Gwenny
If you're running lambdas, start-up time is critical. There are ways of keeping lambdas warm, but they don't work adequately and are harder to implement than you think. If you've got a mass of static data and you only want to read a small fraction of it before the process dies, starting up a database to do it is not going to work. Flat files are. It's often difficult to assess questions like this without bringing in all the preconceptions you have from the "normal" scenarios you're used to. – Ardatharde

It depends on what your information is and what your access patterns and scale are. Two of the biggest benefits of a relational database are:

  1. Caching. Unless you're very clever, you can't write a cache as good as that of a DB server

  2. Optimizer.

However, for certain specialized applications, neither of these two benefits manifests itself compared to a files+folders data store - therefore, the answer is a resounding "it depends".

As for files/folders, the tricks are:

  • Cache the contents of frequently requested files.
  • Keep directories small (a file in a tree of deeply nested small directories is much faster to access than one in a flatter structure with large directories, due to the time it takes to read the contents of a big directory); both this and the caching trick are sketched below.
  • There are other, more advanced optimizations (striping across disks, placement on different areas of a disk or on different partitions, etc.) - but if you have need of THAT level, you are better off with a database in the first place.
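A minimal Perl sketch of the first two tricks, assuming records are keyed by an ID; the data/ root, the .dat suffix, and the two-level fan-out are illustrative assumptions, not a prescription:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %cache;   # in-memory cache of frequently requested records

# Spread files across a tree of small directories (e.g. data/ab/cd/<md5>.dat)
# so no single directory grows large enough to make lookups slow.
sub record_path {
    my ($id) = @_;
    my $hash = md5_hex($id);
    return join '/', 'data', substr($hash, 0, 2), substr($hash, 2, 2), "$hash.dat";
}

# Return a record's contents, touching the disk only on a cache miss.
sub fetch_record {
    my ($id) = @_;
    return $cache{$id} //= do {
        open my $fh, '<', record_path($id) or die "cannot open record $id: $!";
        local $/;    # slurp mode
        scalar <$fh>;
    };
}

Hashing the key keeps every directory at a bounded size no matter how many records are stored.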
Cyprian answered 27/1, 2010 at 15:18 Comment(9)
I have to disagree with a lot of what you've written: 1) Caching on a DB server has to be generic. If you write your own with specific application knowledge, you should be able to thump it hands down. 2) Optimiser - again, the optimiser has to be generic; with specific application knowledge you can code significantly more efficient access paths, and you could also utilise structures not available within typical RDBMS indexing options. 3) Large directories are only slower if you have to 'search' for files; if you have the full path to a file, you won't need to "read the contents of a big directory". – Copula
@Sinan - OK, color me needing a coffee jolt. What do you allude to by "CGI-specific" issues as far as DB vs. files? – Cyprian
@Craig - I don't know what his usage patterns are, or even what his data is. So your points may or may not be valid - it depends. But does your custom file structure know about placing the most-used data in faster areas of the disk? Are you an expert in writing good caches? This is why I said "it depends" - without knowing the details of his app, I'm not prepared to judge one way or the other on how easy it is to write a custom file-based structure for his needs that will beat the DB. – Cyprian
The real benefit of a DB over files comes down to: I need an index, so do I write a) "CREATE INDEX xxx ON Table(Col1, Col2)", or b) define the index structure in code, write code to read the underlying data and populate the index structures, write code that decides when and how to access the indexes created, and write code to deal with error scenarios such as the index not having been created or having been deleted? – Copula
@DVK: It doesn't matter whether you know what his usage patterns are. The point is that he knows (or should), and his application-specific knowledge makes it possible to write better caching. You don't have to be "very clever" to write good caching when you have home-field advantage. The question pertained to the general performance difference: a poorly implemented DB will perform just as badly as a poorly implemented file-based solution, but a well-implemented file-based solution will outperform a well-implemented DB solution (at a much higher cost in dev time). – Copula
@Cyprian CGI scripts incur a significant start-up penalty that partially depends on the number of Perl modules and C libraries that are loaded in. If the location in the filesystem of the information queried or to be stored can be very easily deduced from CGI parameters, it is going to be hard to beat a filesystem-based approach. – Shah
@Sinan - I am spoiled by mod_perl - it eliminates the library load and compilation penalty from the start-up time. – Cyprian
...Oh, and if your database access layer libraries are smart enough to utilize a common pool of DB connections, you will also avoid the cost of opening a DB connection to boot (our company's do that). – Cyprian
@Cyprian I know the advantages of mod_perl. However, the question specifically states a CGI environment. – Shah

From my little bit of experience, server-based databases (even those served on the local machine) tend to have very slow throughput compared to local filesystems. However, this depends on some things, one of which is asymptotic complexity: comparing scanning a big list of files against using a database with an index to look up an item, the database wins.

My little bit of experience is with PostgreSQL. I had a table with three million rows, and I went to update a mere 8,000 records. It took 8 seconds.

As for the quote "Premature optimization is the root of all evil", I would take that with a grain of salt. If you write your application using a database and then find it to be slow, it might take a tremendous amount of time to switch to a filesystem-based approach or something else (e.g. SQLite). I would say your best bet is to create a very simple prototype of your workload and test it with both approaches. I believe it is important to know which is faster in this case.

Menfolk answered 1/2, 2010 at 4:23 Comment(0)

As others have pointed out: it depends!

If you really need to find out which is going to be more performant for your purposes, you may want to generate some sample data to store in each format and then run some benchmarks. The Benchmark.pm module comes with Perl, and makes it fairly simple to do a side-by-side comparison with something like this:

use Benchmark qw(:all);

my $count = 1000;  # Some large-ish number of trials is recommended.

cmpthese($count, {
    'File System' => sub { ...your filesystem code... },
    'Database'    => sub { ...your database code... }
});

You can type perldoc Benchmark to get more complete documentation.

Hupp answered 27/1, 2010 at 15:29 Comment(0)

It is very useful to use files instead of a DB for images, if the site structure is suitable. Create folders representing your data and place the images inside. For example, on an article site where you store your articles in the DB, you don't have to put image paths in the DB: name folders after your primary keys, like 1, 2, 3, ..., and put the images inside. E-books, music files, videos - this approach can be used for all media files. The same logic works with XML files if you won't be searching inside them.
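As a rough Perl sketch of that convention (the base path and the cover.jpg name are hypothetical):

use strict;
use warnings;

# Hypothetical convention: an article's images live in a folder named after its
# primary key, so no image path ever needs to be stored in the database.
sub article_image_path {
    my ($article_id) = @_;
    return "/var/www/site/images/articles/$article_id/cover.jpg";
}

print article_image_path(42), "\n";   # /var/www/site/images/articles/42/cover.jpg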

Delenadeleon answered 28/11, 2013 at 9:21 Comment(0)

A database certainly can be faster.

Quoting SQLite's "35% Faster Than The Filesystem" test:

SQLite reads and writes small blobs (for example, thumbnail images) 35% faster than the same blobs can be read from or written to individual files on disk using fread() or fwrite().

Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.

The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.

The measurements in this article were made during the week of 2017-06-05 using a version of SQLite in between 3.19.2 and 3.20.0. You may expect future versions of SQLite to perform even better.
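To illustrate the amortized open()/close() point, here is a minimal DBD::SQLite sketch; the thumbnails(name, data) table is a made-up example:

use strict;
use warnings;
use DBI;

# One connect (one open of the database file) serves every blob read below,
# whereas per-file storage pays an open()/close() for each thumbnail.
my $dbh = DBI->connect('dbi:SQLite:dbname=blobs.db', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT data FROM thumbnails WHERE name = ?');

for my $name (@ARGV) {
    $sth->execute($name);
    my ($blob) = $sth->fetchrow_array;
    print length($blob // ''), " bytes for $name\n";
}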

Peruzzi answered 13/7, 2020 at 12:49 Comment(0)

Like others said, a DB is a tool and it creates some overhead, but in the case where your data is static and read-only, reading directly from files will be faster. Here are some tests that I've done: I had one .csv file per date, named for that date, and in the database an indexed 'date' column to find the same records. Each day has 30K-50K records/rows and 100 columns of different types of data (90% floats).

DB Info: PostgreSQL 11.5, 16GB of RAM

  Table:
    335,162,867 records
    Table size: 110GB
    Index size: 7GB
    Total size: 117GB
  Files:
    Number of files: 8033
    Total Files size: 158GB
    Number of records/lines per file/date: 30K - 50K

Reading the data for a random date (1986-2019) from a file was consistently 4-5 times faster than reading the data for the same date from PostgreSQL.
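The file side of a test like this is essentially just the following sketch (the data/ directory and the example date are made up):

use strict;
use warnings;
use Text::CSV_XS;

# Reading one day's data means opening exactly one file named after the date,
# rather than an index lookup over a 110GB table.
my $date = shift // '2019-03-17';              # hypothetical example date
my $csv  = Text::CSV_XS->new({ binary => 1 });

open my $fh, '<', "data/$date.csv" or die "no data for $date: $!";
while (my $row = $csv->getline($fh)) {
    # ... process the ~100 columns in @$row ...
}
close $fh;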

Oneman answered 17/3, 2019 at 11:31 Comment(0)

It depends on the profile of the data and what logic you are going to be using to access it. If you simply need to save and fetch named nodes then a filesystem-based database may be faster and more efficient. (You could also have a look at Berkeley DB for that purpose.) If you need to do index-based searches, and especially if you need to join different sets of data based on keys, then an SQL database is your best bet.

I would just go with whatever solution seems the most natural for your application.

Asymmetric answered 27/1, 2010 at 15:20 Comment(0)

As others have said, it depends: on the size and nature of the data and the operations you're planning to run on it.

Particularly for a CGI script, you're going to incur a performance hit for connecting to a database server on every page view. However, with a naive file-based approach you could easily create worse performance problems ;-)

As well as a Berkeley DB file solution, you could also consider using SQLite. It provides an SQL interface to a database stored in a local file; you can access it with DBI and SQL, but there's no server, configuration, or network protocol. This allows easier migration if a database server becomes necessary in the future (for example, if you decide to have multiple front-end servers but need to share state).

Without knowing any details, I'd suggest using an SQLite/DBI solution and then reviewing the performance. That gives you flexibility, a reasonably simple start-up, and decent performance.
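A minimal sketch of what that looks like with DBI (the app.db file name is illustrative):

use strict;
use warnings;
use DBI;

# Today: a serverless database in a local file - no daemon, no network protocol.
my $dbh = DBI->connect('dbi:SQLite:dbname=app.db', '', '', { RaiseError => 1 });

# Later, if a database server becomes necessary, essentially only the DSN changes:
# my $dbh = DBI->connect('dbi:Pg:dbname=app;host=db1', $user, $pass,
#                        { RaiseError => 1 });

my ($n) = $dbh->selectrow_array('SELECT count(*) FROM sqlite_master');
print "$n schema objects\n";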

Like answered 27/1, 2010 at 15:38 Comment(0)

To quickly access files, depending on what you are doing, mmap can be very handy. I just wrote about this on the Effective Perl blog as "Memory-map files instead of slurping them".
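A minimal sketch using the File::Map module from CPAN (the file name and record format are made up):

use strict;
use warnings;
use File::Map qw(map_file);

# Map the file into memory instead of slurping it: pages are faulted in lazily,
# so scanning a large, static file costs far less memory up front.
map_file my $map, 'data/records.txt', '<';

my $count = () = $map =~ /^Salinger\t/mg;   # count lines whose first field matches
print "$count matching records\n";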

However, I expect that a database server would be much faster. It's difficult to say what would be faster for you when we have no idea what you are doing, what sort of data you need to access, and so on.

Toffee answered 1/2, 2010 at 4:13 Comment(0)

I'm going to give you the same answer everyone else gave you: It Depends.

In a simple scenario with a single server that only returns data (read-only), yes, a file system will be great and easy to manage.

But when you have more than one server, you'll have to manage a distributed file system like GlusterFS, Ceph, etc.

A database is a tool that manages all of that for you: the distributed file system, compression, reads/writes, locks, etc.

Hope that's helpful.

Alow answered 26/3, 2018 at 17:1 Comment(1)
Did you mean "it depends"? – Centurion

I chose a file/folder system for my office-management site (1M workers throughout the USA and Brazil), which was my first project as a job.

It was actually the most irresponsible, but luckily the most suitable, decision I have ever made.

To deal with this file system I also made a set of PHP functions to simplify things. The things I considered in making those functions were:

  • Creating table files (like CSV)
  • Creating folders to contain those files
  • Running a function for each row in a selected file
  • Changing/deleting selected column/row data in a selected file, iteratively
  • Saving binary files as separate files

That's all.

But when using it I had to think a lot about optimization. For example, take the 'last seen' of a user.

When some user X wants to see the last-seen time of some other user Y, I make a folder called dataRequestingCluster under X which contains everyone whose last-seen X wants to see (maybe X's friend list, if X is viewing his friends page), and add that cluster to Y's last-seen table (which contains all of Y's last-seen requesters as separate cluster files and is updated every time Y interacts with the server).

Then I added a mechanism to remove dataRequestingClusters when they haven't been used for 3 minutes (when X leaves his friends page, there is no need to keep updating last-seen for him).

I had to follow this approach because updating the 'last seen' in the ALL_USERS file is a complete joke when it comes to time.

The most important thing is that this method is way faster than using a database system with MySQL... But you have to use your brain and about 30% more time, and the happiness and completeness obtained this way are awesome.

This is just one case: 'last seen'.

Many times I used a different algorithm to crack each problem, and it was always more efficient than an SQL/DB system.

Conclusion: file systems make things complex, but they are the only way to improve quality in a shocking way... if you are not good at thinking, are short on time, or are not interested in what you do, you should probably go with an SQL/DB system.

After 6 months my server's data storage reached 20GB (without media files, just user info), but no individual file ever exceeded 200KB. It was amazingly successful, with the least time for processing.

Verile answered 13/7, 2020 at 11:13 Comment(1)
Can you share some of the code, language, or tools you used? – Frighten
