How can I calculate database design storage costs?

I often have a couple of different schemas in mind when starting a project. After making rough guesses, I realize that some are less optimized for growth or storage space than others. Obviously, the size of the column values is the main thing, but table metadata, indexes, and row headers all play a part as well.

In addition, RDBMSs use a completely different approach to data storage than object or key-value databases.

What are some good resources for trying to figure out the cost (or room needed) for database storage?

Note: my question has little to do with choosing the database, and more to do with knowing how to make use of each database's design as efficiently as possible. Databases like PostgreSQL, MySQL, and CouchDB all have different target use cases and multiple ways to solve the same problem, so knowing the storage cost of each solution helps in choosing the best one for the schema.

Schach answered 23/2, 2012 at 21:57 Comment(8)
Why would you want to calculate that when designing a schema? That sounds like an unreasonable thing to try, since the schema alone will not at all determine database size. Also, storage cost will likely be the least important factor in the overall cost of, e.g., selecting the database you need.Sitwell
@ManfredMoser, a database schema is the flesh of your application data design. How it is built shows what your plans are for the data storage.Schach
Yes.. but a LOT of other factors will significantly influence storage so that any assessment from the schema alone without further requirements like performance (caching, indexes..) or querying (data warehouse on top of an OLTP) becomes completely meaningless... imho you are wasting your time.Sitwell
@ManfredMoser, yes I don't doubt that caching is a necessity. However, let's focus on one thing at a time. First we need to know where we can get information for weighing our options, then we can make plans about design, and last we can add CDNs and caching to our design docs to make sure it all works.Schach
Once you factor all that stuff in, as well as thinking about training costs for different technologies, maintenance and so on .. storage cost differences will be completely negligible ..Sitwell
let us continue this discussion in chatSchach
Fair enough.. so there. I think it is a waste to think about storage cost at that early phase of the project you seem to be in (sorry I missed the chat..)Sitwell
Did you ever hear of nuvolabase.com? It's a distributed database on the cloud (OrientDB). I think it perfectly fits small/medium/medium-large projects with very large records.Artillery

RDBMSs use a completely different approach to data storage than object or key-value databases.

The relational model assumes you don't know what data will be needed in the future, or how data will be accessed in the future. This has proven to be a pretty reliable assumption in my experience.

That's one reason a SQL dbms will let you add indexes as they're needed, and let you drop indexes that have proven useless. It will let you add constraints as they become known--constraints that sometimes require adding more tables--and drop constraints as the requirements change. It will let you add columns as you discover more things that would be good to know. It will let you replace tables with views and replace views with tables. Some dbms will let you create materialized views--their impact on query speed can be dramatic, and their impact on disk usage, devastating.

Useful databases extend their reach. A SQL database, designed according to the relational model, makes it relatively easy to add features nobody dreamed of during the initial design, without crushing other parts of the system. So they're often called upon to do things their initial designers didn't imagine.

All of these things

  • adding and dropping indexes over time,
  • adding and dropping constraints over time,
  • adding and dropping columns over time,
  • adding and dropping tables over time,

make any estimate of disk usage look like a waste of time. Any one of them alone can drastically change the disk space required for a database.
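
If you want to see how dramatic any one of these changes can be, a quick experiment is cheap to run. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever dbms you're actually evaluating (the table, row count, and padding are made up, and the ratios will differ by engine). It measures the database file before and after adding a single secondary index:

    import os
    import sqlite3

    DB = "estimate_demo.db"  # hypothetical scratch database

    def file_size_kb(path):
        return os.path.getsize(path) / 1024

    if os.path.exists(DB):
        os.remove(DB)  # start fresh each run

    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, name TEXT, descr TEXT)")
    conn.executemany(
        "INSERT INTO parts (name, descr) VALUES (?, ?)",
        (("part-%d" % i, "x" * 200) for i in range(100000)),
    )
    conn.commit()
    print("before index: %.0f KB" % file_size_kb(DB))

    # A single secondary index can add a large fraction of the table's size.
    conn.execute("CREATE INDEX idx_parts_name ON parts (name)")
    conn.commit()
    print("after index:  %.0f KB" % file_size_kb(DB))
    conn.close()

Run the same kind of experiment for a new column, a new constraint, or a materialized view in your own dbms, and you'll see why a schema-only estimate doesn't survive contact with the first change request.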

You can calculate the space required by a row and a page fairly accurately. (Try a web search for "YourDBMSname row layout" and "YourDBMSname page layout".) But when you try to multiply by the number of rows required, you have to estimate the number of rows. That puts you at the big end of what Steve McConnell calls "the cone of uncertainty".
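
As a back-of-envelope illustration of that per-row arithmetic, here's a sketch. Every constant in it is an assumption loosely modeled on PostgreSQL-style heap layouts; substitute the values from your own dbms's row-layout and page-layout docs:

    # Rough table-size estimate: (row header + column bytes) per row,
    # packed into fixed-size pages with header overhead and a fill factor.
    # All constants are assumptions; replace with your dbms's documented values.

    ROW_HEADER = 24      # per-row (tuple) header, bytes
    PAGE_SIZE = 8192     # 8 KB pages
    PAGE_HEADER = 24     # per-page header, bytes
    ITEM_POINTER = 4     # per-row line pointer stored in the page
    FILL_FACTOR = 0.9    # pages are rarely packed 100% full

    def estimate_table_bytes(column_bytes, row_count):
        """column_bytes: summed widths of one row's columns, padding included."""
        row_bytes = ROW_HEADER + column_bytes + ITEM_POINTER
        usable = (PAGE_SIZE - PAGE_HEADER) * FILL_FACTOR
        rows_per_page = int(usable // row_bytes)
        pages = -(-row_count // rows_per_page)  # ceiling division
        return pages * PAGE_SIZE

    # e.g. an int8 key (8) + timestamp (8) + ~80 bytes of text, one million rows:
    print("%.1f MB" % (estimate_table_bytes(96, 1000000) / 1024.0 / 1024.0))

The row-and-page half of this is solid arithmetic; the row_count argument is the guess, and it's the guess that puts you at the big end of the cone.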

If you haven't measured disk usage in multiple projects over time at your own company, estimating the impact of those bullet points above is just guessing.

The last Fortune 100 company I worked for had an operational database that had been in production since the 1970s. Hundreds of applications, written in more than 25 programming languages over the course of 40 years, hit that thing every day. (I think it was built on IBM's IMS originally; today it runs on Oracle.)

Even just a few years ago, nobody there imagined that their database would be used to translate engineering drawings and bills of materials into Chinese, and also to produce the customs documents they'd need to get finished products out of China. Implementing those new features required storing additional data about every part and about every design document in their live inventory. Early in that project, our estimates were pretty far off. That's the big end of the cone. (We estimated several things, but not disk usage. We were required to succeed, so whatever design I came up with, somebody would be required to supply the needed disk space.) But when we went live, we knew the exact value for every estimate, because we'd already done the work. (That's the narrow end of the cone.)

So, how do you mitigate the risk of guesswork in a database design and deployment environment? Take a lesson from 1972.

Build a prototype, and measure it.

Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in only one step. An intermediate step called the pilot plant is necessary to give experience in scaling quantities up and in operating in nonprotective environments. . . .

. . . Project after project designs a set of algorithms and then plunges into construction of customer-deliverable software on a schedule that demands delivery of the first thing built. . . .

The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.

Fred Brooks, Jr., in The Mythical Man-Month, p. 116.
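
And measuring the prototype is the easy part. If it runs on PostgreSQL, for instance, per-table disk usage is a single query away. Here's a sketch (the connection details are placeholders, but pg_total_relation_size() is a real PostgreSQL function):

    import psycopg2  # third-party driver, assumed installed

    # Placeholder connection parameters for the prototype database.
    conn = psycopg2.connect(dbname="prototype", user="me")
    cur = conn.cursor()

    # pg_total_relation_size() counts heap, indexes, and TOAST storage,
    # so it reflects the real on-disk cost of each design alternative.
    cur.execute("""
        SELECT relname,
               pg_size_pretty(pg_total_relation_size(oid))
        FROM pg_class
        WHERE relkind = 'r'
          AND relnamespace = 'public'::regnamespace
        ORDER BY pg_total_relation_size(oid) DESC
    """)
    for table, size in cur.fetchall():
        print(table, size)
    conn.close()

Load the prototype with realistic data, measure both candidate schemas, and the comparison the question asks for comes from facts instead of guesses.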

Counterpoison answered 3/3, 2012 at 22:50 Comment(2)
I totally agree with storage costs taking second place to flexibility and power. However, my question isn't about choosing one database over another, or even making a choice based solely on saving space. I choose the database based on the requirements. My question is about storage costs when choosing one route within the database over another: for example, calculating the cost of one approach versus an alternative (and equally valid) approach, where space might be a deciding factor swinging the final judgement one way.Schach
@Xeoncross: I think you misread my answer. I didn't say anything about choosing a dbms or a technology. I said, in essence, that you can't express a "requirement" in terms of disk space for a SQL dbms using anything more accurate than guesswork. (This is especially true if you're using agile methods.) So you can't express cost of disk space for a SQL dbms using anything more accurate than guesswork. (Unless a Java programmer designs the database, in which case all the constraints, half the indexes, and half the data will probably end up in application code.)Methodology

Here's an AskTom article I found helpful. It is Oracle-specific though.

http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:266215435203

Warrior answered 3/3, 2012 at 20:57 Comment(0)
