Referential Data Integrity: Necessity, nice-to-have, or old hat?

Asked 31/8, 2010 at 11:57 Answered 31/8, 2010 at 11:57

ruby-on-rails nosql relational-database referential-integrity

Frameworks like Rails have encouraged moving a lot of the logic, even stuff like constraints and foreign keys, off the database - in my opinion. for the better, as it's more manageable and easy to change. Even so, some operations are easier faster, or just plain only possible in SQL.

The recent explosion in popularity of noSQL databases like MongoDB, Cassandra, etc, have change the approach to best practices in database development even more radically.

My question: is referential data integrity no longer a necessity?

I realize it often comes down to choosing the best tool for the job, but let's exclude financial applications and similar type apps where having transactions is a must-have and focus on more typical apps that make money but don't require banking-level integrity.

How necessary is referential data integrity? Can someone list some issues they have had when they're not using it?

Is using a database like PostgreSQL for more critical data, and MongoDB for less critical but highly requested data the smart strategy? How do you suggest defining exactly what data is "critical" and what is "non-critical?"

Hydroelectric answered 31/8, 2010 at 11:57 Comment(0)

I think the question and most of the answers here seem to be saying the same thing: that data integrity (RI is just one common aspect of data integrity) definitely IS necessary and is still as important to day as ever. Data integrity is probably even more important today than in the past due to increased concerns about governance, regulation and data protection.

It just happens that people are finding that the DBMS doesn't provide the facilities they need so they look to implement integrity rules elsewhere. This is strange, because after all the DBMS is closest to the data and therefore ought to be best placed to implement business rules efficiently. Declarative rules ought to be easier to maintain and validate than procedural ones. Maintaining rules centrally in the database also ought to be more cost effective than distributing the rules throughout many other layers and applications.

My conclusion is that if these things aren't proving to be true for some people, then that actually says a lot about the deficiencies of today's database software. It does not imply that integrity is unimportant - quite the reverse.

Medullated answered 31/8, 2010 at 11:57 Comment(1)

I interpret things differently. Too many programmers are addicts Their drug of choice causes both euphoric bliss and simultaneous stimulation. – Hangover 3/9, 2010 at 22:44

If you want to associate and refer to data, referential integrity will always be a valid concern. The modern question is not whether it is necessary, but whether to manage it in the traditional sql database fashion of validating foreign key fields through indexes managed by programmers and database administrators. Simple databases tailored to object access might hide traditional data integrity methods or may allow management of issues programmatically as exceptions, or such concerns can be managed manually.

That being said, the traditional methods work well for most applications (although apparently not eBay). Referential integrity seems silly until you have an integrity issue that is difficult to recover from. Since it is trivial to implement, you should start with it and only remove it once a performance need becomes apparent that can't be met by other means.

As for mongo, use it when it makes an application easier to implement and maintain. You definitely can use both if needed.

Move answered 31/8, 2010 at 11:57 Comment(1)

+1 all around. For my company, referential integrity is more important than raw performance (performance is not un important, just less important). Our app deals with financial information, so maintaining correct references is vital. Leaving the maintenance of referential data to the programmers, skilled as we are, is not a viable business option when those rules can be defined once and violated only with deliberate effort. – Afternoon 31/8, 2010 at 23:32

I think your final comment about having two data stores is the future for most of the new mid-sized apps coming out. One backend with referential integrity for things like connecting core components of the site and another one for larger, Internet scale data.

Legacy companies like eBay shouldn't be used as a comparison since they have the resources to do rigorous QA and to think through the implications of everything the developer does. A typical small-midsize startup doesn't have those resources and keeping critical data in a store with referential integrity prevents alot of application flaws from being able to sit silently in your site for a long time.

Check out Django's support for multiple databases. Keep in mind that moving from an ACID datastore to a CRUD datastore is much easier than the other way around.

Pardner answered 31/8, 2010 at 11:57 Comment(0)

A few years later I'd like to provide my own answer:

Use the right database for the job, but for many workloads a traditional RDBMS (relational database management system) is still a good choice. And since most of those databases provide functionality for further enhancing your schema rules, it's best to use them. Doing something like unique validation or foreign key validation it at the application layer requires unnecessary back and forth and doesn't let the database (like postgresql) do what' it's best at when storing structured, relational data.

Hydroelectric answered 31/8, 2010 at 11:57 Comment(0)

I think the other thing to consider is the lifecycle of the application and data store. If the data store is useful to the business it is likley to be accessed by more then one application and/or have interfaces to other data stores. The closer to the data that referential integrity sits the less risk of an interface or something else making a bad update.

And while the application you're working on now may now have interfaces what about 7 years down the track? (Apparently the average business application is retained for 7 years) When the business grows other tools will be used (e.g. either by implementation into the same business or by acquisition of another business)

Airs answered 31/8, 2010 at 11:57 Comment(0)

I have worked in a company (ebay.com) where the databases are huge. We are not supposed to use any referential integrity whatsoever in the database. This restriction was put in place keeping in mind the performance factor alone. We will not even define anything in the ORM (Object Relational Mapping) level. Everything has to be logically handled. I know its a bit tough to even imagine, but still thats what provides a better performance.

Now for your question, with too many abstractions happening at the ORM level, people do not even care about what is going on the database side. At least the new ones coming out to coding hardly take care of writing Triggers, declaring referential integrity directly in a database (such as oracle) where you can do lots by writing store procedures. But still people prefer and feel easier to code everything at the ORM level. So, IMO, I feel that its becoming a old hat.

Gibbs answered 31/8, 2010 at 11:57 Comment(1)

If one has a billion dollar software engineering budget and extreme performance requirements, one can justify a lot. The truth remains that the formalism available in the dbms is by far the best formalism for expressing data integrity because the formalism was specifically designed for that purpose and to separate the concern of data management. The real question is how best to physically distribute the formalism all the way out to client computers to achieve best performance. – Hangover 3/9, 2010 at 22:50

Nowadays, the answer to your question is not a generic one but depends on the application and the business requirements.

The answer to your question is actually: always think about your data integrity strategy

For example:

In Europe we have the GDPR. You are not allowed to keep personal data longer than necessary, so in some applications, you would like to be able to delete complete customer records, while keeping the order data (another way is to anonymize the customer record)
When a nosql database is the best for your application you will not have referential data integrity provided by the database, so you will have to manage in your application. You either choose to allow/support broken references in your application or implement referential integrity at application level.
In nosql referential integrity is part of your schema design as well. When you store orders, references to order lines will not be references but embedded, avoiding referential integrity problems.

Pollak answered 31/8, 2010 at 11:57 Comment(0)

Recommended topics

Hot tags