Relational v Hierarchical data models

Asked 20/9, 2017 at 12:4 Answered 2/10, 2017 at 21:34

Solved relational-database hierarchical-data relational-algebra

When the relational model was put forward by F.E. Codd, the established databases of the time used the hierarchical model. My understanding is that the relational model was felt to be a significant improvement on the hierarchical approach.

My intuition is that this "makes sense" for a few reasons.

The relational model seems to be "query agnostic" in that instead of the data's shape reflecting the way that you're likely to query it, it is instead structured so that any question can be asked relatively easily.
The relational model also makes mutability simple. You assert or retract facts by adding rows to a table (adding tuples to a set), or removing them. In contrast, in a hierarchical setting you need to add or remove from some other object, which introduces secondary questions such as does the parent object need to be created if it doesn't exist, or removed if it is empty.
The relational model readily models relationships that don't easily fit in to a parent child approach, such as the relation between three entities.
The relational model seems like it is better suited to schema growth as new kinds of fact can be added using new tables. Doing so with some care need not disrupt existing tables (and facts), or services that depend on them.

However, while it feels like the relational data model has advantages, I'd like to have some insight on why it was definitely believed to be significantly superior at the time, and presumably, still is.

I'd really appreciate the arguments in some kind of distilled form, or ideally, one or more papers or other documents, or a canonical reference that go through the reasoning behind this.

For clarity, I'm not asking about actual implementations of either approach, or their relative resource use in terms of storage or compute, unless this is of great salience to the answer.

Thanks.

Ftc answered 20/9, 2017 at 12:4 Comment(7)

This is a historical question & is both too broad & a duplicate. "why it was definitely believed to be significantly superior at the time"--it wasn't. The mess that is SQL shows that neither its abstract aspects or its implementation potentials were understood. "is"--for the reasons you give. Also reasoning, period, is simpler, because relation operators correspond to logic operators. – Quintonquintuple 20/9, 2017 at 21:22

I didn't know know that about the historical point @philipxy, thank you. I'm not specifically interested in the historical view, but the general developed reasoning of why it works well. It seemed to me that the historic view point might be a good opening to this line of thought, but if it is not, I could adjust the question. I was hoping for a bit more than "it works because it uses logical operators" :-) It seems to beg the question, "Why do logical operators work well (compared to hierarchical models)"! – Ftc 20/9, 2017 at 22:16

Also @philipxy, could you please point me to the question that this is a duplicate of? – Ftc 20/9, 2017 at 22:18

There are a zillion Stack Overflow (& Database Administrators & Software Engineering) posts re relational vs non/hierarchical/network/NoSQL. Google many. (See mine re "non-relational" or then -"non-relational" plus another keyword.) Also non-Stack Exchange. Tens of textbooks are free online. I just didn't feel like searching right now, and I haven't yet found answer I really like, but I'm not that motivated because such questions, though faqs, are off-topic here. (I close-voted per asking for offsite resources.) – Quintonquintuple 21/9, 2017 at 0:35

Re "logical": Did you read my link?? Every relational query subexpression corresponds to a predicate expression. Benefits/advantages are all about consequences of that. Because saying things is straightforward. (See my SO posts re predicate(s).) (Idiosyncratic but I'm in programming language semantics & that's just the simplest sound explanation/justification.) Read the expository parts of Codd's original paper. (The math has been greatly clarified, improved & corrected since.) – Quintonquintuple 21/9, 2017 at 0:41

See also Codd's Turing Award Paper Relational Database: A Practical Foundation for Productivity wherein he discusses some of the problems of previous systems and motivations for the relational model. – Lcm 21/9, 2017 at 11:8

@Lcm The Turing Award paper is superb. It is pretty much the answer that I was looking for. In particular, it states Codd's specific objections with existing systems and his goals for a replacement. Then it shows how the relational approach meets the goals, rather than justifying the relational approach in the opposite direction. It's really a terrific source, thanks. I've added an answer referencing the document and linking to your comment. – Ftc 5/10, 2017 at 9:8

It's not quite right to say "established databases of the time used the hierarchical model".

Firstly (to nit-pick), it's database management systems that do/don't use some physical structure. "databases" -- that is database designs might use all sorts of abstractions. Entity-Relational modelling was and still is popular as a design tool irrespective of the eventual physical platform.

Secondly, at the time whilst hierarchical models were usual for 'big iron' large databases, indexed-sequential was far more common on what used to be called 'mini-computers' (like DEC PDP-8/-11; IBM System/34,/36; ICL 1900/ME29; Honeywell DPS4/DPS7).

We might say indexed-sequential organisation on disk grew out of punched-card or magtape systems using batch update-by-copy. That's where the "sequential" comes from.

You say you don't want to ask about the actual implementation; but the answer is all about actual implementation. Reading disk sequentially is more efficent than random access (which needs the read-head to jump about). That's why in contrast to disk, memory is called "Random Access Memory". (This was long before RAM became so cheap we could keep a whole database in memory.)

Similarly the hierarchical model was organised to provide rapid access for commonly-needed query paths. The hierarchy put closely-linked nodes on the same patch of physical disk. So it was easy to navigate from Customer to that Customer's Orders to that Order's Item Lines.

The downside was it was difficult to navigate 'across' the hierarchy -- for example to find all the Order Lines for Item P5432, irrespective of which Customer/Order. (Furthermore if you then want to retrieve the Customers who are ordering P5432, you need to work 'backwards' up the hierarchy. If it's all on the same patch of disk, hopefully you don't need to look too far/perhaps it's in the same disk bucket loaded to RAM.)

Similarly the indexed-sequential organisation favoured one particular index -- the primary key. If you wanted to search by Customer name rather than number, that required a 'secondary index' with all sorts of ugly organisation to keep the index buckets somewhere near the data. And the notorious 'bucket overflow' which could stop a machine dead as you amended one teensy weensy spelling mistake in a name, so shifting it to an entirely different alphabetic position.

(By the way, NoSQL databases, being key-value stores with only one key, seem destined to fall into all those traps to do with secondary indexing. They need a second key-value store to provide an alternative index, with all sorts of fun keeping them in synch. Back to the future!)

The biggest problem Codd had in implementing the Relational Model was to persuade IBM top brass that the model could efficiently support querying through multiple 'access paths'. You'll see a lot of his early papers talking about abstracting the 'navigation' away from the query-writer/programmer. In fact the original System/R design had lots of compromises because

a) the IBM Engineers just didn't understand the mathematical abstractions Codd was talking about;

b) they were scared shitless it would perform like a dog and they'd lose their jobs.

[ahem! personal opinion: but the group got together long after, and there's some reminiscences somewhere around on the web.] Those compromises have persisted 'til today in SQL; which is frankly a pile of crud and should have been killed off as merely an interesting proof of concept.

How did Codd's model succeed (or rather the SQL model, not Codd's)?

Disk technology improved -- particularly seek times
somebody figured out hash-indexing and b-trees, and keeping all the indexes for a table in separate memory to the actual data; rather than trying to hold it like a magtape serial store.
Larry Ellison sniffed out something was afoot, and stole members of the IBM engineering team to build the same thing at Oracle. Also Michael Stonebreaker formed Ingres.

The race was on! There was no time to stop and get everything right. Implement what you've got (i.e. the SQL proof of concept) and rush it to market, ready or not. (Sound like a familiar story?)

Your points about the superiority of the Relational Model are all well-made. They essentially follow from normalisation techniques. I would say, though, that they weren't well understood in the late '70's/'80's. Schema designs looked a lot like the hierarchical or indexed-sequential data models, just transposed to 'flat' tables. In particular, there was a tendency to design 'wide' tables to bring together on one patch of disk everything we know about some Customer, rather than vertically partitioning. (For fear of performance hits from joining together the partitions.) That meant a lot of not-applicable or 'not known' fields -- which is the abomination of SQL's null.

So your "improvements" are as yet only partially attained. One day perhaps we'll see a DBMS engineered to the Relational Model. For now we'll have to put up with SQL.

Tatia answered 21/9, 2017 at 3:9 Comment(2)

The IBM System R reminscences/1995 reunion -- you can see how much the Engineers didn't understand what they were doing. – Tatia 21/9, 2017 at 3:34

Ah, I see you're looking for "canonical references". The trouble is it was all fiercely commercially sensitive at the time. Codd 1970 that Philip refs is indeed the classic. wikipedia on System R is also a good source of links, especially the 1979 ACM SIGMOD on Access Path Selection: the Engineers separated the query request ('declarative' in SQL) from how to fulfill it -- which the DBMS can figure out. Contrast that in a hierarchical model, you have to tell the DBMS what paths to follow. (Highly likely one will be a lot more efficient.) – Tatia 21/9, 2017 at 8:23

AntC gives a super answer with lots of historical background. I wanted to collect an answer from links made in comments to ensure they're not missed.

An absolutely terrific read is the write up of Codd's 1981 Turing Award Lecture. Thanks to reaanb for suggesting it.

In the paper, Codd sets out the case against contemporary databases thus:

These systems burdened application programmers with numerous concepts that were irrelevant to their data retrieval and manipulation tasks, forcing them to think and code at a needlessly low level of structural detail (the "owner-member set" of CODASYL DBTG is an outstanding example.

No commands were provided for processing multiple records at a time--in other words, DBMS did not support set processing and, as a result, programmers were forced to think and code in terms of iterative loops that were often unnecessary (here we use the word "set" in its traditional mathematical sense, not the linked structure sense of CODASYL DBTG};

The needs of end users for direct interaction with databases, particularly interaction of an unanticipated nature, were inadequately recognised. A query capability was assumed to be something one could add on to a DBMS at some later time.

With the problem described, shortly after Codd lays out his requirements for a replacement – the objectives that he wants it to fulfil.

The most important motivation for the research work that resulted in the relational model was the objective of providing a sharp and clear boundary between the logical and physical aspects of database management including database design, data retrieval, and data manipulation. We call this the data independence objective.

A second objective was to make the model structurally simple, so that all kinds of users and programmers could have a common understanding of the data, and could therefore communicate with one another about the database. We call this the communicability objective.

A third objective was to introduce high-level language concepts but not specific syntaxI to enable users to express operations upon large chunks of information at a time. This entailed providing a foundation for set oriented processing i.e., the ability to express in a single statement the processing of multiple sets of records at a time. We call this the set-processing objective.

There were other objectives, such as providing a sound theoretical foundation for database organization and management, but these objectives are less relevant to our present productivity theme.

And then finally he introduces the relational model and shows how it meets the desired objectives and solves the objections to the contemporary systems.

In my view this paper remains extremely relevant.

It gives a strong sense of why Codd was proposing the Relational approach, and it actually throws down the gauntlet for current systems to match his expectations and hopes. I think they fall short, particularly in the way we tend to deploy in heterogeneous / polyglot client server systems. Can we claim that they overcome Codd's objections?

The paper is an approachable and easy read for anyone with only passing familiarity with the field.

A final rather extraordinary extract from the paper stands as a warning to orthodoxy (emphasis mine):

The relational algebra, which includes these and other operators, is intended as a yardstick of power. It is not intended to be standard language, to which all relational systems should adhere. The set-processing objective of the relational model is intended to be met by means of a data sublanguages having at least the power of the relational algebra without making use of iteration or recursion statements.

So Codd is not claiming that the relational model is the best or only approach. He says it has necessary properties to achieve his objectives, but need not be the final word in database design.

It's also worth looking at Codd's Original Paper on relational databases. Thanks to philipxy for this. It is also a readable paper and has plenty of justification and background. It does have more formal content, but there's nothing you won't be able to get a handle on if you've some understanding of relational databases.

Ftc answered 2/10, 2017 at 21:34 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags