I am learning the Relational Model and data modeling.
And I have some confusion in my mind regarding sub types.
I know that data modeling is an iterative process and there are many different ways to model things.
But I don't know how to choose between different options.
Example
Suppose we want to model the particles (molecule, atom, proton, neutron, electron, ...).
Let's ignore Quark and other particles for simplicity.
Since all particles of the same type behave the same, we are not going to model the individual particles.
Put it in another way, we are not going to store every Hydrogen atom.
Instead, we will store Hydrogen, Oxygen and other atom types.
What we are going to model is actually particle types and relationships between them.
I am using the word "type" carelessly.
A Hydrogen atom is an instance. Hydrogen is a type. Hydrogen is also a type of Atom.
Yes, there is a hierarchy of types involved. And we are ignoring the lowest level (individual particles).
Approaches
I can think of several approaches to model them.
1. One table (relation, entity) for each type of things (particle types).
1.1 The first approach that comes into my mind.
Proton (Proton)
Neutron (Neutron)
Electron (Electron)
Atom (Atom)
Atom_Proton (Atom, Proton, Quantity)
Atom_Neutron (Atom, Neutron, Quantity)
Atom_Electron (Atom, Electron, Quantity)
Molecule (Molecule)
Molecule_Atom (Molecule, Atom, Quantity)
1.2 Since there is only one kind of proton/neutron/electron, we can simplify it.
Atom (Atom, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Molecule)
Molecule_Atom (Molecule, Atom, Quantity)
In this simplified model, facts about Proton are lost.
2. All things in one table, with associative tables representing relationships between them.
2.1 one associative table for each relationship
Particle (Particle)
Atom_Proton(Particle, Particle, ProtonQuantity)
Atom_Neutron(Particle, Particle, NeutronQuantity)
Atom_Electron(Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)
2.2 single associative table
Particle (Particle)
ParticleComposition (Particle, Particle, Quantity)
This simplification doesn't lose anything. I think it's better.
But if there're facts that are specific to Atom_Proton/Atom_Neutron/Atom_Electron, 2.1 may be better.
2.3 combine 2.1 and 2.2
Particle (Particle)
Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes)
Molecule_Atom (Particle, Particle, other attributes)
ParticleComposition(Particle, Particle, Quantity, other attributes)
In this approach, common attributes about particle composition go in ParticleComposition,
while special attributes about particle composition go in special tables.
3. Use sub type tables.
3.1 A table for base type Particle, and additional tables for sub types (Atom, Molecule, ...).
Particle (Particle)
Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)
Atom_Proton (Particle, Particle, ProtonQuantity)
Atom_Neutron (Particle, Particle, NeutronQuantity)
Atom_Electron (Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)
3.2 We can also combine the Atom_XXXQuantity tables in Atom and remove Pronton/Neutron/Electron.
Particle (Particle)
Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)
Molecule_Atom (Particle, Particle, AtomQuantity)
It's simpler, but information about Proton/Neutron/Electron is lost as in 1.2.
3.3 We can change the name of Molecule_Atom to make it more generic.
Particle (Particle)
Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)
ParticleComposition (Particle, Particle, Quantity)
This looks like 2.2, with additional tables for sub types (Atom, Molecule).
It seems 2.2 is a special case of 3.3.
3.4 We can combine all the above approaches and get a generic model.
Particle (Particle)
Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)
ParticleComposition (Particle, Particle, Quantity, other attributes)
Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes)
Molecule_Atom (Particle, Particle, other attributes)
It seems that Atom_Proton, Atom_Neutron, Atom_Electron and Molecule_Atom can be thought of as sub types of ParticleComposition.
This approach is the most complex one, it contains many tables but each table has its role.
Questions
- Does any of the above designs break the rules of Relational Model?
- Which approach is the best? Does it depend on how we think about the data? Does it depend on the requirements?
If it depends on the requirements, shall we choose the simplest design at first and then make it more generic to accommodate new requirements?
Although the resulting data models share a lot of similarities, the initial design may influence the naming of the tables/columns, and the domains of the keys are different.- If we choose to use one table for each type of things, we could choose incompatible keys for Atom and Molecule, such as atom weight for Atom and molecule name for Molecule.
- If we choose to use the generic approach, we may choose a common key for all particles.
Changing Keys may have greater impact on the system, so it may not be easy to evolve from a simple design to a generic one.
What do you think?
PS: This may not be an appropriate example and the solutions may be problematic, and there may be more variations of the approaches, but you can get the point hopefully.
If you have better designs, please share with me.
Update 1
What is the data to model?
Initially, I was trying to model the particles because
- I think there are sub-typing relationships between them, which is exactly what I am looking for.
- They are well-understood (?) by people.
- It is a good example of how people understand the world.
Here is the picture in my mind.
I didn't state this clearly because I was not very clear about what I was trying to model either.
Firstly I thought Atom is the parent of Proton/Neutron/Electron, and Molecule is the parent of Atom.
Then I realized that this is about composition, not about subtyping, and not about Type Hierarchy.
Types
I have been thinking about types for a while, as well as grouping and classification.
Here is a quote from "SQL and Relational Theory":
So what is a type, exactly? In essence, it’s a named, finite set of values ─ all possible values of some specific kind: for example, all possible integers, or all possible character strings, or all possible supplier numbers, or all possible XML documents, or all possible relations with a certain heading (and so on).
People coined the name "Integer" to represent the set of integer values.
Actually, people coined concepts and names to identify things, grouping things so that we can understand/model the world.
Proton is a set of real protons, Hydrogen is a set of hydrogen atoms, and so on.
In this sense, the real particles stay at the lowest level of the type hierarchy.
I was trying to model all the particles at first, but then I was stuck because
- I couldn't think of an appropriate key to identify each real particle;
- there are too many of them to store in a database.
So I decided to ignore the real particles and model the types instead.
When we say "a molecule is composed of atoms", it means "a real H2O molecule is composed of two real Hydrogen atoms and one Oxygen atom", it also means "any (type of) molecule is composed of (some types of) atoms".
Instead of stating every fact about the real particles, we can just state facts about the particle types.
That is the benefit we get by grouping things and coined names (types).
Particle Type Hierarchy As Sets
The hierarchy can be translated into set definitions.
Second level - types above the real particles:
S_proton = { p | p satisfied the definition of a proton }
S_neutron = { n | n satisfied the definition of a neutron }
S_electron = { e | e satisfied the definition of an electron }
S_hydrogen = { h | h satisfied the definition of a hydrogen }
S_oxygen = { o | o satisfied the definition of an oxygen }
S_h2o = { w | w satisfied the definition of a h2o }
S_o2 = { o | o satisfied the definition of a o2 }
Higher Levels
Using the terminology of Set Theory, type A is a subtype of B if A is a subset of B.
I first thought we could define the Atom type as:
S_atom = S_hydrogen union S_oxygen union ...
However, the sets are relations and the elements are tuples, so the union doesn't work if tuples in the relations are incompatible.
The approaches that use subtype tables solve the problem and models the subset relationship.
But in the subtyping approach, Atom is still in the second level.
Higher level types are defined as sets of sets.
S_atom = { S_hydrogen, S_oxygen, ... }
S_molecule = { S_h2o, S_o2, ... }
S_particle = { S_proton, S_neutron, S_electron, S_atom, S_molecule }
which means Particle is the type of Atom, and Atom is the type of Hydrogen.
This way, the relationships between particles can be represented at a high level.
The new data model
4. Treat types as a hierarchy of types
ParticleType (ParticleType, Name)
ParticleTypeHierarchy (ParticleType, ParentType)
ParticleComposition (PartileType, SubParticleType, Quantity)
Sample data:
ParticleType | ParticleType | Name | |--------------+----------| | Particle | Particle | | Proton | Proton | | Neutron | Neutron | | Electron | Electron | | Atom | Atom | | Molecule | Molecule | | H | Hydrogen | | O | Oxygen | | H2O | Water | | O2 | Oxygen | ParticleTypeHierarchy | ParticleType | ParentType | |--------------+------------| | Proton | Particle | | Neutron | Particle | | Electron | Particle | | Atom | Particle | | Molecule | Particle | | Hydrogen | Atom | | Oxygen | Atom | | H2O | Molecule | | O2 | Molecule | ParticleComposition | PartileType | SubParticleType | Quantity | |-------------+-----------------+----------| | H | Proton | 1 | | H | Electron | 1 | | He | Proton | 2 | | He | Neutron | 2 | | He | Electron | 2 | | H2O | H | 2 | | H2O | H | 2 | | H2O | O | 1 | | CO2 | C | 1 | | CO2 | O | 2 |
For comparison, this is the sample data for a subtype table approach.
Particle | ParticleId | ParticleName | |------------+----------------| | H | Hydrogen | | He | Helium | | Li | Lithium | | Be | Beryllium | | H2O | Water | | O2 | Oxygen | | CO2 | Carbon Dioxide | Molecule | MoleculeId | some_attribute | |------------+----------------| | H2O | ... | | O2 | ... | | CO2 | ... | Atom | AtomId | ProtonQuantity | NeutronQuantity | ElectronQuantity | |--------+----------------+-----------------+------------------| | H | 1 | 0 | 1 | | He | 2 | 2 | 2 | | Li | 3 | 4 | 3 | | Be | 4 | 5 | 4 | ParticleComposition | ParticleId | ComponentId | Quantity | |------------+-------------+----------| | H2O | H | 2 | | H2O | O | 1 | | CO2 | C | 1 | | CO2 | O | 2 | | O2 | O | 2 |
sub-atom
These particle types are defined by people and people keep defining new concepts to model new aspects of the reality.
We can define "sub-atom" and the hierarchy will look like:
Approach 4 can accommodate this type hierarchy change more easily.
Update 2
The facts to record
- There are different types of particles in the world: protons, neutrons, electrons, atoms, molecules.
- Atoms are composed of protons, neutrons, and electrons.
- Molecules are composed of atoms.
- There are many different types of atoms: Hydrogen, Oxygen, ....
- There are many different types of molecules: H2O, O2, ....
- A Hydrogen atom is composed of one proton and one electron; ...
- A H2O molecule is composed of two Hydrogen atoms and one Oxygen atom; ...
- Different types of particles may have special properties, e.g. An atom has atom weight, etc.
- ...