Clear explanation of the "theta join" in relational algebra?
Asked Answered
C

4

10

I'm looking for a clear, basic explanation of the concept of theta join in relational algebra and perhaps an example (using SQL perhaps) to illustrate its usage.

If I understand it correctly, the theta join is a natural join with a condition added in. So, whereas the natural join enforces equality between attributes of the same name (and removes the duplicate?), the theta join does the same thing but adds in a condition. Do I have this right? Any clear explanation, in simple terms (for a non-mathmetician) would be greatly appreciated.

Also (sorry to just throw this in at the end, but its sort of related), could someone explain the importance or idea of cartesian product? I think I'm missing something with regard to the basic concept, because to me it just seems like a restating of a basic fact, i.e that a set of 13 X a set of 4 = 52...

Conflagrant answered 27/9, 2011 at 1:58 Comment(2)
Readers may also be confused because of this "theta join" [sic] & my answer.Clarino
You have added back a pile of noise that also obfuscates how many questions you are asking.Clarino
H
12

Leaving SQL aside for a moment...

A relational operator takes one or more relations as parameters and results in a relation. Because a relation has no attributes with duplicate names by definition, relational operations theta join and natural join will both "remove the duplicate attributes." [A big problem with posting examples in SQL to explain relation operations, as you requested, is that the result of a SQL query is not a relation because, among other sins, it can have duplicate rows and/or columns.]

The relational Cartesian product operation (results in a relation) differs from set Cartesian product (results in a set of pairs). The word 'Cartesian' isn't particularly helpful here. In fact, Codd called his primitive operator 'product'.

The truly relational language Tutorial D lacks a product operator and product is not a primitive operator in the relational algebra proposed by co-author of Tutorial D, Hugh Darwen**. This is because the natural join of two relations with no attribute names in common results in the same relation as the product of the same two relations i.e. natural join is more general and therefore more useful.

Consider these examples (Tutorial D):

WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
     RELATION { TUPLE { X 1 } , TUPLE { X 2 } } AS R2 :
R1 JOIN R2

returns the product of the relations i.e. degree of two (i.e. two attributes, X and Y) and cardinality of 6 (2 x 3 = 6 tuples).

However,

WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
     RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } } AS R2 :
R1 JOIN R2

returns the natural join of the relations i.e. degree of one (i.e. the set union of the attributes yielding one attribute Y) and cardinality of 2 (i.e. duplicate tuples removed).

I hope the above examples explain why your statement "that a set of 13 X a set of 4 = 52" is not strictly correct.

Similarly, Tutorial D does not include a theta join operator. This is essentially because other operators (e.g. natural join and restriction) make it both unnecessary and not terribly useful. In contrast, Codd's primitive operators included product and restriction which can be used to perform a theta join.


SQL has an explicit product operator named CROSS JOIN which forces the result to be the product even if it entails violating 1NF by creating duplicate columns (attributes). Consider the SQL equivalent to the latter Tutoral D exmaple above:

WITH R1 AS (SELECT * FROM (VALUES (1), (2), (3)) AS T (Y)), 
     R2 AS (SELECT * FROM (VALUES (1), (2)) AS T (Y))
SELECT * 
  FROM R1 CROSS JOIN R2;

This returns a table expression with two columns (rather than one attribute) both called Y (!!) and 6 rows i.e. this

SELECT c1 AS Y, c2 AS Y 
  FROM (VALUES (1, 1), 
               (2, 1), 
               (3, 1), 
               (1, 2), 
               (2, 2), 
               (3, 2)
       ) AS T (c1, c2);

** That is, although there is only one relational model (i.e. Codd's), there can be more than one relational algebra (i.e. Codd's is but one).

Hale answered 27/9, 2011 at 6:3 Comment(6)
The current version of Tutorial D does not "lack" cartesian product. The operator is called TIMES. (And although the formal Tutorial D grammar in the Manifesto book does not include TIMES, appendix A in the same book clearly suggests that D&D already had TIMES in mind as a separate operator at the time of writing of the Manifesto book. Not mentioning it in the formal grammar is an incompleteness they fixed in Database Explorations.)Scat
@onedaywhen, thank you so much for expounding on this. I'm having trouble with some of the basic concepts. Why is the natural join (in your first (Tutorial D) example the same as the product? I'm really just asking, why do you multiply at all? Doesn't the resulting set have 5 elements? I'm clearly missing something basic here... Also, could you explain in simple terms what a tuple is? Is it helpful to think of it as being almost synonymous to a row?Conflagrant
@Erwin Smout: I am aware of TIMES and originally included it in my answer but then removed it: see revision history, I had a genuine dilemma about this. Please edit my answer to correct it as you see fit :)Hale
@LuxuryMode: yes, tuple in a relation is analogous to a row in a table. TheHale
@LuxuryMode: "Doesn't the resulting set have 5 elements?" -- if we are talking about set product then the result would be a set of pairs. But with relational product (surely what we are discussing) the result is a relation and the result you seem to be suggesting is not a valid relation: REL{TUP{Y 1},TUP{Y 2},TUP{Y 3}, TUP{X 1},TUP{X 2}}; an attribute cannot be both X and Y.Hale
Theta join was originally defined on relations with lists as headings. It output the number of columns input. For sets as headings & output column names taken from inputs, it is least perverted by defining it to still preserve all input columns & so to just not allow inputs to share column names. Codd also at one point had a notion of output column names reflecting input column names plus argument names, so in that semantics one can have sets as headings and shared column names on input. What is the difference between theta join and inner join?Clarino
S
3

You're not quite right - a theta join is a join which may include a condition other than = - in SQL, typically < or >= etc. See TechNet

As for cartesian product (or CROSS JOIN), it is an operation rather than an idea or concept. It's important because sometimes you need to use it! It is a basic fact that set of 13 x set of 4 = 52, and cartesian product is based on this fact.

Stanton answered 27/9, 2011 at 3:11 Comment(7)
strictly speaking, Θ-join is a proper superset; it includes = as well as other comparisonsNarbada
I'm still not super clear on this. So Exp1, theta join Exp2 is equal to selecting with a condition from the cross-product of exp1 X exp2?Conflagrant
@LuxyryMode yes that is correct. A theta join is a subset of a cartesian product, not a natural join.Stanton
In fact all joins are a subset of a Cartesian product!Manuscript
@Manuscript except an outer join against an empty set - which I suppose is not a real join.Stanton
@KirkBroadhurst Any two relations can be joined. That includes relations with no tuples. (And it includes relations with no attributes.)Clarino
@Manuscript Outer joins & natural joins (& SQL USING joins) are not subsets of Cartesian products. Maybe you are trying to say, "all" calls of SQL inner join ON.Clarino
C
2

If you understand equijoin, you should understand theta join. If you change the symbol = (equals) in equijoin to >=, then you have already done theta join.

However, it is difficult to see the practicality of theta join compared to equijoin since the join clause that we normally use is V.primarykey = C.foreignkey. If you want to change to theta join, then it may depend on the value; as such you are doing selection.

Natural join is similar to equijoin; the difference is that it gets rid of the redundant attributes.

Cactus answered 9/2, 2012 at 8:23 Comment(0)
H
0

Inner joins can be thought of as beginning with a cross product and then weeding out certain rows. A natural join weeds out all rows in which columns of the same name in the two tables being joine have different values. An equijoin weeds out all rows in which the specified columns have different values. And a theta-join weeds out all rows in which the specified columns do not stand in the specified relationship (<, >, or whatever; in principle it could be is_prefix_of as a relationship between strings).

Outer joins cannot be understood this way, because they synthesize information (that is, nulls) out of nothing.

Hegira answered 31/5, 2019 at 0:51 Comment(1)
(In the sort of relational algebra you are talking about, although there are others:) Natural join is not a filter of a cross join. It is a certain projection of a filter of a cross join. PS "Conceptually" is not helpful. We are talking about operators. They are already "conceptual". Any other sense you mean is not clear. PS See my comments on the question & another answer.Clarino

© 2022 - 2024 — McMap. All rights reserved.