What is a self-join for? (in English)
Asked Answered
D

2

0

What would I seek to achieve from a self-join?

I already know what a self-join does. Operational descriptions on Stack Overflow with tables or join lists or the self-referencing employee-manager example do not help.

My usage is in a university course, from a Relational Algebra angle. I have done some SQL but the instructor loves to do self-joins (after renaming one or more fields). This is not something often done in SQL, so I'm wondering what he is trying to perform.

Desma answered 1/4, 2016 at 9:9 Comment(0)
S
2

The reason why the employee-manager example is so common, is because it hits the nail on the head. A self join on a table looks for pairs of rows, like any join, but with both rows coming from the same table. Nothing special really.

Soane answered 1/4, 2016 at 9:18 Comment(11)
Thanks for the reply but I think you're missing what I'm asking.. Employee-manager does hit that particular nail on the head precisely; and it's, Relational Algebra-icly speaking, a very particular type of self-referencing behaviour. That's why I specifically excluded it from replies. On the surface it appears the self join [natural, since we're talking Relational Algebra, not SQL] is "nothing special" so why do it? I'm thinking : Query1 = tableX β‹ˆ 𝛿 col1↦col1, col2↦col2 (tableX) Query2 = Οƒ col1 = col1 ` (Οƒ col2 = col2'(Query1)) would"return full list of unique lines in tableX" – Desma
Oh, I think I it's not that particular at all. It occurs all the time, whenever two objects of the same type are related. Bank accounts, saving and checking? But you seem to miss that the join is almost always cross-column. It's not because a self join on a single column returns the table - it usually doesn't. It can return far, far more rows. Consider a self-join on a table of people where the genders are identical. It will return every possible pair of males and every possible pair of females. Well-defined but rarely needed. – Soane
Sorry, are you still join on about the employee-manager thing ("particular"). Or are you answering the question I asked? If you are your answer is very confusing. so far my answer appears, from my own research to be "as to create a double entry on all related items" ... ....thus to be set up for removal of true duplicates (select on <> for renamed fields) so to leave true instances where their are multiple cases of common rows. That can be removed for single, or projected to show reoccurances. – Desma
@mist42nz: I was talking about the general case in everything but the first sentence of my answer. "Removal of duplicates" is only one example where you do a self join. This is clear from my second sentence : a self join looks for any pair of rows. (I think you call that a "double entry"). Duplication removal' is a specific case in which you want such pairs: you intend to delete one of the two rows in each pair. – Soane
Thanks that's much more along the lines of what I was asking. So self-join (relational algebra speaking) is helpful in : self-referrencing cases (employee-manager), in duplication removal (via removing rows which self-pair). Any other particularily useful operations that follow in that line. This means I can now speak to people in lay language and say "a way of doing duplication removal is to first do a self join". Thus it becomes a English discuss (or entrance to pseudo-code) which is exactly what I'm looking for. Thank you again! – Desma
@Desma What do you mean, "return full list of unique lines in tableX"? Since tableX is a relation, every "line" (row?) is "unique". Almost none of your English description of query results or cases in these comments makes sense. (And from your use of scare quotes you seem to know it.) To be understood in natural language means you have to put in the (sometimes considerable) effort to be clear. And if you don't know what you mean, then make that clear. By forcing yourself to write clearly you will improve your ability to understand & explain questions like this one for yourself. – Ghyll
Sorry Phil but your inability to understand is not my problem. If you do a self, i.e. Natural join, there is a lot of duplicated rows. I was using lines because I was trying to steer clear of the SQL is everything idea and deal with the logic involved. It's not so much that the lines themselves are clones, but the information in the line is duplicated, which is a little tough to cover in this little sound bite box. With MSalters reply and the Prof at university I was able to get usable information – Desma
Where the prof has trouble getting down to the layperson.programmer level of discussion. One of the results of the self join is that the triples, (line) will carry its own value, and that of other similar lines. Eg if matched on an id, the. One row for each ID will have double values that are identical. If there is only one ID in the table, that one row will have duplicates. Where if the ID has two entries, then the renamed column/field will show up as an entry where that field is different from the original table. So by removing all the unique "same data" we're left with a complete list – Desma
Of just the information held in the duplicate ID's. We can the use that on a line by line basis, eg if there's a start and end date, we can see if there are overlaps in the period for that ID. Or we can look at the data, knowing that it represents a group of all data with "more than one occurrence". Thus we could remove this group from the original group, and thus leave only those with unique/single/one off data in the rename field(s), or instead we could do a second self join on our duplicates and remove those with only two entries in the original table. This – Desma
I knew what you were trying to say in your question and comments. I was just telling you that you weren't managing to say it. (The same goes for your comments since.) This is a problem for you if you want to be able to communicate with other people. (Thanks for the effort in those last 3 comments though.) – Ghyll
This answer begs the question. – Ghyll
G
0

The database designer gives each base table a predicate (sentence template parameterized by column names).

-- person PERSON is parent of person CHILD
-- shorthand Parent(PERSON, CHILD)
create table Parent(person, child)

-- person PERSON likes food FOOD
-- shorthand Likes(PERSON, FOOD)
create table Likes(person, food)

Relational algebra is designed so that the value of a relational expression (base table name or operator call) holds the rows that make a true proposition (statement) from its predicate.

/* (PERSON, CHILD) rows where
person PERSON is parent of person CHILD
i.e. where
Parent(PERSON, CHILD)
*/
Parent

The predicate of an expression that is a call to operator NATURAL JOIN is the AND of the predicates of its inputs.

/* (PERSON, CHILD, FOOD) rows where
person PERSON is parent of person CHILD AND person PERSON likes food FOOD
i.e. where
Parent(PERSON, CHILD) AND Likes(PERSON, FOOD)
*/
Parent NATURAL JOIN Likes

Ditto for UNION and OR, MINUS and AND NOT, PROJECT kept column(s) and EXISTS the dropped column(s), RESTRICT condition and AND condition, and RENAME of a column & rename of a parameter.

/* (CHILD, FOOD) rows where
there EXISTS a value for PERSON such that
    person PERSON is parent of person CHILD AND person CHILD likes food FOOD
i.e. where
EXISTS PERSON:
    Parent(PERSON, CHILD) AND Likes(CHILD, FOOD)
*/
PROJECT child, food (Parent NATURAL JOIN (RENAME person:=child Likes))

So every query expression's value holds the rows that make its predicate into a true statement.

Suppose we define algebraic self-join of a table as NATURAL JOIN of two tables got from an original via sequences of zero or more renamings. Per the above we NATURAL JOIN for rows that satisfy the AND of predicates.

A self-join arises when we want the rows that satisfy a result predicate that we expressed via multiple uses of a predicate that differ only in parameters/columns.

/* (PERSON, FOOD, CHILD) rows where
person PERSON and person CHILD like the same food FOOD
i.e. where
person PERSON likes food FOOD AND person CHILD likes food FOOD
i.e. where
Likes(PERSON, FOOD) AND Likes(CHILD, FOOD)
*/
Likes NATURAL JOIN (RENAME person:=child Likes)

There's nothing special about a self-join arising in a given query in a given application other than that.


SQL SELECT DISTINCT statements can be described via algebraic operators. That determines query predicates. First FROM table columns that aren't SQL NATURAL JOIN common columns are RENAMEd by prefixing a table alias (correlation name) and a dot. Then for CROSS & INNER JOIN the new tables are CROSS JOINed, but assuming each input has no duplicate column names, after dotting that's NATURAL JOIN. Then ON and WHERE RESTRICT per a condition. Then SELECT DISTINCT uses RENAME or = to give dotless returned columns and PROJECTS away unwanted dotted columns. SQL accomplishes what algebra RENAMEs do but it is via SELECT AS or implicitly dotting.

We can convert SQL to predicates directly: A FROM table contributes a predicate like its predicate but with the parameters dotted with its alias; if it's a subquery, rename conflicting uses of that alias within its predicate. CROSS & INNER JOIN give AND. ON and WHERE give AND with a condition. Each dot-free result column gives an AND that it equals its dotted version. Finally dropping all dotted columns gives EXISTS.

/* same as above */
/* (PERSON, FOOD, CHILD) rows where
person PERSON and person CHILD like the same food FOOD
i.e. where
there EXISTS values for P.* & C.* such that
        PERSON = P.PERSON AND CHILD = C.PERSON AND FOOD = P.FOOD
    AND person P.CHILD likes food P.FOOD
    AND person C.CHILD likes food C.FOOD
    AND P.FOOD = C.FOOD
i.e. where
EXISTS P.*, C.*:
        PERSON = P.PERSON AND CHILD = C.PERSON AND FOOD = P.FOOD
    AND Likes(P.CHILD, P.FOOD)
    AND Likes(C.CHILD, C.FOOD)
    AND P.FOOD = C.FOOD
*/
SELECT DISTINCT p.person AS person, c.person AS child, p.food AS food
FROM Likes p INNER JOIN Likes c
ON p.food = c.food

Again: In SQL we say there is a self-join when multiple table aliases of a JOIN are associated with the same table value; in application terms that means we can express a query meaning in terms of predicates differing in some parameters/columns; there's nothing special about applications or table meanings for this to arise.

A query with scalar subqueries in SELECT and/or WHERE can be rearrranged to only involve JOINs of the subqueries in a FROM. So we often call such a query a self-join. Correspondingly, such a query's predicate has parameters and quantifiers (names introduced by EXISTS and FORALL) appearing as or equal to arguments to the subquery predicates, and the query can be rearranged to instead AND those subquery predicates in the part of a rearranged query predicate that corresponds to the original's FROM.


Re SQL query semantics see my answer at Is there any rule of thumb to construct SQL query from a human-readable description?. And it happens to give a self-join as an example.

For another self-join see my answer at How to get matching data from another SQL table for two different columns: Inner Join and/or Union?.

Re SQL NATURAL JOINs see my answer at Inner Join vs Natural Join vs USING clause: are there any advantages?.

Ghyll answered 23/5, 2016 at 6:50 Comment(6)
Thank you for taking the time. But please read the question properly next time. There are plenty of SQL and even relational algebra replies on the Web, and even on Stackoverflow about the mechanics of self join. What you completely and utterly failed to do, was to give an English description of what it does (I.e. Why you'd want to do a self join), which was the entire point of the question. IIRC it was, by renaming one or more fields; and using self join to match and remove via redundancy other fields, the self join allows search of duplicates, overlaps, and multiplicity in a table. – Desma
This answer explains exactly "Why you'd want to do a self join" in terms of reasoning about the business/application: "when we want the rows that satisfy predicates that differ only in parameters [of sentence templates]". You asked to "to steer clear of the SQL is everything idea and deal with the logic involved". Logic is by definition the manipulation of predicates. The only thing this answer has to do with algebra & SQL is that it tells you how to convert Engish statements about the business/business to them. It says why & when the joins & renamings arise. – Ghyll
Your comments do not reflect an understanding of what I wrote. I talk about business logic then, since the question asks how that leads to algebra/SQL inner join, I also talk about how that leads to inner join. I clearly do not "use" algebra or SQL for business logic, I use English. Formal logic is a subset of English, the language of precision in engineering (including software engineering) and science (including computer science). You are ignoring my use of "predicate" (which generalizes SQL use) as a statement using names for things, without which communication is impossible. – Ghyll
philip, why do we put data into tables. – Desma
Because "Each named table has a predicate (sentence template parameterized by column names)." and "The value of an expression that is a table's name holds the rows that make a true proposition (statement) from its predicate." You have only to give any example to see this. – Ghyll
@ADyson This is a clear, justified, illustrated, self-contained answer to the question that can be applied to any example. Whether it uses unfamiliar words or organization is irrelevant. (See also my previous comments.) If you can't follow or apply it then please ask for clarification where you get stuck. If you don't try to use it then you don't know whether it is useful. You might try to soundly & clearly justify that any query whatsoever that you write meets its specification in application terms, because that cannot be done more simply than using the notions in this answer. – Ghyll

© 2022 - 2024 β€” McMap. All rights reserved.