Hash Join vs. Nested Loop

Asked 15/1, 2018 at 15:12 Answered 2/6, 2020 at 22:23

I'm new to SQL and I have a question.

I have this SQL code:

DROP TABLE if exists students;
DROP TABLE if exists grades;

CREATE TABLE students(
    s_id integer NOT NULL PRIMARY KEY,
    s_name text,
    s_last_name text,
    curr_year integer
);

CREATE TABLE grades(
    s_id integer NOT NULL PRIMARY KEY,
    course text,
    c_year integer,
    grade integer,
    FOREIGN KEY (s_id) REFERENCES students
);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (1, 'A', 'S', 3);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (2, 'A', 'A', 2);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (3, 'V', 'B', 1);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (4, 'K', 'N', 2);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (1, 'DB', 2, 98);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (2, 'OS', 3, 90);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (3, 'DB', 2, 94);

EXPLAIN ANALYZE
    SELECT * 
    FROM students s JOIN grades gr
    ON s.s_id = gr.s_id
    WHERE curr_year > 0;



CREATE INDEX student_details ON students(s_id, s_name, s_last_name);
CREATE INDEX student_courses ON grades(s_id, course);

EXPLAIN ANALYZE
    SELECT * 
    FROM students s JOIN grades gr
    ON s.s_id = gr.s_id
    WHERE curr_year > 0;


DROP INDEX student_details;
DROP INDEX student_courses;
DROP TABLE students CASCADE;
DROP TABLE grades CASCADE;

And I try to understand the explain outputs. Before the INDEX creating I got hash merge. Here is the explain output:

 Hash Join  (cost=23.50..51.74 rows=270 width=116) (actual time=0.039..0.050 rows=3 loops=1)
   Hash Cond: (gr.s_id = s.s_id)
   ->  Seq Scan on grades gr  (cost=0.00..21.30 rows=1130 width=44) (actual time=0.005..0.008 rows=3 loops=1)
   ->  Hash  (cost=20.12..20.12 rows=270 width=72) (actual time=0.021..0.021 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on students s  (cost=0.00..20.12 rows=270 width=72) (actual time=0.006..0.011 rows=4 loops=1)
               Filter: (curr_year > 0)
 Planning time: 0.147 ms
 Execution time: 0.089 ms
(9 rows)

And after the INDEX creating I get Nested Loop:

Nested Loop  (cost=0.00..2.12 rows=1 width=116) (actual time=0.031..0.116 rows=3 loops=1)
   Join Filter: (s.s_id = gr.s_id)
   Rows Removed by Join Filter: 9
   ->  Seq Scan on students s  (cost=0.00..1.05 rows=1 width=72) (actual time=0.012..0.018 rows=4 loops=1)
         Filter: (curr_year > 0)
   ->  Seq Scan on grades gr  (cost=0.00..1.03 rows=3 width=44) (actual time=0.003..0.009 rows=3 loops=4)
 Planning time: 0.396 ms
 Execution time: 0.170 ms

But I can't figure out why? Why the indexing in my case made Nested Loop be preferred over Hash Join? I'll be very happy to get an explanation for that.

Thanks a lot!

Rugger answered 15/1, 2018 at 15:12 Comment(2)

For tables with just a handful of records optimisation makes no sense at all. Hint: the planner already spends more time generating the plan than executing it. – Alnico 15/1, 2018 at 15:29

If this is based on a real world example, those indexes are pointless. There is already a clustered index on sid in both tables, so including this column at the start of the index will render the index useless. – Stotts 15/1, 2018 at 15:35

"Nested loop" join is a bit of a misnomer. Technically it refers to a nested loop -- as its name implies:

for every row in table1
    for every row in table2
        compare values and execute join

In practice, this often works as:

for every row in table1
    for every matching row in table2
        execute join

The difference is subtle, but the "matching" means that the nested loop join can make use of an index. So, a nested loop join can have very poor performance (if the tables are relatively large and there are no indexes) or it can have really good performance (if it can make use of an index).

Burundi answered 15/1, 2018 at 15:23 Comment(0)

Glib answer: because the query planner thought it was faster.

Best guess: When you have the index, the query planner can use the order that it reads data out of the indexes to do the nested loop without a sort, faster than a hash. Without the index it would do a sort, and the combination of sort + loop is slower than hash.

Bikol answered 15/1, 2018 at 15:21 Comment(0)

No one seems to have answered this question. Neither execution plan uses indexes, so why when you create the indexes does the planner change from using HJ to using NL. My guess would be that the after creating the indexes there are better table and column stats available than before. I don't know if creating the indexes cause table stats to be populated or if the planner is using statistics on the indexes to choose the path. Would be interesting to trace the planner and find out. I'm an Oracle person and could do it there. I don't know off the top of my head how do do that analysis on PostgreSQL.

If you look for example at the sequential scan on

students HJ rows estimate is 1130 NL rows estimate is 1 actual rows is 4

grades HJ rows estimate is 270 NL rows estimate is 3 actual rows is 3

the NL stats are more accurate, so something has changed about the stats after you created the indexes. Quick test is to drop the indexes and see what the planner does. If it still does NLs with same stats, the it must be a change in the stats on the tables. If it goes back to HJ with the HJ estimates it must be stats on the indexes.

Gets weirder. So I drop the indexes, and it continues to use NL. Fine, makes sense. The create indexes must have updated table stats. So I drop the tables, recreate and then instead of creating the indexes, I analyze the tables. Guess what, it uses HJ but ... the stats are accurate, like the were with the NL. more questions ... Guessing the create index must do some stats analysis different from simply running analyze.

Savoy answered 2/6, 2020 at 22:23 Comment(1)

If I recall correctly creating indexes implicitly creates column stats. Those don’t go away when indexes are dropped and can easily swing CBO evaluations. – Astounding 2/6, 2020 at 23:49

Recommended topics

Hot tags