PostgreSQL - order by an array

Asked 27/3, 2013 at 16:40 Answered 28/3, 2013 at 5:0

Solved sql postgresql sql-order-by aggregate set-returning-functions

I have 2 tables - course that contains id and name of the courses and tagCourse that contains tags for each course.

course                    tagcourse
------------            ----------------
PK id_course            PK tag
   name                 PK, FK id_course

I'd like to write a function that searches courses by given array of tags and returns them ordered by quantity of matching tags. However I don't know how to write it correctly and in an efficient way. Please help me.

ie.

CREATE OR REPLACE FUNCTION searchByTags(tags varchar[])
RETURNS SETOF.....
  RETURN QUERY SELECT * FROM course c INNER JOIN tagcourse tc ON c.id_course = tc.id_course
  WHERE ???  ORDER BY ???

END....

Geomorphic answered 27/3, 2013 at 16:40 Comment(1)

How many records are in the tables? Will the query be accompanied with some kind of LIMIT? – Boracic 28/3, 2013 at 14:38

CREATE OR REPLACE FUNCTION search_by_tags(tags varchar[])
  RETURNS TABLE (id_course integer, name text, tag_ct integer)
  LANGUAGE sql AS
$func$
   SELECT id_course, c.name, ct.tag_ct
   FROM  (
      SELECT tc.id_course, count(*)::int AS tag_ct
      FROM   unnest($1) x(tag)
      JOIN   tagcourse tc USING (tag)
      GROUP  BY 1                      -- first aggregate ..
      ) AS ct
   JOIN   course c USING (id_course)   -- .. then join
   ORDER  BY ct.tag_ct DESC            --  more columns to break ties?
$func$;

Use unnest() to produce a table from your input array, like already demonstrated by @Clodoaldo.

You don't need plpgsql for this. Simpler with a plain SQL function.

I use unnest($1) (with positional parameter) instead of unnest(tags), since the later is only valid for PostgreSQL 9.2+ in SQL functions (unlike plpgsql). The manual:

In the older numeric approach, arguments are referenced using the syntax $n: $1 refers to the first input argument, $2 to the second, and so on. This will work whether or not the particular argument was declared with a name.

count() returns bigint. You need to cast it to int to match the declared return type or declare the the returned column as bigint to begin with.

Perfect occasion to simplify the syntax a bit with USING (equi-joins): USING (tag) instead of ON tc.tag = c.tag.

It's regularly faster to first aggregate, then join to another table. Reduces the needed join operations.
To address @Clodoaldo's comments, here is a fiddle demonstrating the difference:

db<>fiddle here
_{Old sqlfiddle}

OTOH, if you aggregate after the join, you don't need a subquery. Shorter, but probably slower:

SELECT c.id_course, c.name, count(*)::int AS tag_ct
FROM   unnest($1) x(tag)
JOIN   tagcourse tc USING (tag)
JOIN   course     c USING (id_course)
GROUP  BY 1
ORDER  BY 3 DESC;  --  more columns to break ties?

Bilge answered 28/3, 2013 at 5:0 Comment(5)

I never catch details like count() returns bigint without testing it. Did you produce a test? If not then the only explanation is that you must have a really abnormal mind :) – Koenraad 28/3, 2013 at 11:35

About the It's regularly faster to first aggregate. Do you mean the planner regularly does not do that? Could you point to some tests or documentation? – Koenraad 28/3, 2013 at 11:36

@ClodoaldoNeto: I have burnt my fingers a couple of times with count() vs. int, so I see this instantly. You will probably see it too, now. :) And I added a fiddle above to substantiate my point and to play with. – Bilge 28/3, 2013 at 12:33

I always though the planner would care about and solve it but now I guess it is more complex to achieve than my naivety supposed. So for now on I will pay attention to it. Since aggregating some result from joined tables is very common why didn't I ever see any advice on this? Did you discover it by your self or by reading documentation/comments/posts/lists? – Koenraad 28/3, 2013 at 12:53

@ClodoaldoNeto: I am not sure any more how I first found out. I have been running a lot of performance tests over the years. It's like I start to think like the Postgres query planner when I write SQL. Also, it seems reasonable when you look at it: after collapsing all the duplicates, you have far fewer joins to make. It avoids an intermediary Cartesian product (a limited cross coin). Of course, if the planner was smart enough, it might detect the optimization itself. But trying to be too smart is a dangerous thing for the planner. – Bilge 28/3, 2013 at 13:0

create or replace function searchByTags(tags varchar[])
returns table (id_course integer, name text, quantitiy integer)
as $$
    select *
    from (
        select c.id_course, c.name, count(*) quantity
        from
            course c
            inner join
            tagcourse tc on c.id_course = tc.id_course
            inner join
            unnest(tags) s(tag) on s.tag = tc.tag
        group by c.id_course, c.name
    ) s
    order by quantity desc, name
    ;
$$ language sql;

Koenraad answered 27/3, 2013 at 16:59 Comment(0)

Recommended topics

Hot tags