In Postgres, how to match multiple "tags" for best performance?
Asked Answered
H

4

8

Table: articles

+--------+------+------------+
| id     | title|    created |
+--------+------+------------+
|    201 | AAA  | 1482561011 |
|    202 | BBB  | 1482561099 |
|    203 | CCC  | 1482562188 |
+--------+------+------------+

Table: taggings

+-----------+------+
| articleid | tagid|
+-----------+------+
|    201    | 11   |
|    201    | 12   |
|    202    | 11   |
|    202    | 13   |
|    202    | 14   |
+-----------+------+

Now if given 3 tag ids, what is the best index design and query to select latest 10 articles that each article match the 3 tag ids at the same time?
I know there can be several ways to do it, but I'm concerning the performance, considering there maybe tens of thousands of articles in each tag

Hypochondria answered 24/12, 2016 at 7:11 Comment(3)
query to select latest 10 articles - please explain how do you define latest article ? Is there a date column in some table, which is not shown in the question ? Or does the latest mean the highest values in id columns?Selangor
@Selangor I added a column "created" to the table. And yes latest is highest value in id column. The id is "int serial".Hypochondria
This might be interesting for you: databasesoup.com/2015/01/tag-all-things.htmlLanded
O
22

As a_horse_with_no_name mentioned, this blog post has some really interesting performance benchmarks for finding rows matching more than one tag:

http://www.databasesoup.com/2015/01/tag-all-things.html

Storing tags in an array column in the main table and creating a GIN-index allows rows to be selected like this, without any joins:

select id
from articles
where tags @> array[11,13,14]
order by created desc
limit 10;

The column and index can be created like this:

alter table articles add column tags text[] not null default '{}';
create index tags_index on articles using gin (tags);

According to the blog, finding rows matching two tags was between 8 and 895 times faster when using the array-column than when joining in a tags table.

Overtop answered 29/7, 2018 at 11:50 Comment(2)
Has it changed since 2015?Barely
Would be better to see a new test with the latest version of PG. is there any citation on that?Fleabitten
L
2
select distinct on (a.id) a.*
from articles a 
  join taggings t on t.articleid = a.id
group by a.id 
having array_agg(t.tagid order by t.tagid) = array[11,13,14]
order by a.id, a.created
limit 10;

An index on taggings (articleid, tagid) will help for this.

Note that the above looks for articles with exactly those three tags. If you want to find those with at least those three tags (and possibly more) you can change the having clause to use the "contains" operator:

select distinct on (a.id) a.*
from articles a 
  join taggings t on t.articleid = a.id
where t.tagid in (11,13,14)
group by a.id 
having array_agg(t.tagid) @> array[11,13,14]
order by a.id, a.created
limit 10;

In that case the order by for array_agg() is not necessary

Landed answered 24/12, 2016 at 9:57 Comment(0)
B
1

You need have an index on articles.created for sorting, and another unique index on taggings(articleid, tagid) for querying:

CREATE INDEX ON articles(created);
CREATE UNIQUE INDEX ON taggings(articleid, tagid);

Then just make a select query with three taggings table aliases:

SELECT a.* FROM articles a, taggings t1, taggings t2, taggings t3
    WHERE a.id=t1.articleid AND a.id=t2.articleid AND a.id=t3.articleid
    AND t1.tagid=111 AND t2.tagid=222 AND t3.tagid=333
    ORDER BY created DESC LIMIT 10;
Burning answered 24/12, 2016 at 8:34 Comment(0)
D
0

In my case I had to apply a complex condition to every "tag", and @> was of no use. So I found another approach: a grid sensor array:

ARRAY[
  bool_or(true if tag1 is present), 
  bool_or(true if tag2 is present),
  ...
] = ARRAY[true, true, ...]

Example:

SELECT a.*
FROM articles a JOIN tags t ON(t.articleid = a.id)
GROUP BY a.id
HAVING 
  ARRAY[
    bool_or(t.tagid == 11),
    bool_or(t.tagid == 13),
    bool_or(t.tagid == 14)
  ] == ARRAY[true, true, true]

it has poor performance, but great flexibility for many2many relationships.

Duodecimo answered 31/8, 2020 at 16:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.