PostgreSQL where all in array
Asked Answered
S

9

29

What is the easiest and fastest way to achieve a clause where all elements in an array must be matched - not only one when using IN? After all it should behave like mongodb's $all.

Thinking about group conversations where conversation_users is a join table between conversation_id and user_id I have something like this in mind:

WHERE (conversations_users.user_id ALL IN (1,2))

UPDATE 16.07.12

Adding more info about schema and case:

  1. The join-table is rather simple:

                  Table "public.conversations_users"
         Column      |  Type   | Modifiers | Storage | Description 
    -----------------+---------+-----------+---------+-------------
     conversation_id | integer |           | plain   | 
     user_id         | integer |           | plain   | 
    
  2. A conversation has many users and a user belongs to many conversations. In order to find all users in a conversation I am using this join table.

  3. In the end I am trying to figure out a ruby on rails scope that find's me a conversation depending on it's participants - e.g.:

    scope :between, ->(*users) {
      joins(:users).where('conversations_users.user_id all in (?)', users.map(&:id))
    }
    

UPDATE 23.07.12

My question is about finding an exact match of people. Therefore:

Conversation between (1,2,3) won't match if querying for (1,2)

Sugarplum answered 13/7, 2012 at 10:21 Comment(6)
Can you add some sample input and output data to make that a bit clearer?Albite
Thank you for your comment, @a_horse_with_no_name. Added case and schema.Sugarplum
While looking for conversations between users (1,2), do you also want one between (1,2,3) in the result, or only conversations between (1,2) - and nobody else?Nembutal
@ErwinBrandstetter Only between (1,2)Sugarplum
In this case, you need the commented part in my answer. Or you can use the second query in Gordon's answer. All other answers so far fall short in that respect - you did not declare it explicitly, too.Nembutal
@ErwinBrandstetter sorry, you're right - my faultSugarplum
C
34

Assuming the join table follows good practice and has a unique compound key defined, i.e. a constraint to prevent duplicate rows, then something like the following simple query should do.

select conversation_id from conversations_users where user_id in (1, 2)
group by conversation_id having count(*) = 2

It's important to note that the number 2 at the end is the length of the list of user_ids. That obviously needs to change if the user_id list changes length. If you can't assume your join table doesn't contain duplicates, change "count(*)" to "count(distinct user_id)" at some possible cost in performance.

This query finds all conversations that include all the specified users even if the conversation also includes additional users.

If you want only conversations with exactly the specified set of users, one approach is to use a nested subquery in the where clause as below. Note, first and last lines are the same as the original query, only the middle two lines are new.

select conversation_id from conversations_users where user_id in (1, 2)
   and conversation_id not in
   (select conversation_id from conversations_users where user_id not in (1,2))
group by conversation_id having count(*) = 2

Equivalently, you can use a set difference operator if your database supports it. Here is an example in Oracle syntax. (For Postgres or DB2, change the keyword "minus" to "except.)

select conversation_id from conversations_users where user_id in (1, 2)
  group by conversation_id having count(*) = 2
minus
  select conversation_id from conversations_users where user_id not in (1,2)

A good query optimizer should treat the last two variations identically, but check with your particular database to be sure. For example, the Oracle 11GR2 query plan sorts the two sets of conversation ids before applying the minus operator, but skips the sort step for the last query. So either query plan could be faster depending on multiple factors such as the number of rows, cores, cache, indices etc.

Churl answered 21/7, 2012 at 5:28 Comment(6)
There is no use in actually counting violating rows. We know enough as soon as we find one. An EXISTS semi-join is generally faster in such a case.Nembutal
This doesn't count violating rows. It just filters them out as part of the where clause. The top level where clause takes effect before any counting is done for the having clause.Churl
Right, my first sentence is not precise, you are not actually counting. No need to collect all violating rows, that's the correct sentence. The part about EXISTS being faster still applies, though. Don't get me wrong, I upvoted your answer, because it's simple and smart. My comment is just about squeezing out a bit more performance.Nembutal
For Your first query, You can use ... having count(distinct user_id) = 2, then You don't need the unique constraint.Hesper
@ErwinBrandstetter I tried your suggested variation of an exists semi-join on Oracle and got the same query plan as the last query. I've also seen cases where an exists semi-join helps performance, but it doesn't always help, I find the last query above easier to read.Churl
@AlexBlakemore: This question is about PostgreSQL and it makes a difference there. I find NOT EXISTS clearer, that's obviously a matter of taste.Nembutal
R
8

I'm collapsing those users into an array. I'm also using a CTE (the thing in the WITH clause) to make this more readable.

=> select * from conversations_users ;
 conversation_id | user_id
-----------------+---------
               1 |       1
               1 |       2
               2 |       1
               2 |       3
               3 |       1
               3 |       2
(6 rows)       

=> WITH users_on_conversation AS (
  SELECT conversation_id, array_agg(user_id) as users
  FROM conversations_users
  WHERE user_id in (1, 2) --filter here for performance                                                                                      
  GROUP BY conversation_id
)
SELECT * FROM users_on_conversation
WHERE users @> array[1, 2];
 conversation_id | users
-----------------+-------
               1 | {1,2}
               3 | {1,2}
(2 rows) 

EDIT (Some resources)

Romona answered 16/7, 2012 at 21:34 Comment(0)
C
4

This preserves ActiveRecord objects.

In the below example, I want to know the time sheets which are associated with all codes in the array.

codes = [8,9]

Timesheet.joins(:codes).select('count(*) as count, timesheets.*').
           where('codes.id': codes).
           group('timesheets.id').
           having('count(*) = ?', codes.length)

You should have the full ActiveRecord objects to work with. If you want it to be a true scope, you can just use your above example and pass in the results with .pluck(:id).

Chrisman answered 15/5, 2015 at 17:13 Comment(3)
I don't think this works as expected. This will return all time sheets, with at least one of those codes, and exactly two codes, but not necessarily both of them together.Dredger
Works perfectly for me (adapted to my tables), thanks for this! @Dredger The GROUP clause ensures a timesheet row is included as many times as it has matching codes, so if a timesheet only has 1 of the codes, the HAVING clause will exclude it. This only fails, as far as I can tell, if codes.id (or your replacement) is not unique =)Quathlamba
@HenryBlyth I don't remember what prompted my original comment, but I think you're correct.Dredger
N
3

While @Alex' answer with IN and count() is probably the simplest solution, I expect this PL/pgSQL function to be the faster:

CREATE OR REPLACE FUNCTION f_conversations_among_users(_user_arr int[])
  RETURNS SETOF conversations AS
$BODY$
DECLARE
    _sql text := '
    SELECT c.*
    FROM   conversations c';
    i int;
BEGIN

FOREACH i IN ARRAY _user_arr LOOP
    _sql  := _sql  || '
    JOIN   conversations_users x' || i || ' USING (conversation_id)';
END LOOP;

_sql  := _sql  || '
    WHERE  TRUE';

FOREACH i IN ARRAY _user_arr LOOP
    _sql  := _sql  || '
    AND    x' || i || '.user_id = ' || i;
END LOOP;

/* uncomment for conversations with exact list of users and no more
_sql  := _sql  || '
    AND    NOT EXISTS (
        SELECT 1
        FROM   conversations_users u
        WHERE  u.conversation_id = c.conversation_id
        AND    u.user_id <> ALL (_user_arr)
        )
*/

-- RAISE NOTICE '%', _sql;
RETURN QUERY EXECUTE _sql;

END;
$BODY$ LANGUAGE plpgsql VOLATILE;

Call:

SELECT * FROM f_conversations_among_users('{1,2}')

The function dynamically builds executes a query of the form:

SELECT c.*
FROM   conversations c
JOIN   conversations_users x1 USING (conversation_id)
JOIN   conversations_users x2 USING (conversation_id)
...
WHERE  TRUE
AND    x1.user_id = 1
AND    x2.user_id = 2
...

This form performed best in an extensive test of queries for relational division.

You could also build the query in your app, but I went by the assumption that you want to use one array parameter. Also, this is probably fastest anyway.

Either query requires an index like the following to be fast:

CREATE INDEX conversations_users_user_id_idx ON conversations_users (user_id);

A multi-column primary (or unique) key on (user_id, conversation_id) is just as well, but one on (conversation_id, user_id) (like you may very well have!) would be inferior. You find a short rationale at the link above, or a comprehensive assessment under this related question on dba.SE

I also assume you have a primary key on conversations.conversation_id.

Can you run a performance test with EXPLAIN ANALYZE on @Alex' query and this function and report your findings?

Note that both solutions find conversations where at least the users in the array take part - including conversations with additional users.
If you want to exclude those, un-comment the additional clause in my function (or add it to any other query).

Tell me if you need more explanation on the features of the function.

Nembutal answered 21/7, 2012 at 12:54 Comment(0)
P
1

create a mapping table with all possible values and use this

select 
    t1.col from conversations_users as t1 
    inner join mapping_table as map on t1.user_id=map.user_id
group by 
    t1.col  
having  
    count(distinct conversations_users.user_id)=
    (select count(distinct user_id) from mapping)
Pairoar answered 13/7, 2012 at 10:40 Comment(0)
H
1
select id from conversations where not exists(
    select * from conversations_users cu 
    where cu.conversation_id=conversations.id 
    and cu.user_id not in(1,2,3)        
)

this can easily be made into a rails scope.

Hesper answered 18/7, 2012 at 12:40 Comment(2)
actually, it's now not clear to me what is meant: find conversations between exactly the given people, betweeen exactly the people plus others, or between people from the given set (possibly not all of them) and not others? My answer is for the last case.Hesper
This will also select conversations with only some of the users or no users at all.Nembutal
A
1

I am guessing that you don't really want to start messing with temporary tables.

Your question was unclear as to whether you want conversations with exactly the set of users, or conversations with a superset. The following is for the superset:

with users as (select user_id from users where user_id in (<list>)
              ),
     conv  as (select conversation_id, user_id
               from conversations_users
               where user_id in (<list>)
              )
select distinct conversation_id
from users u left outer join
     conv c
     on u.user_id = c.user_id
where c.conversation_id is not null

For this query to work well, it assumes that you have indexes on user_id in both users and conversations_users.

For the exact set . . .

with users as (select user_id from users where user_id in (<list>)
              ),
     conv  as (select conversation_id, user_id
               from conversations_users
               where user_id in (<list>)
              )
select distinct conversation_id
from users u full outer join
     conv c
     on u.user_id = c.user_id
where c.conversation_id is not null and u.user_id is not null
Adler answered 20/7, 2012 at 1:23 Comment(0)
M
1

Based on @Alex Blakemore's answer, the equivalent Rails 4 scope on you Conversation class would be:

# Conversations exactly with users array
scope :by_users, -> (users) { 
                           self.by_any_of_users(users)
                             .group("conversations.id")
                             .having("COUNT(*) = ?", users.length) -
                           joins(:conversations_users)
                             .where("conversations_users.user_id NOT IN (?)", users)
}
# generates an IN clause
scope :by_any_of_users, -> (users) { joins(:conversations_users).where(conversations_users: { user_id: users }).distinct }

Note you can optimize it instead of doing a Rails - (minus) you could do a .where("NOT IN") but that would be really complex to read.

Mirellamirelle answered 24/6, 2016 at 20:26 Comment(1)
I dont unserstand this query . Can you explain it ?Carisacarissa
S
0

Based on Alex Blakemore answer

select conversation_id
from conversations_users cu
where user_id in (1, 2)
group by conversation_id 
having count(distinct user_id) = 2

I have found an alternative query with the same goal, finding the conversation_id of a conversation that contains user_1 and user_2 (ignoring aditional users)

select *
from conversations_users cu1
where 2 = (
    select count(distinct user_id)
    from conversations_users cu2
    where user_id in (1, 2) and cu1.conversation_id = cu2.conversation_id
)

It is slower according the analysis that postgres perform via explain query statement, and i guess that is true because there is more conditions beign evaluated, at least, for each row of the conversations_users the subquery will get executed as it is correlated subquery. The possitive point with this query is that you aren't grouping, thus you can select aditional fields of the conversations_users table. In some situations (like mine) it could be handy.

Sodden answered 31/10, 2019 at 21:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.