Select values that meet different conditions on different rows
Asked Answered
K

6

34

Let's say I have a two-column table like this:

userid  |  roleid
--------|--------
   1    |    1
   1    |    2
   1    |    3
   2    |    1

I want to get all distinct userids that have roleids 1, 2 AND 3. Using the above example, the only result I want returned is userid 1. How do I do this?

Kaka answered 25/1, 2009 at 1:1 Comment(0)
C
31
SELECT userid
FROM UserRole
WHERE roleid IN (1, 2, 3)
GROUP BY userid
HAVING COUNT(DISTINCT roleid) = 3;

Just thinking out loud, another way to write the self-join described by cletus is:

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid
JOIN userrole t3 ON t2.userid = t3.userid
WHERE (t1.roleid, t2.roleid, t3.roleid) = (1, 2, 3);

This might be easier to read for you, and MySQL supports comparisons of tuples like that. MySQL also knows how to use covering indexes intelligently for this query. Just run it through EXPLAIN and see "Using index" in the notes for all three tables, which means it's reading the index and doesn't even have to touch the data rows.

I ran this query over 2.1 million rows (the Stack Overflow July data dump for PostTags) using MySQL 5.1.48 on my MacBook, and it returned the result in 1.08 seconds. On a decent server with enough memory allocated to innodb_buffer_pool_size, it should be even faster.

To anyone reading this: my answer is simple and straightforward, and got the 'accepted' status, but please do go read the answer given by cletus. It has much better performance.

Cassandracassandre answered 25/1, 2009 at 1:33 Comment(0)
L
121

Ok, I got downvoted on this, so I decided to test it:

CREATE TABLE userrole (
  userid INT,
  roleid INT,
  PRIMARY KEY (userid, roleid)
);

CREATE INDEX ON userrole (roleid);

Run this:

<?php
ini_set('max_execution_time', 120); // takes over a minute to insert 500k+ records

$start = microtime(true);

echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo "Selct DB error: " . mysql_error() . "\n";
}

$users = 200000;
$count = 0;
for ($i=1; $i<=$users; $i++) {
    $roles = rand(1, 4);
    $available = range(1, 5);
    for ($j=0; $j<$roles; $j++) {
        $extract = array_splice($available, rand(0, sizeof($available)-1), 1);
        $id = $extract[0];
        query("INSERT INTO userrole (userid, roleid) VALUES ($i, $id)");
        $count++;
    }
}

$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;

echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";

function query($str) {
    mysql_query($str);
    if (mysql_error()) {
        echo "$str: " . mysql_error() . "\n";
    }
}
?>

Output:

499872 users added.
Program ran for 56.5513510704 seconds.
Insert time 0.000113131663847 seconds.

That adds 500,000 random user-role combinations and there are approximately 25,000 that match the chosen criteria.

First query:

SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3

Query time: 0.312s

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1

Query time: 0.016s

That's right. The join version I proposed is twenty times faster than the aggregate version.

Sorry but I do this for a living and work in the real world and in the real world we test SQL and the results speak for themselves.

The reason for this should be pretty clear. The aggregate query will scale in cost with the size of the table. Every row is processed, aggregated and filtered (or not) through the HAVING clause. The join version will (using an index) select a subset of the users based on a given role, then check that subset against the second role and finally that subset against the third role. Each selection (in relational algebra terms) works on an increasingly small subset. From this you can conclude:

The performance of the join version gets even better with a lower incidence of matches.

If there were only 500 users (out of the 500k sample above) that had the three stated roles, the join version will get significantly faster. The aggregate version will not (and any performance improvement is a result of transporting 500 users instead of 25k, which the join version obviously gets too).

I was also curious to see how a real database (ie Oracle) would deal with this. So I basically repeated the same exercise on Oracle XE (running on the same Windows XP desktop machine as the MySQL from the previous example) and the results are almost identical.

Joins seem to be frowned upon but as I've demonstrated, aggregate queries can be an order of magnitude slower.

Update: After some extensive testing, the picture is more complicated and the answer will depend on your data, your database and other factors. The moral of the story is test, test, test.

Lurdan answered 25/1, 2009 at 1:5 Comment(6)
This dv didn't come from me...but seriously...would you put this in your system?Edgebone
Seems like something is missing. What are t2 and t3? Aliases of the same table? Would probably work if it were modified to be like that.Distended
Joins are almost always faster than equivalent aggregate queries. But I wouldn't sacrfice clarity unless there was a requirement for speed (determined by profiling)...Druce
MySQL has an optional AS. This should be SELECT t1.userid FROM userrole AS t1 JOIN userrole AS t2 ON t1.userid = t2.userid AND t2.roleid = 2 JOIN userrole AS t3 ON t2.userid = t3.userid AND t3.roleid = 3 AND t1.roleid = 1Pergola
Does it make any difference if the roleId tests for t2 and t3 are moved to the WHERE clause? It seems to me that (conceptually at least) they belong there, and I would hope it doesn't affect performance - but I haven't tested.Soelch
Just to be picky: be aware that the two queries aren't equivalent when there are double entries in the table, and if NULLs are present int the role_id columns, the results might vary as well. Although I'd guess the table has a pk on both columns.Ala
C
31
SELECT userid
FROM UserRole
WHERE roleid IN (1, 2, 3)
GROUP BY userid
HAVING COUNT(DISTINCT roleid) = 3;

Just thinking out loud, another way to write the self-join described by cletus is:

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid
JOIN userrole t3 ON t2.userid = t3.userid
WHERE (t1.roleid, t2.roleid, t3.roleid) = (1, 2, 3);

This might be easier to read for you, and MySQL supports comparisons of tuples like that. MySQL also knows how to use covering indexes intelligently for this query. Just run it through EXPLAIN and see "Using index" in the notes for all three tables, which means it's reading the index and doesn't even have to touch the data rows.

I ran this query over 2.1 million rows (the Stack Overflow July data dump for PostTags) using MySQL 5.1.48 on my MacBook, and it returned the result in 1.08 seconds. On a decent server with enough memory allocated to innodb_buffer_pool_size, it should be even faster.

To anyone reading this: my answer is simple and straightforward, and got the 'accepted' status, but please do go read the answer given by cletus. It has much better performance.

Cassandracassandre answered 25/1, 2009 at 1:33 Comment(0)
S
5

The classic way to do this is to treat it as a relational division problem.

In English: Select those users for whom none of the desired roleid values is missing.

I'll assume you have a Users table to which the UserRole table refers, and I'll assume the desired roleid values are in a table:

create table RoleGroup(
  roleid int not null,
  primary key(roleid)
)
insert into RoleGroup values (1);
insert into RoleGroup values (2);
insert into RoleGroup values (3);

I'll also assume all the relevant columns are not NULLable, so there are no surprises with IN or NOT EXISTS. Here's a SQL query that expresses the English above:

select userid from Users as U
where not exists (
  select * from RoleGroup as G
  where not exists (
    select R.roleid from UserRole as R
    where R.roleid = G.roleid
    and R.userid = U.userid
  )
);

Another way to write it is this

select userid from Users as U
where not exists (
  select * from RoleGroup as G
  where G.roleid not in (
    select R.roleid from UserRole as R
    where R.userid = U.userid
  )
);

This may or may not end up being efficient, depending on indexes, platform, data, etc. Search the web for "relational division" and you'll find a lot.

Sulphanilamide answered 7/8, 2009 at 20:8 Comment(2)
can you explain a bit more i.e. what every sub-query does ?Chap
It’s pretty much a direct translation of what I wrote at the top, here again with parenthetical annotations: Select those users (outermost SELECT) for whom none (first NOT EXISTS) of the desired roleid values (innermost SELECT) Is missing (innermost NOT). (“None is missing” is the same as “there is not any that are not among the ones that are there”jSulphanilamide
P
4

Assuming userid, roleid are contained in a unique index (meaning there cannot be 2 records where userid = x and roleid = 1

select count(*), userid from t
where roleid in (1,2,3)
group by userid
having count(*) = 3
Pelasgian answered 25/1, 2009 at 1:34 Comment(0)
S
1
select userid from userrole where userid = 1
intersect
select userid from userrole where userid = 2
intersect
select userid from userrole where userid = 3

Won't this solve the problem? How good a solution is this on typical Relational DBs? Will query optimizer auto optimize this?

Starobin answered 26/11, 2010 at 7:8 Comment(2)
This would have been the perfect answer for any self respecting RDBMS. Not mysql though: bugs.mysql.com/bug.php?id=31336Corpsman
The intent is not clear (for instance, Stack Overflow is not a forum), but questions do not belong in an answer, even if rhetorical.Maisey
E
-6

If you need any kind of generality here (different 3-role combinations or different n-role combinations)...I'd suggest you use a bit masking system for your roles and use the bitwise operators to perform your queries...

Edgebone answered 25/1, 2009 at 1:27 Comment(1)
-1 Terrible idea. Use a relational database as a relational database.Lurdan

© 2022 - 2024 — McMap. All rights reserved.