Rails: Performance issue with joining of records
Asked Answered
V

7

14

I have the following setup with ActiveRecord and MySQL:

  1. User has many groups through memberships
  2. Group has many users through memberships

There is also an index by group_id and user_id described in schema.rb:

add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree

3 different queries:

User.where(id: Membership.uniq.pluck(:user_id))

(3.8ms) SELECT DISTINCT memberships.user_id FROM memberships User Load (11.0ms) SELECT users.* FROM users WHERE users.id IN (1, 2...)

User.where(id: Membership.uniq.select(:user_id))

User Load (15.2ms) SELECT users.* FROM users WHERE users.id IN (SELECT DISTINCT memberships.user_id FROM memberships)

User.uniq.joins(:memberships)

User Load (135.1ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.id

What is the best approach for doing this? Why the query with join is much slower?

Vardar answered 14/10, 2015 at 14:28 Comment(2)
Please try with includes. I'm damn sure. It will take comparatively less time. User.uniq.includes(:memberships)Spicate
Unless you are planning to allow your users to be a member of the same group twice - you should make your index unique.Cryptozoite
R
9

The first query is bad because it sucks all of the user ids into a Ruby array and then sends them back to the database. If you have a lot of users, that's a huge array and a huge amount of bandwidth, plus 2 roundtrips to the database instead of one. Furthermore, the database has no way to efficiently handle that huge array.

The second and third approaches are both efficient database-driven solutions (one is a subquery, and one is a join), but you need to have the proper index. You need an index on the memberships table on user_id.

add_index :memberships, :user_id

The index that you already have, would only be helpful if you wanted to find all of the users that belong to a particular group.

Update:

If you have a lot of columns and data in your users table, the DISTINCT users.* in the 3rd query is going to be fairly slow because MySQL has to compare a lot of data in order to ensure uniqueness.

To be clear: this is not intrinsic slowness with JOIN, it's slowness with DISTINCT. For example: Here is a way to avoid the DISTINCT and still use a JOIN:

SELECT users.* FROM users
INNER JOIN (SELECT DISTINCT memberships.user_id FROM memberships) AS user_ids
ON user_ids.user_id = users.id;

Given all of that, in this case, I believe the 2nd query is going to be the best approach for you. The 2nd query should be even faster than reported in your original results if you add the above index. Please retry the second approach, if you haven't done so yet since adding the index.

Although the 1st query has some slowness issues of its own, from your comment, it's clear that it is still faster than the 3rd query (at least, for your particular dataset). The trade-offs of these approaches is going to depend on your particular dataset in regards to how many users you have and how many memberships you have. Generally speaking, I believe the 1st approach is still the worst even if it ends up being faster.

Also, please note that the index I'm recommending is particularly designed for the three queries you listed in your question. If you have other kinds of queries against these tables, you may be better served by additional indexes, or possibly multi-column indexes, as @tata mentioned in his/her answer.

Rubi answered 22/10, 2015 at 17:53 Comment(2)
'User Load (44.7ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.`id' Thanks, index helped, but this query still 3 times slower than with pluck or selectVardar
@Vardar I have updated my answer to address your comment.Rubi
T
3

The query with join is slow because it loads all columns from database despite of the fact that rails don't preload them this way. If you need preloading then you should use includes (or similar) instead. But includes will be even slower because it will construct objects for all associations. Also you should know that User.where.not(id: Membership.uniq.select(:user_id)) will return empty set in case when there is at least one membership with user_id equal to nil while the query with pluck will return the correct relation.

Tillfourd answered 22/10, 2015 at 15:14 Comment(1)
No. The query with the join does not load all columns from both tables. It is slow because it can't use the given index.Rubi
I
2

Below is more efficient solution:

User.exists?(id: Membership.uniq.pluck(:user_id))

join will fetch all the columns from membership table , so it will take more time while in the other queries. Here, you are only fetching rhe user_id from memberships. Calling distinct from users will slow down the query.

Indiscipline answered 17/10, 2015 at 10:21 Comment(2)
depending on the size of the Membership table running .uniq on it can also slow down the query.Cryptozoite
No. Using a join does not automatically fetch all the columns from the memberships table. Also, your proposed solution returns true or false, essentially answering the question "Is there at least one user who has a membership?", which is completely different than the original queries.Rubi
C
2

I think that you have a problem with the declaration of your indexes.

you declared an index as:

add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree If your primary key was ["user_id","group_id"] - you were good to go, but....

Making this in rails is not so trivial.

Therefore in order to query the data with JOIN with Users table - you need to have 2 indexes:

add_index "memberships", ["user_id", "group_id" ]

This is because of the way MySQL handles indexes (they are treated as concatenated strings)

You can read more about it here Multiple-Column Indexes

There are also other techniques to make it faster dependant on all your cases, but the suggested one is the simple one with ActiveRecord

Furthermore - I don't think that you need the .uniq here as the result should be unique anyway because of the terms on the table. Adding .uniq can make the MySQL to perform unnecessary sorting with filesort and usually it will also put a temporary table on disk.

You can run the command generated by rails directly on the mysql to check it with EXPLAIN

EXPLAIN <your command goes here>

Cryptozoite answered 22/10, 2015 at 15:5 Comment(2)
You are correct that the problem is the indexes. However, you don't need 2 indexes, and you don't need a multi-column index, at least for these particular queries. A single index just on user_id is the most helpful for queries 2 and 3.Rubi
This depends on your usage. If you need to query about all of the membership properties all the time - yes. it is enough. but if you need to get an info about a single group and your user might have thousands of them in the relation - your approach will not be sufficient and my approach - will work in both cases.Cryptozoite
T
1

@bublik42 and @user3409950 if I have to chose the Production environment Query then I would go for the First one:

User.where(id: Membership.uniq.pluck(:user_id))

Reason: Because it will use sql DISTINCT keyword to filter out the database result and then SELECT only 'user_id' column from the databse and return those values in a array form([1,2,3..]). Database level filtration of result is always faster than Active record query object.

For your second query:

User.where(id: Membership.uniq.select(:user_id))

It is same query as with the 'pluck' but with 'select' it will make a active record relation object with single field 'user_id'. In this query it has a overhead of building the active record object as: ([#<Membership user_id: 1>, #<Membership user_id: 2>, ... ], which was not the case for the first query. Though I haven't done any real bench marking for both, but the results are obvious with the steps followed by the queries.

Third case is expensive here because with 'Join' function It will fetch all the columns from memberships table and it will take more time to process the filtration of the result in comparison to other queries.

Thank you

Truitt answered 17/10, 2015 at 12:7 Comment(2)
No. For the second query, ActiveRecord is actually smart enough to do a subquery (look at the actual SQL posted in the question), so Rails is not loading those Membership records.Rubi
Thanking you @Rubi I am agree with your point for the first query for the large table that it will be slow.Truitt
K
0
SELECT  DISTINCT users.*
    FROM  users
    INNER JOIN  memberships
       ON memberships.user_id = users.id

is slower because it is performed something like this:

  1. Go through all of one table, collecting stuff as it goes.
  2. for each entry from step 1, reach into the other table.
  3. put that stuff into a tmp table
  4. dedup (DISTINCT) that table to deliver the results

If there are 1000 users and each has 100 memberships, then the table in step 3 will have 100000 rows, even though the answer will have only 1000 rows.

This is a "semi-join" and only checks that the user has at least one membership; it is much more efficient:

SELECT  users.*
    FROM  users  -- no DISTINCT needed
    WHERE  EXISTS 
      ( SELECT  *
            FROM  memberships ON memberships.user_id = users.id 
      ) 

If you don't really need that check, then this would be still faster:

SELECT users.*
    FROM  users

If Rails can't generate these queries, then grumble at it.

Keturahkeung answered 1/11, 2015 at 4:2 Comment(0)
S
-1

Here is a great example, demonstrating Include VS Join :

http://railscasts.com/episodes/181-include-vs-joins

Please try with includes. I'm damn sure. It will take comparatively less time.

User.uniq.includes(:memberships)
Spicate answered 21/10, 2015 at 15:57 Comment(1)
The original examples find users who have at least one membership. This query returns all users, regardless of whether they have a membership or not.Rubi

© 2022 - 2024 — McMap. All rights reserved.