Arangodb AQL Filter NOT IN collection, very slow
Asked Answered
S

2

5

I want to find the set of users not having a profile.

ArangoDB 2.4.3

LENGTH(users) -> 130k
LENGTH(profiles) -> 110k

users.userId -> unique hash index
profiles.userId -> unique hash index

This AQL snippet I made is slower than a snail crossing the Grand Canyon in mid-summer.

LET usersWithProfiles = ( /* This part is ok */
FOR i IN users
    FOR j IN profiles
        FILTER i.userId == j.userId
RETURN i
)

LET usersWithoutProfiles = ( /* This is not */
FOR i IN usersWithProfiles
    FILTER i NOT IN users
RETURN i
)

RETURN LENGTH(usersWithoutProfiles)

I'm pretty sure there is a perfectly sane way of doing it right, but I'm missing it. Any ideas?

Edit 1 (After @dothebart 's response):

This is the new query, but it is still very slow

LET userIds_usersWithProfile = (
FOR i IN users
    FOR j IN profile
        FILTER i.userId == j.userId
RETURN i.userId
)

LET usersWithoutProfiles = (
FOR i IN users 
    FILTER i.userId NOT IN userIds_usersWithProfile
RETURN i
)

RETURN LENGTH(usersWithoutProfiles)
Sullage answered 10/2, 2015 at 17:13 Comment(0)
Q
8

Note also that this part of the original query was extremely expensive:

LET usersWithoutProfiles = (
  FOR i IN usersWithProfiles
    FILTER i NOT IN users
    RETURN i
)

The reason is the FILTER using users, which at this point is an expression that builds all documents from the collections as an array. Instead of using this, I suggest this query, which will return the _key attribute of users that do not have an associated profile record:

FOR user IN users 
  LET profile = (
    FOR profile IN profiles 
      FILTER profile.userId == user.userId 
      RETURN 1
  ) 
  FILTER LENGTH(profile) == 0 
  RETURN user._key
Quadragesima answered 10/2, 2015 at 18:26 Comment(1)
Fantastic, worked like charm! Thank you (and for the explanation as well)! :)Sullage
A
4

The reason for the poor performance is that it will not be able to utilize indices for your operation, since it needs to do a full compare of each document in the collection.

You can for shure use the explain https://www.arangodb.com/2015/02/02/arangodb-2-4-2 utility to let arangodb tell you where the expenses of your query are.

Your query will probably not do what you expect from it. usersWithoutProfiles will be empty, since any user with a Profile will be found in the users collection. If you want to have the other part of the users collection, it could look like that:

LET usersWithProfiles = ( /* This part is ok */
FOR i IN users
    FOR j IN profiles
        FILTER i.userId == j.userId
RETURN i
)

/* now we pick the IDs, we could have done that in your first query... */
LET userWithProfilesIds = FOR i IN userWithProfiles RETURN i.userId;

/* now filter the user list by that */
LET usersWithoutProfiles = FOR i IN users 
     FILTER i.userId NOT IN userWithProfileIds
     RETURN i;

RETURN LENGTH(usersWithoutProfiles)

should give you a proper result.

Arrearage answered 10/2, 2015 at 17:55 Comment(1)
Thank you, I can't believe I made that logical mistake on the second part :o Nevertheless, the query is still slow :( - I restarted the db after 5 minutes of 100% cpu.Sullage

© 2022 - 2024 — McMap. All rights reserved.