Is there a probabilistic data structure for storing relations?
I have a database with user subscriptions to topics. There are currently about 20,000 topics, 20 million users and 200 million subscriptions stored in an SQL database. Because of its size, the database is partitioned by topic, so I can't get all the info in one database query. There are a couple of topics with 10 million subscriptions, a couple with 100,000, and the others have hundreds or fewer.

When an event occurs, it usually matches a couple of topics, so to inform users I need to perform a query like "give me all users subscribed to topics x, y and z and take the union of the sets", so that a user gets the news only once even if he subscribed to both topics x and z.

The constraints are:

  • There must be no duplicates in the union set. (Users can't get the content twice.)
  • A bounded number of users may be missing from the union set. (If a user sometimes doesn't get the content, it is not that bad, but it can't always be the same user for the same topic.)
  • It must be possible to subscribe to a new topic without rebuilding the whole thing.

I thought about using a set of Bloom filters, one per topic, but their guarantees are the other way round: "the user is either not subscribed for sure, or probably subscribed". I need something like "the user is subscribed for sure, or probably not".

Lossy hash tables might be a good idea, but I am not sure whether they can be as memory efficient as Bloom filters, and I am afraid that it would always be the same user missing the content for his topic.

Do you know any other data structures that may be good for solving this problem?

Kingdom answered 12/9, 2015 at 16:13 Comment(22)
What is the number of topics per event?Soundproof
Usually about 5. Minimum 2.Kingdom
Why isn't a regular association table with user_id and topic_id enough? You don't need to partition the association table.Insular
Is it a possible approach (not probabilistic) to iterate through the topics and append results in a temporary table, then remove or somehow handle duplicates in the temporary table? (Too slow I guess?)Tamasha
@dfens: The association table is 200 mln rows long. Times 2 (for two ids), times 4 (the number of bytes in an integer) is 1 600 000 000 bytes, around 2 GB. Without partitioning, queries are slow.Kingdom
@A.S.H: We do exactly that. We check if the user is in a hashtable. If he is not, then we add him and send the notification. This way, we can start sending notifications before the whole union operation is completed. But this takes time, and given the loose constraints, I wanted to check if there is a better approach.Kingdom
@Kingdom how about storing this 2GB of association data in a separate database? If you have a valid index on both columns (topic_id + user_id), the query should not take a significantly long time.Insular
@Kingdom, I have yet another question on this very interesting problem. The constraint of "never alerting twice": is it a real hard constraint? I mean, is the performance satisfactory if this constraint is relaxed?Tamasha
@tkowal: nowadays, 2GB is not that big. Just get a machine with enough RAM to hold the full table and the indexes (because you have indexes, right?) and forget about the problem.Hentrich
Is the basic process for this database batch-wise (periodically sending notifications to users about topics that they subscribed to), or is it intended to be real-time/online?Subequatorial
@salva: We do have indexes. Maybe this problem really doesn't require probabilistic data structures. It would be cool to come up with one, though :)Kingdom
@wildplasser: Both. Usually we know in advance that we need to send those messages at a given time, but sometimes we need to inform about something just after it happens.Kingdom
@tkowal: you can reverse the Bloom filter by pushing in all the elements that are not in the subset, so that later, when you check a given element, you know either that it is not in the "not subscribed" set for sure, or that it probably is with some probability p (which you will have to calculate). That reduces to your question of "is it in the subset for sure, or probably not".Hentrich
@tkowal: ... but I think that in practice that is not going to work. I haven't done the math, but I believe the values of p are going to be too low.Hentrich
@Kingdom 200M is not so big that you should need a leaky data structure. If each user can be rep'd bare bones as an 8-byte id, then it's only ~1.6 GB. This would be a good application for a Redis server. It provides memory-resident (but snapshotted) maps and sets directly and is quite fast. You'd store a topic -> (set of user ids) map. You'd have to union over topics on the fly in any case. The approximation would be in keeping the Redis server consistent with actual subscriptions. But you'd have the same problem with deliberately approximate ds's.Eastertide
@Kingdom I should have said Redis has a command to compute the topics union on the server as well...Eastertide
A different design choice for the problem that may be relevant: have the process that queues or sends messages be what prevents "duplicate" messages from being sent. This has some other nice benefits (idempotence, for one) in a more robust and parallelizable generating process. Checking against a hash of a combination of the user_id and message content would be one approach (and it doesn't suffer from the concern about users consistently missing data from a given topic). From a purely practical perspective, Redis is a great... although finding a scalable answer to the question is interesting.Cariole
@Kingdom just an idea that came up: what if we use a Bloom filter (instead of the hash table) in the table of notifications that we are generating? That is, before adding a user, we check his hash in the Bloom filter: if there is no match, we add him; if there is a match, then he has probably already been notified. The goal is to replace the hash table (which has to be big if we want efficiency) with an appropriately sized Bloom filter (see the sketch after this comment thread).Tamasha
Adding to what @Tamasha said, you may want to use a randomly generated number (may need to be a crypto random number rather than pseudorandom) to act as the user's hash, which could be changed each time it's used (or after some number of times it's used). This makes it less likely that a specific user will be singled out for the same query as their hash changes each time. The chance of two users with the same generated hash should be much less than the chance of a false positive with the bloom filter, so shouldn't be an issue there.Bunyabunya
@Bunyabunya yes, this is a good idea to satisfy this constraint, but probably not completely. I think we still need a way to change the order of appearance of the users, because even if we randomly change the hash for each event, the users who appear later (when scanning the subscribed users for a topic) will always be somewhat less lucky, as they are more likely to miss the notification once the Bloom filter starts to fill up.Tamasha
If generating these things as one-off event notifications, you can just check against the actual user_id (which applies to my comment above, as well); from a practical perspective (memory, time) there's no need to do anything fancier.Cariole
@A.S.H, you could use the random hash as the order in which they are processed. Specifically, you could do an approximate sort rather than a full sort, as the elements near the ends are more important than the middle. You could also shuffle the ID numbers if they are in order and run the code in that order, though shuffling would only be practical periodically due to its O(N) cost.Bunyabunya
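
A minimal Python sketch of the per-event Bloom-filter deduplication idea from the last few comments (illustrative only: the class and helper names are invented, and the filter size would have to be tuned to the expected audience):

import hashlib
import os

class EventDedupFilter:
    """Bloom filter used only while generating one event's notifications.
    A negative lookup means "definitely not notified yet"; a positive lookup
    (rarely a false positive) makes us skip the user for this event."""

    def __init__(self, num_bits: int, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.salt = os.urandom(16)  # fresh random salt per event, as suggested above

    def _positions(self, user_id: int):
        data = self.salt + user_id.to_bytes(8, "little")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(data, person=i.to_bytes(2, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def mark_if_new(self, user_id: int) -> bool:
        """Return True at most once per user per event; a false positive drops a user."""
        positions = list(self._positions(user_id))
        already_seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return not already_seen

The generating loop would then be roughly: for each topic of the event, for each subscriber of that topic, send the notification only if mark_if_new(user_id) returns True. Because the salt changes per event, the user hit by a false positive changes too.
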
C
1

This might not be the solution you were looking for, but you could utilize Elasticsearch's terms filter and have one document like this for each user:

{
  "id": 12345,
  "topics": ["Apache", "GitHub", "Programming"]
}

The terms filter directly answers the query "which users subscribe to at least one of these topics", and ES is very smart about how to cache and re-utilize filters.

It wouldn't be a probabilistic data structure, but it would solve this problem very efficiently. You'd need to use the scan API for retrieving the potentially large JSON responses. If necessary, you can scale this solution to billions of users spread over multiple computers and get response times of 10 - 100 milliseconds. You could also find correlations between topics (significant terms aggregation) and use ES as an engine for further analysis.
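
As a rough illustration of the indexing side (not part of the original answer: the index name "subscriptions", the mapping type "user", and load_rows_from_sql are all made-up placeholders), the per-user documents could be bulk-loaded with the official Python client:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])

def subscription_docs(rows):
    # rows is expected to yield (user_id, [topic_id, ...]) pairs read from the SQL database
    for user_id, topics in rows:
        yield {
            "_index": "subscriptions",
            "_type": "user",                      # ES 1.x-era mapping type
            "_id": user_id,
            "_source": {"id": user_id, "topics": topics},
        }

helpers.bulk(es, subscription_docs(load_rows_from_sql()))  # load_rows_from_sql is hypothetical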


Edit: I implemented the searching and scan / scroll API usage in Python and got some interesting results. I ran "users who subscribe to any of these three topics" queries against the 20M-user, 200M-subscription dataset, and in general the search itself finishes in 4 - 8 milliseconds. Queries return 350,000 - 750,000 users.

Problems arise from getting the user ids out of ES, even with the scan/scroll API. On a Core i5 I seem to get only 8,200 users / second, so it is less than 0.5 million / minute (with "_source": false). The query itself looks like this:

{
  "filtered": {
    "filter": {
      "terms": {
        "topics": [ 123, 234, 345 ],
        "execution": "plain",
        "_cache": false
      }
    }
  }
}

In production I would use "execution": "bool" so that partial query results can be cached and re-utilized by other queries. I don't know what the bottleneck is in getting results out; the server's CPU usage is 50% and I run the client's Python script on the same machine, utilizing elasticsearch.helpers.scan.
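
For completeness, a sketch of the retrieval side with elasticsearch.helpers.scan (the index and type names are the same assumptions as above, not something the answer specifies):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["localhost:9200"])

query = {
    "query": {
        "filtered": {
            "filter": {
                "terms": {"topics": [123, 234, 345], "execution": "bool"}
            }
        }
    },
    "_source": False,   # we only need the ids, as in the measurements above
}

user_ids = set()
for hit in scan(es, index="subscriptions", doc_type="user", query=query, size=5000):
    user_ids.add(int(hit["_id"]))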

Caddoan answered 22/9, 2015 at 18:43 Comment(5)
This is about as reasonable as my answer. It looks like the Elasticsearch filter loads documents into a bit set using the terms, so it's pretty much doing my suggestion, only abstracting away the bit field.Plectognath
Logically, you're correct. Physically, Elasticsearch runs circles around a DB table-scan of 20MM records. Lucene would store these topics in an inverted-index (like the index of a book, with subscriber-ids in place of page-numbers)... so the initial resolution of Topic -> Subscribers would require a quick seek and contiguous read (if not already cached in memory). As a result, Elasticsearch would serve this up in 10s of milliseconds, whereas MySQL would have to examine the subscription bitset for each subscriber to test membership for the Topic (likely 10s of seconds with 20MM subscribers).Caneghem
Yeah I feel like doing a real benchmark on this with X users and Y topics, naturally topics' popularity should follow the power law to be realistic. I am also curious about the scan api which I haven't used yet in practice. I'm sure plain counts would be 5 - 20 ms but larger responses might take longer to generate and receive to the client.Caddoan
All the answers were great and gave me something to test against. I am accepting this one, because using Elasticsearch sounds pretty easy to implement and test. I didn't know how it works under the hood, but if it really is as efficient as bit sets, then it is a huge win. @NikoNyrh: You are right, we need benchmarks to see if this is actually the best answer. Thank you guys!Kingdom
I generated a dataset with a power-law distribution: 22,560 topics, 20 million users, 200 million subscriptions in total (using int ids instead of strings to represent topics); it takes 2.9 GB of disk space. Queries (for 3 topics) take 5 - 25 ms on a Core i5 CPU; haven't implemented Scan API utilization yet. Example configs and queries: gistCaddoan
P
2

What if each user record had a BIT FIELD representing all of the topics?

CREATE TABLE Users (ID INT, UserName VARCHAR(16), Topics BINARY(8000))

A BINARY(8000) column would allow you to have 64,000 topics. I would probably use multiple columns of BINARY(1024) each so I could add more topics easily.

Now when an event comes in that's tagged for topics 1, 10, 20, 30 and 40, I have to search every user, but this can be parallelized and will always be O(N) complexity, where N is the total number of users.

SELECT ID
FROM Users WITH (READPAST)
WHERE
    -- the bit for topic t lives in byte t / 8 + 1 (SUBSTRING is 1-based), at bit position t % 8
    (SUBSTRING(Topics, 1 / 8 + 1, 1) & POWER(2, 1 % 8)) > 0
    OR
    (SUBSTRING(Topics, 10 / 8 + 1, 1) & POWER(2, 10 % 8)) > 0
    OR
    (SUBSTRING(Topics, 20 / 8 + 1, 1) & POWER(2, 20 % 8)) > 0
    OR
    (SUBSTRING(Topics, 30 / 8 + 1, 1) & POWER(2, 30 % 8)) > 0
    OR
    (SUBSTRING(Topics, 40 / 8 + 1, 1) & POWER(2, 40 % 8)) > 0
OPTION (MAXDOP 64)

  • No duplicates: we're scanning Users once, so we don't have to worry about unions.
  • Some users missing: the READPAST hint will skip any rows that are currently locked (being updated), so some users may be missing from the result.
  • Subscribe: you can [un]subscribe to a topic simply by toggling the topic's bit in the Topics column.
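
The byte/bit arithmetic the query relies on, as a minimal Python sketch (illustrative only; it assumes topic numbers start at 0 and that the Topics column is read back as raw bytes):

def is_subscribed(topics_blob: bytes, topic: int) -> bool:
    """True if the topic's bit is set in the user's bit field."""
    byte_index = topic // 8        # which byte holds this topic's bit (0-based here)
    bit_mask = 1 << (topic % 8)    # which bit within that byte
    return bool(topics_blob[byte_index] & bit_mask)

def toggle_subscription(topics_blob: bytearray, topic: int) -> None:
    """Subscribe or unsubscribe by flipping the topic's bit."""
    topics_blob[topic // 8] ^= 1 << (topic % 8)
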
Plectognath answered 17/9, 2015 at 18:55 Comment(0)
E
2

As I said in comments, a memory-based exact solution is certainly feasible.

But if you really want an approximate data structure, then what you're looking for is a size-limited set (of users for each topic) with random eviction.

You also need to compute unions rapidly on the fly when queries arrive. There's no helpful pre-computation here. If topic sets tend to repeat, you can look at caching the frequently used unions.

All the usual methods of representing a set apply. Hash tables (both closed and open), trees, and skip lists (all containing user id keys; no values required) are most likely.

If you use a closed hash table (open addressing) with a good hash function, pseudo-random eviction happens automatically: on collision, just replace the previous value. The problem with closed hashes is always picking a good table size for the set you need to represent. Remember that to recover the set elements you'll have to traverse the whole table, including null entries, so starting with a huge table isn't a good idea; rather, start with a small one and reorganize, growing by a factor each time, so that reorganization amortizes to constant time overhead per element stored.
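
A minimal sketch of the replace-on-collision idea (fixed capacity for brevity; the grow-and-reorganize detail above is left out, and the hash choice is just an example):

import hashlib

class ClosedHashLossySet:
    """One slot per bucket; a colliding insert simply overwrites the previous
    entry, which is the pseudo-random eviction described above."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity

    def _index(self, user_id: int) -> int:
        digest = hashlib.blake2b(user_id.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % len(self.slots)

    def add(self, user_id: int) -> None:
        self.slots[self._index(user_id)] = user_id   # may evict whoever was there

    def members(self):
        return (uid for uid in self.slots if uid is not None)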

With the other schemes, you can literally do pseudo-random eviction when the table gets too big. The easiest way to evict fairly is to store the user ids in a table and have the size-limited set store indices. Evict by generating a random index into the table and removing that id before adding a new one.
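
And a sketch of that fair random eviction for the other representations: keep the ids in a list plus an id-to-position map, so both membership checks and eviction are O(1):

import random

class RandomEvictionSet:
    """Size-limited set with uniform random eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.ids = []    # the id table described above
        self.pos = {}    # user id -> index into self.ids

    def add(self, user_id: int) -> None:
        if user_id in self.pos:
            return
        if len(self.ids) >= self.capacity:
            victim_index = random.randrange(len(self.ids))
            victim = self.ids[victim_index]
            last = self.ids.pop()
            del self.pos[victim]
            if victim_index < len(self.ids):     # move the last id into the hole
                self.ids[victim_index] = last
                self.pos[last] = victim_index
        self.pos[user_id] = len(self.ids)
        self.ids.append(user_id)

    def members(self):
        return iter(self.ids)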

It's also possible to evict fairly from a BST set representation by using an order statistic tree: store the number of descendants in each node. Then you can always find the n-th element in key-sorted order, where n is pseudo-random, and evict it.
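
For the order-statistic variant, a small sketch where the third-party sortedcontainers package stands in for a real rank-indexed BST (an assumption for brevity, not what the answer prescribes):

import random
from sortedcontainers import SortedList  # pip install sortedcontainers

class RankEvictionSet:
    """Evicts the n-th smallest key for a pseudo-random n, as described above."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.keys = SortedList()

    def add(self, user_id: int) -> None:
        if user_id in self.keys:
            return
        if len(self.keys) >= self.capacity:
            self.keys.pop(random.randrange(len(self.keys)))  # rank-based eviction
        self.keys.add(user_id)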

I know you were looking for the bitwise space efficiency of a Bloom filter, but guaranteeing no false positives seems to rule that out.

Eastertide answered 22/9, 2015 at 3:33 Comment(0)
C
1

[This solution is similar to Louis Ricci's, except inverted to the Topics table - which could make subscription updates less practical, be warned!]

(The probabilistic data structure approach is cool, but unnecessary for your current data size. I was initially looking at compressed bitsets for a non-probabilistic solution, as they are great at performing set operations in memory, but I think that's overkill as well. Here is a good implementation for this type of use case, if you're interested.)

But looking at the sparsity of your data, bitsets waste space compared to integer arrays. And even with integer arrays, the union operation is still pretty inexpensive, given that you only have an average of 10,000 subscriptions per topic.


So maybe, just maybe, a dead-simple data-structure given your use-case is simply:

Topic 1 => [array of subscriber IDs]
Topic 2 => [array of subscriber IDs]
...
Topic 20,000 => [array of subscriber IDs]

Storing (avg) 10,000 subscriber IDs (assuming 32-bit integers) only requires about 40 KB of space per topic.

[In an array-type or BLOB, depending on your database]

With 20,000 topics, this adds only 800 MB of data to your topic table ... and very little of this (~200 KB avg) needs to be loaded into memory when a notification event occurs!


Then when an average event (affecting 5 topics) occurs, all that needs to happen is (sketched after this list):

  1. Query / pull the data for the relevant topics (avg 5 records) into memory (avg ~200 KB of I/O)

  2. Dump them into a Set data structure (de-duplicates subscriber list)

  3. Alert the subscribers in the Set.
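
A minimal sketch of those three steps (load_subscribers and notify are placeholders for the real BLOB read and delivery code, not part of the answer):

def handle_event(event_topics, load_subscribers, notify):
    """load_subscribers(topic_id) is assumed to return that topic's array of
    subscriber ids, decoded from the BLOB / array column described above."""
    recipients = set()                       # step 2: de-duplicate across topics
    for topic_id in event_topics:            # step 1: pull ~5 records into memory
        recipients.update(load_subscribers(topic_id))
    for user_id in recipients:               # step 3: alert the subscribers
        notify(user_id)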

Caneghem answered 23/9, 2015 at 20:13 Comment(0)
