I have database with user subscriptions to topics. There is currently about 20 000 topics, 20 mln users and 200 mln subscriptions stored in SQL database. Because of its size, the database is partitioned by topics, so I can't get the info in one database query. There are couple of topics with 10 mln subscriptions, couple with 100 000 and others have hundreds or less.
When an event occurs, it usually matches couple of topics, so to inform users, I need to perform query like "give me all users subscribed to topics x, y, z and perform union of sets", so that one user gets the news once even if he subscribed both topics x and z.
The constraints are:
- There must be no duplicates in the union set. (users can't get the content twice)
- There can be bounded amount of users missing from the union set. (if sometimes user doesn't get the content, it is not that bad, but it can't be always the same user for the same topic)
- It is possible to subscribe to new topic without rebuilding whole thing.
I thought about using set of bloom filters for every topic, but they constraints are the other way round: "user either not subscribed for sure or probably subscribed". I need something like "user subscribed for sure or probably not".
Lossy hash tables might be good idea, but I am not sure, if they can be as memory efficient as bloom filters and I am afraid, that it would be always the same user, that is missing the content in his topic.
Do you know any other data structures, that mey be good for solving this problem?
p
(that you will have to calculate) and that reduces to your question of is it in the subset for sure or probably not. – Hentrichp
are going to be too low. – Hentrich