Count unique visitors with Redis or Aerospike

Asked 12/1, 2017 at 14:49 Answered 19/1, 2017 at 0:5

I am trying to count the unique visitors per page, or per other event (such as a click), for different clients. My plan is to assign each visitor a unique cookie-based GUID and then call SADD with that GUID for every event. The Redis key will be SET_[ EVENTID ]
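The plan above can be sketched roughly as follows. This is only an illustration: `InMemorySets` is a hypothetical stand-in so the snippet runs without a server, and `record_event` / `unique_visitors` are made-up helper names. The method names `sadd`, `scard` and `smembers` mirror the Redis commands of the same name, as exposed by the redis-py client.

```python
from collections import defaultdict


class InMemorySets:
    """Hypothetical stand-in exposing the three set calls used below."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *members):
        before = len(self._sets[key])
        self._sets[key].update(members)
        return len(self._sets[key]) - before  # number of newly added members

    def scard(self, key):
        return len(self._sets[key])

    def smembers(self, key):
        return set(self._sets[key])


def record_event(client, event_id, guid):
    """Add the visitor GUID to the per-event set; returns 1 if first seen."""
    return client.sadd(f"SET_{event_id}", guid)


def unique_visitors(client, event_id):
    """Return (count, members) for one event."""
    key = f"SET_{event_id}"
    return client.scard(key), client.smembers(key)


client = InMemorySets()
for guid in ["guid-a", "guid-b", "guid-a"]:
    record_event(client, "page_view", guid)

count, members = unique_visitors(client, "page_view")
print(count, sorted(members))
```

Passing a real redis-py client instead of the stand-in should work the same way, with the caveat that members come back as bytes unless the client is created with `decode_responses=True`.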

If I only wanted the count of users I could probably use PFADD, but my app also needs to know who the unique users are.

The problem is that if there are too many events or too many users, SADD will end up holding a lot of user IDs in memory. We are expecting 1000k+ user events every hour across all clients, and the number of events will also be 100+.

I would like an opinion on whether Redis is the correct storage choice. A traditional RDBMS approach does not work because of the sheer number of requests.

I am not sure whether another store, such as Aerospike, could help.

Rettke answered 12/1, 2017 at 14:49 Comment(2)

Could you please specify what you intend to do with all this information? What will be the typical use cases for "reading" this data? How many concurrent reads of the data will there be, and how often? This is essentially a question about data modeling IMO. – Publea 12/1, 2017 at 14:56

The counts are required to bill the customer, and we will have to show the list of users as a supporting document. The reads will be very sparse, probably once per day. I plan to flush them to a MySQL DB at the end of the day – Rettke 12/1, 2017 at 15:15

In real-time bidding (RTB), where Aerospike is used heavily, frequency capping is a common use case for Demand-Side Platforms (DSPs). A cap is placed on the number of times a user sees a particular ad, or ads from a specific campaign. At the same time, the total number of impressions is tracked, along with the remaining budget. These counters typically have a short TTL.

Solution

You could use a composite key <page ID : user ID : yyyymmdd> as a flag for whether a specific user had visited the page, with a 24h TTL. This would live in a set page-visit in an in-memory, data-in-index namespace.

If there is no such key:

Create a new record with this key in the set page-visit with an initial value of 1.
list-append the user ID to a key <page ID : yyyymmdd> in the set page-users. This set (page-users) can live in a namespace that stores its data on SSD.

If this key exists:

Increment the count of the record at this key. This will provide instantaneous unique visitor counts for each page.

At the end of the day:

Get the count for each page, as well as the list of unique users that visited that page.
Read the record with key <page ID : yyyymmdd> from the set page-users
Assemble a batch-read against the users set based on this list of user IDs.
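The steps above can be sketched in plain Python to make the control flow concrete. This is not Aerospike client code: the two dictionaries are hypothetical in-memory stand-ins for the page-visit and page-users sets, TTLs and concurrency are not modeled, and the function names are invented for illustration. The sketch also assumes the unique-visitor count is read as the length of the per-page user list.

```python
from datetime import date

# Hypothetical in-memory stand-ins for the two Aerospike sets in the answer.
page_visit = {}   # (page_id, user_id, yyyymmdd) -> visit count
page_users = {}   # (page_id, yyyymmdd) -> list of unique user IDs


def record_visit(page_id, user_id, day):
    """Apply the flag-then-append steps for one page view."""
    yyyymmdd = day.strftime("%Y%m%d")
    flag_key = (page_id, user_id, yyyymmdd)
    if flag_key not in page_visit:
        # First visit today: create the flag and append the user to the page list.
        page_visit[flag_key] = 1
        page_users.setdefault((page_id, yyyymmdd), []).append(user_id)
    else:
        # Repeat visit: only bump the per-user counter.
        page_visit[flag_key] += 1


def end_of_day_report(page_id, day):
    """Return (unique_count, user_ids) for one page and day."""
    users = page_users.get((page_id, day.strftime("%Y%m%d")), [])
    return len(users), list(users)


today = date(2017, 1, 19)
for user in ["u1", "u2", "u1", "u3", "u2"]:
    record_visit("home", user, today)

print(end_of_day_report("home", today))
```

The point of the split is that the append to the per-page list happens only once per user per day, while repeat visits touch only the small flag record.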

Advantages

Checking the page-visit flag is very low latency. It uses very little memory, as data-in-index namespaces take no additional space past the 64B of metadata each object in Aerospike costs. For example, 10M users * 64B * replication-factor 2 = 1.28GB (roughly 1.2GiB) of DRAM across the cluster.
The list of unique users per-page is stored on SSD with a much lower cost per-GB than an in-memory only database like Redis. You just pay 64B per-object for the metadata entry in the in-memory primary index. The list-append operation is very efficient, as you only send the latest user ID to be appended to the page-users record. You only use this operation when a new unique user appears on the page (guarded by the page-visit flag).
All these records have their 24h TTL, so you can let them expire.
Aerospike is a distributed key-value database that scales vertically to use all the cores on your server, and horizontally without your application requiring sharding as new nodes join. The data distribution is handled automatically by the server and tracked by the client without your application needing to change.
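The memory figure in the first advantage is simple arithmetic, and a small helper makes it easy to rerun for other volumes. The 64B per-object cost and the replication factor of 2 come from the answer; the function name is made up for illustration.

```python
INDEX_ENTRY_BYTES = 64   # per-object primary-index metadata, from the answer
REPLICATION_FACTOR = 2   # from the answer's example


def index_memory_bytes(num_objects, replication_factor=REPLICATION_FACTOR):
    """Cluster-wide DRAM needed for the primary-index entries."""
    return num_objects * INDEX_ENTRY_BYTES * replication_factor


total = index_memory_bytes(10_000_000)
print(total)                    # 1280000000 bytes
print(round(total / 1e9, 2))    # 1.28 (GB, decimal)
print(round(total / 2**30, 2))  # 1.19 (GiB, binary)
```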

Humble answered 19/1, 2017 at 0:5 Comment(0)

HyperLogLog & Redis

It sounds like what you might want is a HyperLogLog. It's a probabilistic data structure that lets you trade off accuracy in favor of a constant-size data structure. The nice thing is that the inaccuracy is bounded and determined by the data structure size. Using 1.5kB of memory gets you a unique count within about 2% of the right answer. Use more memory per counter, get more accuracy.
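To put numbers on that trade-off, here is a small sketch based on the standard-error estimate from the HyperLogLog paper, roughly 1.04 / sqrt(m) for m registers. The 6 bits per register is an assumption made for illustration, and the function names are invented.

```python
import math


def hll_standard_error(num_registers):
    """Typical relative error of HyperLogLog with m registers: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(num_registers)


def hll_memory_bytes(num_registers, bits_per_register=6):
    """Memory for the registers alone, assuming 6-bit registers."""
    return num_registers * bits_per_register // 8


for m in (2**11, 2**14):
    print(m, hll_memory_bytes(m), round(100 * hll_standard_error(m), 2))
```

With 2048 registers this gives about 1.5kB and an error a little above 2%, matching the figure quoted above. With 16384 registers it gives 12kB and about 0.81%, which is the size and standard error that Redis documents for its own HyperLogLog keys.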

Furthermore, this functionality is built into Redis.

HyperLogLog Wiki page: https://en.wikipedia.org/wiki/HyperLogLog

Relevant Redis blog post: http://antirez.com/news/75

HyperLogLog & VoltDB

If you're interested in a more traditional RDBMS model with much better HA support than Redis, take a look at VoltDB. It supports very high throughput on a single machine and also clusters natively. It has rich SQL support for many of the kinds of things Redis does (and more), built-in HyperLogLog support in SQL, and an example that counts unique IDs, which sounds a lot like what you're doing.

http://voltdb.com

Example with counting unique ids: https://github.com/VoltDB/voltdb/tree/master/examples/uniquedevices

Regenaregency answered 12/1, 2017 at 19:44 Comment(1)

But hyperloglog has the same problem. I will not be able to get members – Rettke 13/1, 2017 at 5:47

Another way to model this in Redis is, instead of using Sets, to use Bitmaps. In your case, maintain a mapping between GUIDs and an integer denoting the index in the bit array. You may want to consider using buckets for each event to avoid waste due to sparse metrics.

This approach is used by several Redis-backed analytics libraries; see Minuteman and bitmapist, for example.
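A minimal in-memory sketch of the bitmap idea, assuming one bitmap per event and a shared GUID-to-index mapping. `BitmapCounter` and its methods are hypothetical names, and a Python integer stands in for the Redis bitmap; in Redis the `record` and `count` steps would correspond to SETBIT and BITCOUNT. Bucketing to reduce sparsity is not shown.

```python
class BitmapCounter:
    """Hypothetical in-memory analogue of per-event Redis bitmaps."""

    def __init__(self):
        self.index_of = {}   # GUID -> bit index
        self.guid_at = []    # bit index -> GUID (reverse mapping)
        self.bitmaps = {}    # event ID -> Python int used as a bit array

    def _index(self, guid):
        # Assign the next free bit position the first time a GUID is seen.
        if guid not in self.index_of:
            self.index_of[guid] = len(self.guid_at)
            self.guid_at.append(guid)
        return self.index_of[guid]

    def record(self, event_id, guid):
        # Set the visitor's bit in the event's bitmap (SETBIT analogue).
        bit = 1 << self._index(guid)
        self.bitmaps[event_id] = self.bitmaps.get(event_id, 0) | bit

    def count(self, event_id):
        # Count set bits (BITCOUNT analogue).
        return bin(self.bitmaps.get(event_id, 0)).count("1")

    def members(self, event_id):
        # Recover the GUIDs by walking the reverse mapping.
        bits = self.bitmaps.get(event_id, 0)
        return [g for i, g in enumerate(self.guid_at) if (bits >> i) & 1]


counter = BitmapCounter()
for event, guid in [("click", "a"), ("view", "b"), ("click", "c"), ("click", "a")]:
    counter.record(event, guid)

print(counter.count("click"), counter.members("click"))
```

Unlike HyperLogLog, the reverse mapping keeps the member list recoverable, at the cost of storing each GUID once in the mapping rather than once per event.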

Danettedaney answered 12/1, 2017 at 15:4 Comment(0)
