Advice on pubsub topic division based on geohashes for ably websocket connection service

My question concerns the following use case:

Use case actors

  • User A: The user who sets a broadcast region and views stream with live posts.
  • User B: The first user who sends a broadcast message from within the broadcast region set by user A.
  • User C: The second user who sends a broadcast message from within the broadcast region set by user A.

Use case description

  • User A selects a broadcast region within which boundaries (radius) (s)he wants to receive live broadcast messages.
  • User A opens the livefeed and requests an initial set of livefeed items.
  • User B broadcasts a message from within the broadcast region of user A while user A’s livefeed is still open. A label with 1 new livefeed item appears at the top of User A’s livefeed while it is open.
  • As user C publishes another livefeed post from within the selected broadcast region from user A, the label counter increments.

User A receives a notification similar to the one Facebook shows (image not included).

The solution I have in mind (and which I think PubNub uses) is to create a topic per geohash.
In my case, every broadcast message would be published to the topic for the geohash of the sender's location, and clients (app / website users) would consume a geohash topic through a websocket if it falls within their defined area (radius). Ably seems to provide this kind of scalable service using websockets.

Simplified, I guess it would look something like this (diagram not included).

This means that a geohash needs to be derived from the location from which the broadcast message is sent. The geohash should be fine-grained enough for the receiving user to set a broadcast region that is more or less accurate. (In other words, if we want users to define their own broadcast region, the cells must be small, which means we should expect a very large number of topics at scale.)
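
To make the extraction step concrete, here is a minimal sketch of standard geohash encoding in Python. The function names `geohash_encode` and `topic_for`, as well as the `geo:` topic prefix, are assumptions made for illustration; they are not part of Ably's or PubNub's API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, bits = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            if bit:
                lon_lo = mid
            else:
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        value = (value << 1) | bit
        bits += 1
        use_lon = not use_lon
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def topic_for(lat: float, lon: float, precision: int = 7) -> str:
    """Hypothetical topic naming scheme: one topic per geohash cell."""
    return f"geo:{geohash_encode(lat, lon, precision)}"
```

A shorter geohash is always a prefix of a longer one for the same point, so the publisher can derive the coarse and the fine topic name from a single encoding.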

Option 2 would be to create topics for a geohash that has a less specific granularity (covering a larger area), and let clients handle the accuracy based on latlng values that are sent along with the message.
The client would then decide whether or not to drop messages. However, this means more messages are sent (more overhead), and a higher cost.
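
A rough sketch of what the client-side drop decision in option 2 could look like, using the haversine great-circle distance. The message shape (a dict with `lat` and `lng` keys) and the function names are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_message(msg: dict, center_lat: float, center_lon: float, radius_km: float) -> bool:
    """Return True if the message originates inside the subscriber's radius."""
    return haversine_km(center_lat, center_lon, msg["lat"], msg["lng"]) <= radius_km
```

The check itself is cheap; the cost of option 2 is in the messages that are delivered only to be dropped.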

I don't have experience with this kind of architecture, and question the viability / scalability of this approach.
Could you think of an alternate solution to this question to achieve the desired result or provide more insight on how to solve this kind of problem overall? (I also considered using regular req-res flow, but this means spamming the server, which also doesn't seem like a very good solution).

I actually checked.
For reference, the region of Brussels covers 161.4 km², and geohash cell sizes by string length are as follows:

Length  Cell width  ×   Cell height
1   ≤ 5,000km   ×   5,000km
2   ≤ 1,250km   ×   625km
3   ≤ 156km     ×   156km
4   ≤ 39.1km    ×   19.5km
5   ≤ 4.89km    ×   4.89km
6   ≤ 1.22km    ×   0.61km
7   ≤ 153m      ×   153m
8   ≤ 38.2m     ×   19.1m
9   ≤ 4.77m     ×   4.77m
10  ≤ 1.19m     ×   0.596m
11  ≤ 149mm     ×   149mm
12  ≤ 37.2mm    ×   18.6mm

Given that we would allow users an inaccuracy of up to 153m (on the region to which they subscribe to receive local broadcast messages), the number of topics required seems too large already, even to cover just the region of Brussels.
So I'm still a bit stuck at this level currently.
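
As a back-of-the-envelope check, the number of cells needed to tile the region can be estimated by dividing its area by the cell area from the table above. This is only a sketch: the cell sizes are the upper bounds listed in the table, and since cells get narrower away from the equator, the real count for Brussels would be somewhat higher. The names `CELL_KM` and `min_cells` are made up for this example.

```python
import math

# Upper-bound cell sizes in km (width, height), taken from the table above.
CELL_KM = {
    5: (4.89, 4.89),
    6: (1.22, 0.61),
    7: (0.153, 0.153),
    8: (0.0382, 0.0191),
}


def min_cells(region_km2: float, precision: int) -> int:
    """Lower-bound estimate of the cells (topics) needed to tile a region."""
    width, height = CELL_KM[precision]
    return math.ceil(region_km2 / (width * height))


if __name__ == "__main__":
    for precision in sorted(CELL_KM):
        print(precision, min_cells(161.4, precision))
```

For 161.4 km² this gives at least 217 cells at length 6 and at least 6,895 cells at length 7.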

Adipocere answered 4/6, 2018 at 13:57 Comment(0)

1. PubNub

PubNub is currently the only service I found that offers an out-of-the-box geohash pub-sub solution over websockets, but their pricing is extremely high (500 connected devices cost about $49, 20k devices cost $799). UPDATE: PubNub has updated its pricing, now with unlimited devices. Website updates coming soon.

PubNub is working on its pricing model because some of its customers were paying a lot for unexpected spikes in traffic.

However, it will not be a viable solution for a generic broadcast messaging app that is open to everybody, and whose traffic is therefore highly unpredictable.

This is a pity, since this service would have been the perfect solution for us otherwise.

2. Ably

Ably offers a pubsub system to stream data to clients over websockets for custom channels. Channels are created dynamically when a client attaches itself in order to either publish or subscribe to that channel.

The main problem here is that:

  • If we want high geohash accuracy, we need a high number of channels and hence we have to pay more;
  • If we go with low geohash accuracy, there will be a lot of redundant messaging: Let's say that we take a channel that is represented by a geohash of 4 characters, spanning a geographical area of 39.1 x 19.5 km.

Any post that gets sent to that channel, would be multiplexed to everybody within that region who is currently listening.

However, let's say that our app allows for a maximum radius of 10km, and half of the connected users have set theirs to 1km.

This means that all posts outside of that 1km radius will be delivered to these users unnecessarily, and will simply be dropped without serving any purpose.
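
To put a rough number on the redundancy, the sketch below compares the area of a subscriber's circle with the area of the channel's cell. It assumes posts are spread uniformly over the cell and that the circle lies entirely inside it, neither of which will hold exactly in practice; `useful_fraction` is a name invented for this example.

```python
import math


def useful_fraction(radius_km: float, cell_w_km: float, cell_h_km: float) -> float:
    """Share of a cell's posts that fall inside a subscriber's circle,
    assuming uniformly distributed posts and a circle fully inside the cell."""
    circle_area = math.pi * radius_km ** 2
    cell_area = cell_w_km * cell_h_km
    return min(1.0, circle_area / cell_area)


if __name__ == "__main__":
    # 4-character geohash cell: 39.1 km x 19.5 km
    print(f"1 km radius:  {useful_fraction(1, 39.1, 19.5):.2%}")
    print(f"10 km radius: {useful_fraction(10, 39.1, 19.5):.2%}")
```

Under these assumptions, a user with a 1km radius would use about 0.4% of the messages on a 4-character channel, and a user with a 10km radius about 41%.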

We should also take into account the scalability of this approach. For every geohash that either producer or consumer needs, another channel will be created.

It is definitely more expensive to have an app that requires topics based on geohashes worldwide, than an app that requires only theme-based topics.

That is, with worldwide adoption, the number of topics increases dramatically, and so does the price.

Another consideration is that our app requires an additional number of channels:

  • By geohash and group: Our app allows the possibility to create geolocation based groups (which would be the equivalent of Twitter like #hashtags).
  • By place
  • By followed users (premium feature)

Despite this, there are a few reasons for optimism about this approach:

  • Streaming is only required when the newsfeed is active: when the user has a browser window open with our website + when the user is on a mobile device, and actively has the related feed open
  • Further optimisation can be done, e.g. only start streaming as from 10 to 20 seconds after refresh of the feed
  • Streaming by place / followed users may have high traffic depending on current activity, but many place channels will be idle as well

A very important note in this regard is how Ably bills its customers, which can be used to our full advantage:

A channel is opened when any of the following happens:

  • A message is published on the channel via REST
  • A realtime client attaches to the channel. The channel remains active for the entire time the client is attached to that channel, so if you connect to Ably, attach to a channel, and publish a message but never detach the channel, the channel will remain active for as long as that connection remains open.

A channel that is open will automatically close when all of the following conditions apply:

  • There are no more realtime clients attached to the channel
  • At least two minutes has passed since the last message was published. We keep channels alive for two minutes to ensure that we can provide continuity on the channel as part of our connection state recovery.

As an example, if you have 10,000 users, and at your busiest time of the month there is a single spike where 500 customers establish a realtime connection to Ably and each attach to one unique channel and one global shared channel, the peak number of channels would be the sum of the 500 unique channels per client and the one global shared channel i.e. 501 peak channels. If throughout the month each of those 10,000 users connects and attaches to their own unique channel, but not necessarily at the same time, then this does not affect your peak channel count as peak channels is the concurrent number of channels open at any point of time during that month.
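
The peak-channel idea quoted above can be sketched as a sweep over attach/detach events. This is a simplified model, not Ably's billing logic: it assumes each channel stays open from the first attach until a fixed linger time (120 seconds here) after the last detach, which over-approximates the two-minute rule. The session tuple layout and the name `peak_channels` are assumptions for illustration.

```python
def peak_channels(sessions, linger_s=120):
    """Peak number of concurrently open channels.

    sessions: iterable of (channel_name, attach_time_s, detach_time_s).
    """
    # Collect the open spans per channel, extended by the linger time.
    spans_by_channel = {}
    for channel, attach, detach in sessions:
        spans_by_channel.setdefault(channel, []).append((attach, detach + linger_s))

    # Merge overlapping spans of the same channel so it is counted only once.
    events = []
    for spans in spans_by_channel.values():
        spans.sort()
        start, end = spans[0]
        for s, e in spans[1:]:
            if s <= end:
                end = max(end, e)
            else:
                events += [(start, 1), (end, -1)]
                start, end = s, e
        events += [(start, 1), (end, -1)]

    # Sweep through time; at equal timestamps closes (-1) sort before opens (+1).
    events.sort()
    peak = open_now = 0
    for _, delta in events:
        open_now += delta
        peak = max(peak, open_now)
    return peak
```

With 500 clients attached at the same time to one unique channel each plus one shared channel, this returns 501; the same clients connecting at different times of the month do not raise the peak.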

Optimistic conclusion

The most important conclusion is that this feature may not be as crucial as we believe it is for a first version of the app.

Although Twitter, Facebook, etc. offer this feature of receiving live updates (and users have grown to expect it), an initial beta of our app on a limited scale can work without it, i.e. the user has to refresh in order to receive new updates.

During a first launch of the app, statistics can be gathered to gain more insight into detailed user behaviour. This will enable us to base our infrastructural and financial decisions on factual data.

Adipocere answered 9/6, 2018 at 14:25 Comment(7)
in your first post you've made an assumption on the high degree of locality, ie. finer resolution, that might be demanded of your users which you reconsider in your second post - would it matter if you switched to a nearest-neighbour algorithm to determine channel participation? maybe your users just want company and don't care about how far away people actually are - Tweed
(apparently 'enter' submits!?) anyway, i was going to also say that i came across a commercial websocket service that allows a large number of messages across thousand(s) of peers for dollars per month - drawing short of making a recommendation (since i haven't actually used this service commercially) wsninja.io and their pricing: 1000 Connections, 10M Messages per Day, $4.95/mo or Unlimited Connections, 1K Messages per Second, $29.95/mo - Tweed
@Tweed Actually proximity matters, in the sense that the target would be cities, and that realtime walking distance is an assumption. For example when you see something in realtime nearby, you probably don't want to walk 15km to the destination. I'll check out the link, many thanks. 1000 connections at one point in time (not aggregated over the month) may be reasonable. I guess that it depends on user behaviour, which we won't know until after launch. - Adipocere
yep thanks for clarifying - and another question: do the users need to catch up on historical messages since they were last connected, or is this just live-feed only? and, do you expect any sort of bias between the location-based and topical filtering (ie - logical 'tags' as distinct from a pubsub queue concept 'topic') - or, when you talk about group, are you talking about a 'group of users'? - Tweed
@Tweed With 'group', you could say something like 'Couchsurfing in Brussels'. It would be like a hashtag. I don't have a real bias although the livefeed is the primary feed. Then again, the groups are part of the business model. So it's not really a pick and choose. - Adipocere
and what about historical catchup? or is the feed more like a radio broadcast? - Tweed
@Tweed Yes that will be required too. Because live posts for some places may be from earlier, but may still be useful to see. - Adipocere

Putting aside the question of Ably, Pubnub and a DIY solution, the core of the question is this:

Where is message filtering taking place?

There are three possible solutions:

  1. The Pub/Sub service.

  2. The Server (WebSocket connection handler).

  3. Client side (the client's device).

Since this is obviously a mobile-oriented approach, client-side message filtering is extremely wasteful: it increases the client's data consumption even though much of the data might be irrelevant.

Client-side filtering will also increase battery consumption and will likely result in lower acceptance rates by clients.

This leaves pub/sub filtering (channel names / pattern matching) and server-side filtering.

Pub/Sub channel name filtering

A single pub/sub service serves a number of servers (if not all of them), making it a very expensive resource (relative to the resources we have at hand).

Using channel names to filter messages would be ideal - as long as the filtering is cheap (using exact matches with channel name hash mapping).

However, pattern matching (when subscribing to channels with inexact names, such as "users.*") is very expensive compared to exact matching.

This means that Pub/Sub channel name filtering can't be used to filter all the messages without overloading the pub/sub system.
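
The cost difference can be illustrated with a toy subscription index. This is a sketch of the general idea, not of how any particular pub/sub service is implemented: exact names are resolved with a single hash lookup, while every pattern subscription has to be tested against each published channel name. The class `ChannelIndex` is hypothetical.

```python
from fnmatch import fnmatchcase


class ChannelIndex:
    """Toy subscription index separating exact names from glob patterns."""

    def __init__(self):
        self.exact = {}     # channel name -> set of client ids
        self.patterns = []  # list of (glob pattern, client id)

    def subscribe(self, name, client):
        if any(ch in name for ch in "*?["):
            self.patterns.append((name, client))
        else:
            self.exact.setdefault(name, set()).add(client)

    def recipients(self, channel):
        # Exact subscriptions: one dictionary lookup, independent of index size.
        found = set(self.exact.get(channel, ()))
        # Pattern subscriptions: every pattern is checked on every publish.
        for pattern, client in self.patterns:
            if fnmatchcase(channel, pattern):
                found.add(client)
        return found
```

The pattern loop grows with the number of pattern subscribers and runs on every published message, which is why exact channel names scale better.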

Server side filtering

Since a server accepts WebSocket connections and bridges between the WebSocket and the pub/sub service, it's in an ideal position to filter the messages.

However, we don't want the server to process all the messages for all the clients for each connection, as this is an extreme duplication of effort.

Hybrid solution

A classic solution would divide the earth into manageable sections (1 sq. km per section would require 510.1 million unique channel names for full coverage, but the roughly 70% that is ocean can be ignored, which leaves about 150 million).

Busy sections might be subdivided (NYC might require a section per 250 sq meters rather than 1 sq kilometer).

This allows publishers to publish to exact channel names and subscribers to subscribe to exact channel names.

Publishers might need to publish to more than one channel and subscribers might need to subscribe to more than one channel, depending on their exact location and the grid's borders.

This filtering scheme will filter much, but not all.

The server node will need to look into the message, check its exact geolocation and decide whether it should be sent along the WebSocket connection to the client.
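
A minimal sketch of the coarse part of this scheme, assuming a fixed latitude/longitude grid rather than geohashes. The cell size `CELL_DEG`, the `grid:` channel naming and both function names are assumptions for illustration, and the code ignores the poles and the antimeridian. A publisher posts to the single cell it is in; a subscriber subscribes to every cell touched by the bounding box of its radius, and the server then applies the exact distance check per message.

```python
import math

CELL_DEG = 0.01          # assumed cell size in degrees (about 1.1 km north-south)
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def cell_channel(lat: float, lon: float) -> str:
    """Exact channel name for the grid cell containing a point (publisher side)."""
    row = math.floor(lat / CELL_DEG)
    col = math.floor(lon / CELL_DEG)
    return f"grid:{row}:{col}"


def channels_for_radius(lat: float, lon: float, radius_km: float) -> set:
    """Exact channel names covering the bounding box of a circle (subscriber side)."""
    dlat = radius_km / KM_PER_DEG_LAT
    dlon = radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    rows = range(math.floor((lat - dlat) / CELL_DEG),
                 math.floor((lat + dlat) / CELL_DEG) + 1)
    cols = range(math.floor((lon - dlon) / CELL_DEG),
                 math.floor((lon + dlon) / CELL_DEG) + 1)
    return {f"grid:{r}:{c}" for r in rows for c in cols}
```

The bounding box includes cells and corners that lie outside the circle, which is exactly the remainder the server-side filter has to remove.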

Why the Hybrid Solution?

This allows the system to scale with relative ease.

Since server nodes are (by design) cheaper than the pub/sub service, they could be used to handle the exact location filtering (the heavy work).

At the same time, the strength of the pub/sub system can be used to minimize the server's workload and filter the obvious mis-matches.

Pubnub vs. Ably?

I don't know. I didn't use either of them. I worked with Redis and implemented my own pub/sub solution.

I assume they are both great and it's really up to your needs.

Personally I prefer the DIY approach when it comes to customized or complex situations. IMHO, this seems like it would fall into the DIY category if I were to implement it.

Melaniemelanin answered 12/6, 2018 at 21:25 Comment(5)
The server can filter the posts, for example using Redis geospatial, but how would you let clients subscribe to a pubsub system to receive a stream of geolocation based posts, if not through a sort of geohashing system? Do you suggest to implement websockets ourselves in our servers immediately? The main reason for using an external service would be to avoid the pain of needing to scale websockets at affordable price (the overhead of doing this ourselves probably wouldn't come for free either). - Adipocere
@Adipocere , yes, I definitely recommend that you implement the WebSocket / SSE connections yourselves. This could be relatively simple work but it has a huge impact on pub/sub performance and allows work distribution across servers. If everyone connects directly to the pub/sub service, it could overload the service. Consider that each server might serve thousands of clients and use only a handful of pub/sub connections. It’s a more sustainable and scalable design that will be easier to maintain. - Melaniemelanin
I'm split between two worlds currently. I've noticed that Stephen Blum has updated some text in my current solution regarding Pubnub's pricing. This would make using Pubnub an ideal solution since they seem to already offer geohash-based topics out of the box and would manage the scalability / performance concerns so we don't have to deal with that. But I could indeed also return streams over web sockets to the clients on request. I'll give it some time to think it over, thanks! - Adipocere
@Adipocere , you’re welcome. As a parting thought I might consider looking into the way Pubnub is implemented and the details (or limits) of their service / features. If my answer helped at all, please feel free to upvote (no need to accept if it’s not the solution you go for). - Melaniemelanin
Yes, I'm going to check out how it could be done with Pubnub. Still there would be challenges, the client having to subscribe to a multitude of topics depending on their defined radius. Maybe this would not be ideal or I can find tradeoffs. I'll think it over, thanks! - Adipocere

© 2022 - 2024 — McMap. All rights reserved.