Efficiently using a rate-limited API (Echo Nest) with distributed clients

Asked 28/8, 2012 at 6:56 Answered 1/3, 2013 at 20:16

Solved algorithm networking distributed-computing feedback throttling

Background

Echo Nest have a rate-limited API. A given application (identified in requests using an API key) can make up to 120 REST calls a minute. The service response includes an estimate of the total number of calls made in the last minute; repeated abuse of the API (exceeding the limit) may cause the API key to be revoked.

When used from a single machine (a web server providing a service to clients) it is easy to control access - the server has full knowledge of the history of requests and can regulate itself correctly.

But I am working on a program where distributed, independent clients make requests in parallel.

In such a case it is much less clear what an optimal solution would be. And in general the problem appears to be unsolvable: if more than 120 clients, all with no previous history, make an initial request at the same time, then the rate will be exceeded.

But since this is a personal project, and client use is expected to be sporadic (bursty), and my projects have never been hugely successful, that is not expected to be a huge problem. A more likely problem is that there are times when a smaller number of clients want to make many requests as quickly as possible (for example, a client may need, exceptionally, to make several thousand requests when starting for the first time - it is possible two clients would start at around the same time, so they must cooperate to share the available bandwidth).

Given all the above, what are suitable algorithms for the clients so that they rate-limit appropriately? Note that limited cooperation is possible because the API returns the total number of requests in the last minute for all clients.

Current Solution

My current solution (when the question was written - a better approach is given as an answer) is quite simple. Each client has a record of the time the last call was made and the number of calls made in the last minute, as reported by the API, on that call.

If the number of calls is less than 60 (half the limit) the client does not throttle. This allows for fast bursts of small numbers of requests.

Otherwise (i.e. when there are more previous requests) the client calculates the limiting rate it would need to work at: period = 60 / (120 - number of previous requests), where the period is in seconds, 60 is the number of seconds in a minute and 120 is the maximum number of requests per minute. It then waits until the gap between the previous call and the current time exceeds that period. This effectively throttles the rate so that, if it were acting alone, it would not exceed the limit.
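A minimal Python sketch of the scheme just described, for illustration only. The function name and arguments are hypothetical, and clamping the remaining allowance to at least 1 (to avoid dividing by zero when the window is full) is an assumption that the description above does not cover.

```python
LIMIT = 120       # maximum calls per minute, as stated in the question
FREE_BELOW = 60   # below half the limit the client does not throttle


def wait_before_next_call(used, last_call_time, now):
    """Return the number of seconds to sleep before the next call.

    used           -- calls in the last minute, as reported by the API
                      on the previous call
    last_call_time -- time of the previous call, in seconds
    now            -- current time, in seconds
    """
    if used < FREE_BELOW:
        return 0.0
    # Assumption: treat a full (or over-full) window as one remaining slot.
    remaining = max(1, LIMIT - used)
    period = 60.0 / remaining
    # Wait until the gap since the previous call exceeds the period.
    return max(0.0, last_call_time + period - now)
```

With 90 calls reported, 30 remain, so the period is 2 seconds; with fewer than 60 reported the client proceeds immediately.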

But the above has problems. If you think it through carefully you'll see that for large numbers of requests a single client oscillates and does not reach maximum throughput (this is partly because of the "initial burst" which will suddenly "fall outside the window" and partly because the algorithm does not make full use of its history). And multiple clients will cooperate to an extent, but I doubt that it is optimal.

Better Solutions

I can imagine a better solution that uses the full local history of the client and models other clients with, say, a Hidden Markov Model. So each client would use the API report to model the other (unknown) clients and adjust its rate accordingly.

I can also imagine an algorithm for a single client that progressively transitions from unlimited behaviour for small bursts to optimal, limited behaviour for many requests without introducing oscillations.

Do such approaches exist? Can anyone provide an implementation or reference? Can anyone think of better heuristics?

I imagine this is a known problem somewhere. In what field? Queuing theory?

I also guess (see comments earlier) that there is no optimal solution and that there may be some lore / tradition / accepted heuristic that works well in practice. I would love to know what... At the moment I am struggling to identify a similar problem in known network protocols (I imagine Perlman would have some beautiful solution if so).

I am also interested (to a lesser degree, for future reference if the program becomes popular) in a solution that requires a central server to aid collaboration.

Disclaimer

This question is not intended to be criticism of Echo Nest at all; their service and conditions of use are great. But the more I think about how best to use this, the more complex/interesting it becomes...

Also, each client has a local cache used to avoid repeating calls.

Updates

Possibly relevant paper.

Manque answered 28/8, 2012 at 6:56 Comment(5)

Does the response also tell you how many seconds are remaining in the current minute, or do you need to guess that too? [Edit: actually, my question makes an assumption about the server's rate limit that may be incorrect. Is the limit 120 requests in a 1 minute period, followed by a new 1 minute period, or is the limit 120 requests in any 60-second window?] – Scribe 28/8, 2012 at 7:37

It's not completely clear, but I did some simple tests and I think that the response number is the number of requests in the last 60 seconds. In other words it appears to be a sliding window. So there is no "current minute". But I would guess it's actually implemented somewhere in between (eg perhaps they bin requests in 5 second bins and use that to approximate a sliding window). – Manque 28/8, 2012 at 8:0

OK, I'll assume a sliding window. I don't claim any particular theory behind this, but it seems to me that the thing to do is, after each request, to look at the current use reported by the API response and subtract any requests this client made in the last 60 seconds. That tells you the total use by all other clients; plan this client's use accordingly. Then you have two options: (1) use all the available quota, at risk of starving other clients (especially new ones that come in during a burst on this client) or (2) keep ongoing usage less than 120 and leave space for new clients to start. – Scribe 28/8, 2012 at 8:24

And if you have a central server to aid collaboration then you can use that to allocate quota "fairly" between multiple clients. Basic model could be that the client requests permission to make N requests, and receives permission to make K <= N requests over the next t seconds. Then if desired the server could prioritise clients according to how many requests they want to make and what for. – Scribe 28/8, 2012 at 8:28

sure, but exactly how do you do all that? for example, you don't want to simply consume all remaining use, since that will leave nothing for a new client. and how do you estimate what other clients are using? maybe you should use an average over time, for example? you could write some complicated code that assumes a very specific model, but i would bet pounds to pennies that it would be unstable when the other clients don't follow the model. so my intuition is that there's a heuristic that is more robust... (and probably already known to someone). but most rate limiting is server-based. – Manque 28/8, 2012 at 8:50

The above worked, but was very noisy, and the code was a mess. I am now using a simpler approach:

Make a call
From the response, note the limit and count

Calculate

barrier = now() + 60 / max(1, (limit - count))**greedy

On the next call, wait until barrier

The idea is quite simple: you should wait a length of time that grows as fewer requests are left in that minute. For example, if count is 39 and limit is 40 then you wait an entire minute. But if count is zero then you can make a request soon. The greedy parameter is a trade-off - when greater than 1 the "first" calls are made more quickly, but you are more likely to hit the limit and end up waiting for 60s.
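The barrier calculation could be written roughly as follows. This is a sketch rather than the code linked from the answer; the function names and the optional `now` argument (added so the calculation can be checked without a clock) are hypothetical.

```python
import time


def next_barrier(limit, count, greedy=1.0, now=None):
    """Earliest time (seconds since the epoch) for the next call.

    limit and count come from the previous API response.
    """
    if now is None:
        now = time.time()
    # At least one slot is assumed, so the wait never exceeds 60 seconds.
    remaining = max(1, limit - count)
    # ** binds tighter than /, matching the formula in the answer.
    return now + 60.0 / remaining ** greedy


def wait_until(barrier):
    """Sleep until the barrier has passed (no-op if it already has)."""
    delay = barrier - time.time()
    if delay > 0:
        time.sleep(delay)
```

For limit 120 and count 0, greedy = 1 gives a wait of 0.5 s, while greedy = 2 gives about 0.004 s, which shows how larger values front-load the calls.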

The performance of this is similar to the approach above, and it's much more robust. It is particularly good when clients are "bursty" as the approach above gets confused trying to estimate linear rates, while this will happily let a client "steal" a few rapid requests when demand is low.

Code here.

Manque answered 9/2, 2013 at 16:8 Comment(1)

Did you consider a Token Bucket? (en.wikipedia.org/wiki/Token_bucket) I use this for a bursty client, although in your case, the co-ordination across multiple clients makes the implementation more challenging. – Dissipated 25/8, 2013 at 21:17

After some experimenting, it seems that the most important thing is getting as good an estimate as possible for the upper limit of the current connection rates.

Each client can track their own (local) connection rate using a queue of timestamps. A timestamp is added to the queue on each connection and timestamps older than a minute are discarded. The "long term" (over a minute) average rate is then found from the first and last timestamps and the number of entries (minus one). The "short term" (instantaneous) rate can be found from the times of the last two requests. The upper limit is the maximum of these two values.
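One way the timestamp queue might look in Python is sketched below. The class and method names are hypothetical, rates are in requests per second, and returning 0.0 when there are fewer than two timestamps (or a zero time gap) is an assumption.

```python
from collections import deque


class LocalRate:
    """Tracks this client's own connection rate over a sliding window."""

    def __init__(self, window=60.0):
        self.window = window
        self.stamps = deque()

    def record(self, now):
        """Add a timestamp and discard those older than the window."""
        self.stamps.append(now)
        while self.stamps and self.stamps[0] < now - self.window:
            self.stamps.popleft()

    def long_term(self):
        """Average rate: (entries - 1) over the first-to-last span."""
        if len(self.stamps) < 2:
            return 0.0
        span = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / span if span > 0 else 0.0

    def short_term(self):
        """Instantaneous rate from the last two requests."""
        if len(self.stamps) < 2:
            return 0.0
        gap = self.stamps[-1] - self.stamps[-2]
        return 1.0 / gap if gap > 0 else 0.0

    def upper(self):
        """Upper limit: the maximum of the two estimates."""
        return max(self.long_term(), self.short_term())
```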

Each client can also estimate the external connection rate (from the other clients). The "long term" rate can be found from the number of "used" connections in the last minute, as reported by the server, corrected by the number of local connections (from the queue mentioned above). The "short term" rate can be estimated from the "used" number since the previous request (minus one, for the local connection), scaled by the time difference. Again, the upper limit (maximum of these two values) is used.
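The external estimate might be sketched like this. It reads "the used number since the previous request" as the change in the server-reported count between two consecutive responses, which is an interpretation rather than something the answer states; the function and argument names are hypothetical, and negative estimates are clamped to zero as a further assumption.

```python
def external_rates(used, prev_used, local_count, dt, window=60.0):
    """Estimate the rate (requests/second) of all other clients.

    used        -- server-reported count for the last minute (this response)
    prev_used   -- server-reported count on the previous response
    local_count -- this client's own requests in the last minute
    dt          -- seconds between the two responses
    Returns (long_term, short_term).
    """
    # Long term: reported usage corrected by the local connections.
    long_term = max(0, used - local_count) / window
    # Short term: growth in usage, minus one for our own request,
    # scaled by the time difference.
    if dt > 0:
        short_term = max(0, used - prev_used - 1) / dt
    else:
        short_term = 0.0
    return long_term, short_term


long_term, short_term = external_rates(50, 46, 20, 2.0)
external_upper = max(long_term, short_term)  # the upper limit is used
```

Because the server's count is a sliding window, the short-term figure is also affected by old requests dropping out of the window, so it is only a rough signal.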

Each client computes these two rates (local and external) and then adds them to estimate the upper limit to the total rate of connections to the server. This value is compared with the target rate band, which is currently set to between 80% and 90% of the maximum (0.8 to 0.9 * 120 per minute).

From the difference between the estimated and target rates, each client modifies their own connection rate. This is done by taking the previous delta (time between the last connection and the one before) and scaling it by 1.1 (if the rate exceeds the target) or 0.9 (if the rate is lower than the target). The client then refuses to make a new connection until that scaled delta has passed (by sleeping if a new connection is requested).

Finally, nothing above forces all clients to equally share the bandwidth. So I add an additional 10% to the local rate estimate. This has the effect of preferentially over-estimating the rate for clients that have high rates, which makes them more likely to reduce their rate. In this way the "greedy" clients have a slightly stronger pressure to reduce consumption which, over the long term, appears to be sufficient to keep the distribution of resources balanced.
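The correction step, including the 10% penalty on the local estimate, could be sketched as below. Rates are in requests per second; the names are hypothetical, and leaving the delta unchanged while the estimate sits inside the target band is an assumption about how the band is used.

```python
LIMIT_PER_SEC = 120 / 60.0
TARGET_LOW = 0.8 * LIMIT_PER_SEC    # bottom of the target band
TARGET_HIGH = 0.9 * LIMIT_PER_SEC   # top of the target band


def next_delta(prev_delta, local_rate, external_rate, penalty=1.1):
    """Return the minimum gap (seconds) before the next connection.

    prev_delta    -- time between the last two connections
    local_rate    -- upper-limit estimate of this client's own rate
    external_rate -- upper-limit estimate of the other clients' rate
    penalty       -- over-estimate of the local rate (the extra 10%)
    """
    total = penalty * local_rate + external_rate
    if total > TARGET_HIGH:
        return prev_delta * 1.1   # too fast overall: slow down
    if total < TARGET_LOW:
        return prev_delta * 0.9   # spare capacity: speed up
    return prev_delta             # inside the band: assumed unchanged
```

Since the penalty multiplies only the local rate, a client making most of the requests sees a larger total than a quieter client does, and so backs off first.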

The important insights are:

By taking the maximum of "long term" and "short term" estimates the system is conservative (and more stable) when additional clients start up.
No client knows the total number of clients (unless it is zero or one), but all clients run the same code so can "trust" each other.
Given the above, you can't make "exact" calculations about what rate to use, but you can make a "constant" correction (in this case, +/- 10% factor) depending on the global rate.
The adjustment to the client connection frequency is made to the delta between the last two connections (adjusting based on the average over the whole minute is too slow and leads to oscillations).
Balanced consumption can be achieved by penalising the greedy clients slightly.

In (limited) experiments this works fairly well (even in the worst case of multiple clients starting at once). The main drawbacks are: (1) it doesn't allow for an initial "burst" (which would improve throughput if the server has few clients and a client has only a few requests); (2) the system does still oscillate over ~ a minute (see below); (3) handling a larger number of clients (in the worst case, eg if they all start at once) requires a larger gain (eg 20% correction instead of 10%) which tends to make the system less stable.

[Plot: the "used" amount reported by the (test) server, plotted against time (Unix epoch). This is for four clients (coloured), all trying to consume as much data as possible.]

The oscillations come from the usual source - corrections lag signal. They are damped by (1) using the upper limit of the rates (predicting long term rate from instantaneous value) and (2) using a target band. This is why an answer informed by someone who understands control theory would be appreciated...

It's not clear to me that estimating local and external rates separately is important (it may help if the short term rate for one is high while the long-term rate for the other is high), but I doubt removing it will improve things.

In conclusion: this is all pretty much as I expected, for this kind of approach. It kind-of works, but because it's a simple feedback-based approach it's only stable within a limited range of parameters. I don't know what alternatives might be possible.

Manque answered 28/8, 2012 at 6:56 Comment(0)

Since you're using the Echonest API, why don't you take advantage of the rate limit headers that are returned with every API call?

In general you get 120 requests per minute. There are three headers that can help you self-regulate your API consumption:

X-Ratelimit-Used
X-Ratelimit-Remaining
X-Ratelimit-Limit

(Notice the lower-case 'ell' in 'Ratelimit'--the documentation makes you think it should be capitalized, but in practice it is lower case.)

These counts account for calls made by other processes using your API key.

Pretty neat, huh? Well, I'm afraid there is a rub...

That 120-request-per-minute is really an upper bound. You can't count on it. The documentation states that value can fluctuate according to system load. I've seen it as low as 40ish in some calls I've made, and have in some cases seen it go below zero (I really hope that was a bug in the echonest API!)

One approach you can take is to slow things down once utilization (used divided by limit) reaches a certain threshold. Keep in mind though, that on the next call your limit may have been adjusted downward significantly enough that 'used' is greater than 'limit'.

This works well up to a point. Since the Echonest doesn't adjust the limit in a predictable manner, it is hard to avoid 400s in practice.
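A rough sketch of the utilization-threshold idea, using the header names listed above. The function name, the 0.75 threshold, the 60-second ceiling and the linear ramp between them are all illustrative assumptions, not values from the Echo Nest documentation. The headers are taken as a plain dict here; a real HTTP client may expose them case-insensitively.

```python
def throttle_delay(headers, threshold=0.75, max_delay=60.0):
    """Seconds to pause before the next call, from the rate limit headers."""
    limit = int(headers["X-Ratelimit-Limit"])
    used = int(headers["X-Ratelimit-Used"])
    # The limit can fluctuate, so guard against zero or negative values.
    if limit <= 0:
        return max_delay
    utilization = used / limit
    if utilization < threshold:
        return 0.0
    # Ramp linearly from 0 at the threshold to max_delay at 100%;
    # 'used' above 'limit' is capped at the full delay.
    fraction = min(1.0, (utilization - threshold) / (1.0 - threshold))
    return fraction * max_delay
```

Capping the fraction at 1 covers the case mentioned above where the limit drops between calls and 'used' ends up greater than 'limit'.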

Here are some links that I've found helpful:

http://blog.echonest.com/post/15242456852/managing-your-api-rate-limit
http://developer.echonest.com/docs/v4/#rate-limits

Filterable answered 1/3, 2013 at 20:16 Comment(1)

both the replies i gave use that information. the problem is how you best use it when there are multiple clients that are not communicating. because if one client simply reads until the allowance is exhausted it can starve the others. – Manque 1/3, 2013 at 22:7
