GitHub API: How to improve very inefficient polling on activity events?
The GitHub API provides activity events for users, orgs, and repos. The API supports pagination up to 10 pages, for a total of 300 events at 30 events per page. Rate limiting is handled via ETag headers. I am trying to poll this API to get the latest activity, but the scheme is very inefficient given the design GitHub supports, as described below. Let's say I make a request for page 1:

https://api.github.com/users/me/events/orgs/my-org?page=1

and I will get an ETag for this page. Now I move on to page 2:

https://api.github.com/users/me/events/orgs/my-org?page=2

and will get the ETag for this second page. Similarly, I can pull events from all 10 supported pages.
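For concreteness, a conditional request against one of these pages might look roughly like this (a minimal sketch in Python using the requests library; the URL is the one from above, and authentication and error handling are omitted):

    import requests

    URL = "https://api.github.com/users/me/events/orgs/my-org"

    def fetch_page(page, etag=None):
        # Send If-None-Match so the server can reply 304 Not Modified
        # when the page is unchanged since the stored ETag.
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(URL, params={"page": page}, headers=headers)
        if resp.status_code == 304:
            return None, etag                        # page unchanged
        resp.raise_for_status()
        return resp.json(), resp.headers.get("ETag")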

Now let's say some activity was performed on my org's GitHub account, and assume only 1 new event occurred. When I poll the API for page 1 with its ETag, it will return the changed page with the new event included. Similarly, polling page 2 with its previous ETag will also return a changed page; the change on page 2, however, is just the event that was previously the last event of page 1 and has now moved to the top of page 2. This "shift-to-next" happens for all the pages. There is no way to find out the number of new events that took place. The only solution is to keep polling page 1 to get the latest events, but that approach has a serious flaw, explained below:

The situation gets worse when the number of new events between my poll rounds is greater than 30 (the maximum number of items on one page). In that case, events older than the latest 30 slip directly onto page 2. If I only poll page 1, I will lose the events that slipped to page 2. The only solution coming to my mind is to keep a cache of all the events and sweep over all the pages. That, however, is a very inefficient and undesirable way to do it and defeats the purpose of an event-notification API.

I hope a GitHub dev can answer this.

Etch answered 25/6, 2013 at 12:04

Since each event has an ID and events are ordered in the response, you only need to remember the ID of the first event in the previous response (not all of the events).

So, the way I would do it is (a code sketch follows the steps):

Initial fetch:

  1. fetch all event pages (pages from 1 to 10)
  2. store the ETag of the first page
  3. store the ID of the first event in the first page

Subsequent fetches:

  1. conditionally fetch the first page of events with the stored ETag
  2. if a 304 Not Modified response is received, there are no new events, so terminate
  3. if a 200 OK response is received, there are new events. Fetch pages 1 to 10 sequentially until you reach the page containing the event whose ID equals the stored ID. All newly fetched events up to (but not including) that event are new and should be processed. The number of new events is thus discovered incrementally, as the result of fetching events only until you hit one you have already seen, and you fetch only the pages you have to, no more.
  4. store the ETag of the first page
  5. store the ID of the first event in the first page
  6. wait for some time and then go to step 1
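A rough translation of these steps into Python (a sketch under assumptions, not a definitive implementation: the requests library again, the org-events URL from the question, no authentication, and handle_event as a hypothetical callback you would supply):

    import time
    import requests

    URL = "https://api.github.com/users/me/events/orgs/my-org"

    def fetch_page(page, etag=None):
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(URL, params={"page": page}, headers=headers)
        if resp.status_code == 304:
            return None                      # unchanged since the stored ETag
        resp.raise_for_status()
        return resp

    def poll(etag, last_seen_id, handle_event):
        # One poll round; returns the (possibly updated) ETag and last-seen ID.
        first = fetch_page(1, etag)
        if first is None:                    # step 2: 304, no new events
            return etag, last_seen_id
        new_etag = first.headers.get("ETag")         # step 4
        events = first.json()
        if not events:
            return new_etag, last_seen_id
        new_last_id = events[0]["id"]                # step 5
        for page in range(1, 11):                    # step 3: pages 1..10
            if page > 1:
                events = fetch_page(page).json()
            for event in events:                     # newest events come first
                if event["id"] == last_seen_id:
                    return new_etag, new_last_id     # hit a known event: done
                handle_event(event)                  # still above it, so it's new
        return new_etag, new_last_id

    etag, last_id = poll(None, None, print)  # initial fetch: processes everything
    while True:
        time.sleep(60)                       # step 6: wait, then poll again
        etag, last_id = poll(etag, last_id, print)

Because the scan stops at the first already-seen ID, a quiet poll round costs exactly one conditional request, and a busy one costs only as many pages as the new events occupy.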
Unbar answered 25/6, 2013 at 15:33. Comments (5):
Thanks for the great answer. One hiccup, though: are the events guaranteed to be in order in the responses? I ask because the event IDs are not incremental. – Etch
They are certainly not incremental, since they are probably global IDs (one global counter for all events, not just the events of the single user/org/repo you are querying about). But the IDs are ordered: events with higher IDs precede events with lower IDs in the JSON array returned as the response. – Unbar
That means I can trust the ordering of the events, and that should solve my problem. Thanks again for the very clear answer. – Etch
Ivan, your answer was brilliant, thanks for that. When fetching different resources from a repo, I noticed that they all come ordered by latest activity. This is not the case for the stargazers API. How could one use the caching technique you describe for that particular API, which is ordered from oldest to newest? – Angulation
Remember that conditional requests (with an ETag) don't count towards your primary rate limit: docs.github.com/en/rest/using-the-rest-api/… There's also a secondary rate limit that needs to be observed, so don't retry failed requests more often than a minute apart, and adhere to the other guidelines from docs.github.com/en/rest/using-the-rest-api/…. – Semifinal
