How to loop through a paginated API using Python
I need to retrieve the 500 most popular films from a REST API, but results are limited to 20 per page and I can only make 40 calls every 10 seconds (https://developers.themoviedb.org/3/getting-started/request-rate-limiting). I can't work out how to loop through the paginated results dynamically so that the 500 most popular films end up in a single list.

I can successfully return the top 20 most popular films (see below) and enumerate each film's position, but I am stuck on the loop that pages through the top 500 without tripping the API rate limit.

import requests  # to make TMDB API calls

# Discover API URL filtered to movies >= 2004 containing Drama genre_id 18
discover_api = ('https://api.themoviedb.org/3/discover/movie?'
                'api_key=<my_api_key>&language=en-US&sort_by=popularity.desc'
                '&include_adult=false&include_video=false'
                '&primary_release_year=>%3D2004&with_genres=18')

# Returning all drama films >= 2004 in popularity desc
discover_api = requests.get(discover_api).json()

most_popular_films = discover_api['results']

# printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])


Sample response:

{
  "page": 1,
  "total_results": 101685,
  "total_pages": 5085,
  "results": [
    {
      "vote_count": 13,
      "id": 280960,
      "video": false,
      "vote_average": 5.2,
      "title": "Catarina and the others",
      "popularity": 130.491,
      "poster_path": "/kZMCbp0o46Tsg43omSHNHJKNTx9.jpg",
      "original_language": "pt",
      "original_title": "Catarina e os Outros",
      "genre_ids": [
        18,
        9648
      ],
      "backdrop_path": "/9nDiMhvL3FtaWMsvvvzQIuq276X.jpg",
      "adult": false,
      "overview": "Outside, the first sun rays break the dawn.  Sixteen years old Catarina can't fall asleep.  Inconsequently, in the big city adults are moved by desire...  Catarina found she is HIV positive. She wants to drag everyone else along.",
      "release_date": "2011-03-01"
    },
    {
      "vote_count": 9,
      "id": 531309,
      "video": false,
      "vote_average": 4.6,
      "title": "Brightburn",
      "popularity": 127.582,
      "poster_path": "/roslEbKdY0WSgYaB5KXvPKY0bXS.jpg",
      "original_language": "en",
      "original_title": "Brightburn",
      "genre_ids": [
        27,
        878,
        18,
        53
      ],

I need the Python loop to append the paginated results to a single list until I have captured the 500 most popular films.


Desired Output:

Movie_ID  Movie_Title
280960    Catarina and the others
531309    Brightburn
438650    Cold Pursuit
537915    After
50465     Glass
457799    Extremely Wicked, Shockingly Evil and Vile

Claus answered 19/5, 2019 at 8:29 Comment(4)
Doesn't the API response contain a next url field?Keown
It depends on the API. As the response includes some pagination fields ("page": 1, "total_pages": 5085, ...) I would expect it to accept a page=n parameter.Fiance
@AdamGold, as far as I can tell there is no next url field, but there is a "page" parameter, which takes an integer value and defaults to 1 if no value is given. That being said, I think your first solution could work, but there doesn't seem to be any logic to limit the loop to the top 'n' (500 in my case) results.Claus
@SergeBallesta, the API will accept a `page=n` parameter, but I also need to limit the results within those pages to the top 'n' (500 results in this case). I don't necessarily want to loop through all the results, just until I have reached the top 500 results; I also need to ensure I can handle the API rate limit of 40 requests every 10 seconds.Claus

Most APIs include a next_url field to help you loop through all results. Let's examine some cases.

1. No next_url field

You can just loop through all pages until the results field is empty:

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = ('https://api.themoviedb.org/3/discover/movie?'
                    'api_key=<my_api_key>&language=en-US&sort_by=popularity.desc'
                    '&include_adult=false&include_video=false'
                    '&primary_release_year=>%3D2004&with_genres=18')

most_popular_films = []
new_results = True
page = 1
while new_results:
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    new_results = discover_api.get("results", [])
    most_popular_films.extend(new_results)
    page += 1

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
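Since only the top 500 films are needed, the paging loop can also stop early instead of walking all 5,085 pages. A minimal sketch of that cap, with the HTTP call factored out into a hypothetical `fetch_page` callable (standing in for the `requests.get(...).json().get("results", [])` step above) so the stopping logic is easy to test:

```python
TOP_N = 500  # stop once this many films have been collected

def collect_top(fetch_page, top_n=TOP_N):
    """fetch_page(page) returns one page of results (an empty list when exhausted)."""
    films = []
    page = 1
    while len(films) < top_n:
        results = fetch_page(page)
        if not results:
            break  # ran out of pages before reaching top_n
        films.extend(results)
        page += 1
    return films[:top_n]  # trim any overshoot from the final page
```

With TMDB's 20 results per page this makes at most 25 requests instead of thousands.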

2. Depend on total_pages field

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = ('https://api.themoviedb.org/3/discover/movie?'
                    'api_key=<my_api_key>&language=en-US&sort_by=popularity.desc'
                    '&include_adult=false&include_video=false'
                    '&primary_release_year=>%3D2004&with_genres=18')

discover_api = requests.get(discover_api_url).json()
most_popular_films = discover_api["results"]
for page in range(2, discover_api["total_pages"]+1):
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
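Neither variant yet respects the 40-requests-per-10-seconds limit the question mentions. One common approach is a small sliding-window rate limiter that sleeps only when the window is full; this is a generic sketch, not TMDB-specific, with the constants taken from the question:

```python
import time

MAX_CALLS = 40   # TMDB allows 40 requests...
WINDOW = 10.0    # ...per 10-second window

class RateLimiter:
    """Block just long enough to keep at most max_calls requests per window seconds."""
    def __init__(self, max_calls=MAX_CALLS, window=WINDOW):
        self.max_calls = max_calls
        self.window = window
        self.calls = []  # monotonic timestamps of recent requests

    def wait(self):
        now = time.monotonic()
        # forget timestamps that have fallen out of the window
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call in the window expires
            time.sleep(self.window - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Call `limiter.wait()` immediately before each `requests.get(...)` inside the paging loop.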

3. next_url field exists! Yay!

Same idea, only now we check whether the next_url field is empty; if it is, we've reached the last page.

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api = ('https://api.themoviedb.org/3/discover/movie?'
                'api_key=<my_api_key>&language=en-US&sort_by=popularity.desc'
                '&include_adult=false&include_video=false'
                '&primary_release_year=>%3D2004&with_genres=18')

discover_api = requests.get(discover_api).json()
most_popular_films = discover_api["results"]
while discover_api.get("next_url"):
    discover_api = requests.get(discover_api["next_url"]).json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
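When a next_url field is available, the pattern generalizes nicely to a generator that yields results until the field runs out. Note that TMDB's Discover response shown in the question does not actually include next_url, so this is a hypothetical sketch; `fetch` stands in for `requests.get(url).json()`:

```python
def paginate(fetch, first_url):
    """Yield every result, following next_url from page to page."""
    url = first_url
    while url:
        payload = fetch(url)               # stand-in for requests.get(url).json()
        yield from payload.get("results", [])
        url = payload.get("next_url")      # falsy or missing means last page
```

Because the generator is lazy, it pairs well with `itertools.islice(paginate(...), 500)` to stop at exactly 500 films.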
Keown answered 19/5, 2019 at 8:42 Comment(8)
I attempted to use your first solution but received an error (see below). Also, I would like to limit the loop to only return the top 'n' results rather than looping through everything and potentially returning more results than I need if possible.Claus
What's the received error? BTW I would go with solution #2. It's more elegant. If you want to limit the number of results, get the first N pages (range(2, 26) to get the first 25 pages), or check to see if most_popular_films length is more than N.Keown
TypeError Traceback (most recent call last) <ipython-input-124-ccc838234b24> in <module> 3 page = 1 4 while new_results: ----> 5 discover_api = requests.get(discover_api + f"&page={page}").json() 6 new_results = discover_api["results"] 7 most_popular_films.extend(new_results) TypeError: unsupported operand type(s) for +: 'dict' and 'str'Claus
I used your updated response for solution #1 and received another error message: KeyError Traceback (most recent call last) <ipython-input-143-d80540966b40> in <module> 4 while new_results: 5 discover_api = requests.get(discover_api_url + f"&page={page}").json() ----> 6 new_results = discover_api["results"] 7 most_popular_films.extend(new_results) 8 page += 1 KeyError: 'results'Claus
Fixed and updated. Again, I'd like to recommend you to use the 2nd solution.Keown
the API response says there are 5087 total pages and 101729 total results. Wouldn't it make sense to add logic to the while loop to say while pages >= 50:, so we don't loop through all the pages? Either way, wouldn't I need logic that handles the API rate limit of 40 calls per 10 seconds?Claus
Let us continue this discussion in chat.Claus
About the first solution, I would recommend against having new_results be a boolean first and a list next...Countermine
