python python-requests api-design

How do I get all data from an API when I don't know the max number of pages?


I am pulling data using an API, Python, and the requests package. I want to pull all the data, but have only been able to pull 4,000 rows. How do I pull all of the data? The number of pages is not present in the response. I don't know how many rows are in the data, but it's more than 4,000.

Here is the working code that can pull 4,000 rows, though some of the details need to remain private:

headers = {
    'accept': '*/*',
    'Authorization': 'Bearer <generated_token_put_here>',
    'Content-Type': 'application/json',
    'verify':'/etc/ssl/certs/ca-certificates.crt'
}

data = '{"pageSize": 2000, "pageNumber": 100}'  # These are the largest pageSize and pageNumber values that will still return data.

response = requests.post('<api_endpoint_put_here>', headers=headers, verify=True, data=data)

Solution

  • When you don't know the total number of pages, a simple method is to request the pages in order until you reach the last one. A page that has fewer rows than pageSize must be the last page. If the total number of rows is an exact multiple of pageSize, the last page will be full, so the request for the page after it should return an empty set of data, which also tells you to stop.

    Here's an implementation of that (you may have to adjust it slightly depending on how the API formats its data):

    import requests
    
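    headers = {
        'accept': '*/*',
        'Authorization': 'Bearer <generated_token_put_here>',
        'Content-Type': 'application/json',
    }
    # Note: 'verify' is an argument to requests.post(), not an HTTP header.
    # To use a specific CA bundle, pass verify='/etc/ssl/certs/ca-certificates.crt' to requests.post().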
    
    page_size = 2000
    page_number = 1
    all_data = []
    
    while True:
    
        # Build the body as a dict and let requests serialize it to JSON
        payload = {"pageSize": page_size, "pageNumber": page_number}
        response = requests.post('<api_endpoint_put_here>', headers=headers, verify=True, json=payload)
        
        if response.status_code != 200:
            print(f"Error: Received status code {response.status_code}")
            break
        
        response_data = response.json()  # Adjust if the response is not JSON
        if not response_data: # The previous page was the last page and had the same number of rows as pageSize
            break
        
        all_data.extend(response_data)  # or something like response_data['items'] if data is nested under 'items'
        
        if len(response_data) < page_size: # you've reached the last page
            break
        
        page_number += 1
    
    print(f"Total rows pulled: {len(all_data)}")
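
If you want to check the stopping logic without calling the real API, one option is to separate the paging loop from the HTTP request. The sketch below is only an illustration: `iter_rows`, `make_fake_fetcher`, and the `fetch_page` callable are hypothetical names, and it assumes that pages are numbered from 1 and that each page comes back as a plain list of rows.

```python
from typing import Any, Callable, Iterator, List


def iter_rows(fetch_page: Callable[[int], List[Any]],
              page_size: int,
              first_page: int = 1) -> Iterator[Any]:
    """Yield rows page by page until a short or empty page is returned."""
    page_number = first_page
    while True:
        rows = fetch_page(page_number)
        if not rows:
            # Empty page: the previous page was full and was the last one
            return
        yield from rows
        if len(rows) < page_size:
            # Short page: this is the last page
            return
        page_number += 1


def make_fake_fetcher(total_rows: int, page_size: int) -> Callable[[int], List[int]]:
    """Hypothetical stand-in for the API: serves slices of an in-memory list."""
    rows = list(range(total_rows))

    def fetch_page(page_number: int) -> List[int]:
        start = (page_number - 1) * page_size  # assumes 1-based page numbers
        return rows[start:start + page_size]

    return fetch_page


all_rows = list(iter_rows(make_fake_fetcher(4500, 2000), page_size=2000))
print(len(all_rows))  # 4500
```

With the real API, `fetch_page` would wrap the `requests.post` call from the loop above and return the list of rows from the response (for example `response.json()`, or a nested key if the API wraps its data).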