I understand that the GitHub Search API returns at most 1000 results, with up to 100 results per page. I therefore wrote the following to page through all 1000 results of a code search for the string torch:
import requests

for i in range(1, 11):
    url = "https://api.github.com/search/code?q=torch +in:file + language:python&per_page=100&page=" + str(i)
    headers = {
        'Authorization': 'xxxxxxxx'
    }
    response = requests.request("GET", url, headers=headers).json()
    try:
        print(len(response['items']))
    except:
        print("response = ", response)
Here is the output -
15
62
response = {'documentation_url': 'https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits', 'message': 'You have exceeded a secondary rate limit. Please wait a few minutes before you try again.'}
Is there an efficient way to get all 1000 results from the Search API?
There are two things happening here:
The search API has different rate limits. See the GitHub Documentation:
The REST API for searching items has a custom rate limit that is separate from the rate limit governing the other REST API endpoints.
I would recommend requesting fewer results per page to work around the incomplete results.
You will also need to be very deliberate about the requests you're making, because the limits are low. Getting the full 1000 may be impossible without requesting a rate increase or implementing a very long backoff.
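As a side note, the raw spaces in your query string are fragile; letting urllib.parse.urlencode (or the params argument to requests) escape the query keeps the URL well-formed. A minimal sketch, assuming the same search qualifiers you used:

```python
from urllib.parse import urlencode

# Build the code-search URL with a properly escaped query string.
params = {
    "q": "torch in:file language:python",
    "per_page": 100,
    "page": 1,
}
url = "https://api.github.com/search/code?" + urlencode(params)
print(url)
```

The `in:` and `language:` qualifiers survive the encoding (the colon becomes %3A), and the GitHub API decodes them back on its side.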
I modified your code to add a primitive exponential backoff, but this still doesn't produce the full 1000 results and takes a while:
import requests
import time

headers = {
    'Authorization': 'token <TOKEN>'
}

results = []

for i in range(1, 31):
    url = "https://api.github.com/search/code?q=torch +in:file + language:python&per_page=33&page=" + str(i)
    backoff = 2  # backoff in seconds
    while backoff < 1024:
        time.sleep(backoff)
        try:
            response = requests.request("GET", url, headers=headers)
            response.raise_for_status()  # raise an exception for HTTP 4xx and 5xx
            data = response.json()
            results.extend(data['items'])  # collect items from every page in one flat list
            print(f'Got {len(data["items"])} results for page {i}.')
            break
        except requests.exceptions.RequestException as e:
            print('ERROR: Failed to make request: ', e)
            backoff **= 2  # squaring: 2 -> 4 -> 16 -> 256, then give up
    if backoff >= 1024:
        print('ERROR: Backoff limit reached.')
        break
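GitHub's secondary-rate-limit responses may also include a Retry-After header, and honoring it is usually faster than blind exponential backoff. A small helper along these lines could replace the fixed sleep (the function name and fallback value are my own, not from the GitHub docs):

```python
def retry_delay(headers, fallback=60):
    """Return the server-suggested Retry-After delay in seconds,
    falling back to a default when the header is absent or malformed."""
    try:
        return int(headers.get("Retry-After", fallback))
    except (TypeError, ValueError):
        return fallback

# Usage with a requests response after a 403/429:
#     time.sleep(retry_delay(response.headers))
```

That way you wait exactly as long as the server asks, instead of sleeping for up to 256 seconds on every retry.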