I believe the recommended way to get the contents of a large file stored on GitHub is the REST API. For files between 1 MB and 100 MB, it is only possible to get the raw contents (as a string).
I need to write this content to a file. If I use the pygithub package, I get exactly what I need (bytes, and the response object has an encoding field whose value is base64). Unfortunately, this package does not work for files larger than 1 MB.
So it seems that I only need to find the correct way to convert the string to bytes. There are many ways to do it; I have tried 4 so far, and none of them matches the output of the pygithub package. See the output for several guinea pig files below. How do I do the conversion correctly?
from github import Github, ContentFile
import requests
from requests.structures import CaseInsensitiveDict
import base64
token = ...
repo_name = ...
owner = ...
filename = ...
# pygithub method
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_obj = repo.get_contents(filename)
print('encoding', cont_obj.encoding) # prints base64
content_ref = cont_obj.decoded_content # this works correctly for <1MB files
#REST API method
url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"
headers = CaseInsensitiveDict()
headers["Accept"] = "application/vnd.github.v3.raw"
headers["Authorization"] = f"Bearer {token}"
headers["X-GitHub-Api-Version"] = "2022-11-28"
contents_str = requests.get(url, headers=headers).text
contents = []
# https://stackoverflow.com/questions/72037211/how-to-convert-a-base64-file-to-bytes
contents.append(base64.b64decode(contents_str.encode() + b'=='))
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
contents.append(bytes(contents_str, encoding="raw_unicode_escape"))
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
message_bytes = contents_str.encode('utf-8')
contents.append(base64.b64encode(message_bytes))
#contents.append(base64.decodebytes(message_bytes + b'==')) same as method 0
print(type(content_ref), len(content_ref), content_ref[:50])
for i, c in enumerate(contents):
    print(i, type(c), len(c), c[:50])
The output for the guinea pig files:
For the text file that contains tiny text, the output shows that all methods except method 1 are incorrect:
<class 'bytes'> 10 b'tiny text\n'
0 <class 'bytes'> 6 b'\xb6)\xf2\xb5\xecm'
1 <class 'bytes'> 10 b'tiny text\n'
2 <class 'bytes'> 16 b'dGlueSB0ZXh0Cg=='
For this PDF file, the output of method 1 is slightly longer, and the contents are slightly different:
<class 'bytes'> 3028 b'%PDF-1.3\r\n%\xe2\xe3\xcf\xd3\r\n\r\n1 0 obj\r\n<<\r\n/Type /Catalog\r\n/O'
0 <class 'bytes'> 1504 b'<1u\xdf](n?\xd3\xca\x97\xbf\t\xabZ\x96\x88?:\xebe\x8aw\xac\xdbD\x7f=\xa8\x1e\xb3}\x11zwhn=\xb4\xa1\xb8\xffO*^\xfc\xeb\xad\x96)'
1 <class 'bytes'> 3048 b'%PDF-1.3\r\n%\ufffd\ufffd\ufffd\ufffd\r\n\r\n1 0 obj\r\n<<'
2 <class 'bytes'> 4048 b'JVBERi0xLjMNCiXvv73vv73vv73vv70NCg0KMSAwIG9iag0KPD'
For this image, the size and the contents of output 1 are very different:
<class 'bytes'> 57270 b'GIF89a\xfa\x00)\x01\xe7\xff\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"\x1b\x12&\x18\x18&!\x17%!\x1b* \x1d,'
0 <class 'bytes'> 329 b'\x18\x81|\xf5\xa17\xebo\xb6\xf3^\x1b\x03]\x02\xdb\xcd\x04\xfc\x8e\xc7\xd8\xb1D\x0c\xa36\xe8\xd3\x00\xbf\x9e\x94\xf5$\xbcT\x04D\xf9\x11\xfa_U\x14\x05\xd5\xfce'
1 <class 'bytes'> 169079 b'GIF89a\ufffd\x00)\x01\ufffd\ufffd\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"'
2 <class 'bytes'> 132656 b'R0lGODlh77+9ACkB77+977+9AAYJDQ8KCBkNDCAOCiwSCyIYFx'
You can use the content attribute of the requests response to get the body as raw bytes; unlike the text attribute, it is not decoded into a string first. This gives you a bytes object that you can write to a file, for example, as long as the file is opened in binary mode.
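To illustrate why going through a string loses data, here is a small sketch that uses the first bytes of the GIF output from the question. It assumes the body was decoded as UTF-8 with undecodable bytes replaced, which is consistent with the \ufffd characters in the question's output; the variable names are only illustrative.

```python
# First bytes of the GIF file, copied from the pygithub output in the question.
original = b"GIF89a\xfa\x00)\x01\xe7\xff\x00"

# Assumption: the text was produced by decoding as UTF-8 and replacing
# every undecodable byte with U+FFFD (the replacement character).
as_text = original.decode("utf-8", errors="replace")

# Encoding the string again cannot restore the original bytes.
# With UTF-8, each U+FFFD becomes the three bytes EF BF BD.
utf8_bytes = as_text.encode("utf-8")

# With raw_unicode_escape (method 1 in the question), each U+FFFD
# becomes the six ASCII characters \ufffd.
escaped_bytes = as_text.encode("raw_unicode_escape")

print(len(original), len(utf8_bytes), len(escaped_bytes))
print(utf8_bytes == original, escaped_bytes == original)
```

The original byte values are gone once the replacement has happened, so no string-to-bytes conversion can recover them. This would also explain why the converted output for binary files is longer than the reference.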
The following code is a very simple example using a .png file from one of my public repos:
import requests
url = 'https://raw.githubusercontent.com/euribates/notes/main/docs/blender/developers-preference.png'
content_in_bytes = requests.get(url).content
assert type(content_in_bytes) is bytes
with open('image.png', 'wb') as f_out:
    f_out.write(content_in_bytes)
Please note that GitHub uses a different host name (raw.githubusercontent.com) to serve the raw content.
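The same idea should also apply to the REST API request from the question: keep the URL and headers, but read content instead of text. The following is a minimal sketch under that assumption; build_raw_request and download_raw are hypothetical helper names, and the header values are copied from the question.

```python
import requests


def build_raw_request(owner, repo_name, filename, token):
    """Return the URL and headers used in the question's REST API call."""
    url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"
    headers = {
        "Accept": "application/vnd.github.v3.raw",  # ask for the raw file body
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    return url, headers


def download_raw(owner, repo_name, filename, token, out_path):
    """Save the raw file body to out_path and return the number of bytes."""
    url, headers = build_raw_request(owner, repo_name, filename, token)
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    data = response.content      # bytes, never decoded to str
    with open(out_path, "wb") as f_out:
        f_out.write(data)
    return len(data)
```

Calling download_raw(owner, repo_name, filename, token, 'image.gif') with the values from the question would then write the file without any string conversion in between.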
Hope this helps.