python, github, encoding, character-encoding, pygithub

How do I convert the raw contents of a large file stored on GitHub to a correct bytes array?


I believe the recommended way to get the contents of a large file stored on GitHub is the REST API. For files between 1 MB and 100 MB, it is only possible to get the raw contents (in string format).

I need to write this content to a file. If I use the pygithub package, I get exactly what I need (bytes, and the response object has an encoding field whose value is base64). Unfortunately, this package does not work for files larger than 1 MB.

So it seems I only need to find the correct way to convert the string to bytes. There are many ways to do it; I have tried 4 so far, and none of them matches the output of the pygithub package. See the output for several guinea pig files below. How do I do the conversion correctly?

from github import Github, ContentFile
import requests
from requests.structures import CaseInsensitiveDict
import base64

token = ...
repo_name = ...
owner = ...
filename = ...
   
# pygithub method
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_obj = repo.get_contents(filename)
print('encoding', cont_obj.encoding) # prints base64
content_ref = cont_obj.decoded_content # this works correctly for <1MB files

#REST API method
url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"

headers = CaseInsensitiveDict()
headers["Accept"] = "application/vnd.github.v3.raw"
headers["Authorization"] = f"Bearer {token}"
headers["X-GitHub-Api-Version"] = "2022-11-28"
contents_str = requests.get(url, headers=headers).text

contents = []
# https://stackoverflow.com/questions/72037211/how-to-convert-a-base64-file-to-bytes
contents.append(base64.b64decode(contents_str.encode() + b'=='))

# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
contents.append(bytes(contents_str, encoding="raw_unicode_escape"))

# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
message_bytes = contents_str.encode('utf-8')
contents.append(base64.b64encode(message_bytes))
#contents.append(base64.decodebytes(message_bytes + b'==')) same as method 0

print(type(content_ref), len(content_ref), content_ref[:50])
for i, c in enumerate(contents):
  print(i, type(c), len(c), c[:50])

The output of the guinea pig files:


Solution

  • You can use the content attribute of the requests response to get the body as raw bytes. You can then save those bytes to a file, for example, as long as the file is opened in binary mode.

    The following code is a very simple example using a .png file from one of my public repos:

    import requests
    
    url = 'https://raw.githubusercontent.com/euribates/notes/main/docs/blender/developers-preference.png'
    content_in_bytes = requests.get(url).content
    assert type(content_in_bytes) is bytes
    with open('image.png', 'wb') as f_out:
        f_out.write(content_in_bytes)
    

    Please note that GitHub serves raw content from a different host name (raw.githubusercontent.com).
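The same idea should carry over to the REST API call from the question: keep the URL and headers, but read bytes instead of `.text`. The following is a minimal sketch rather than a tested recipe for every repository. The helper names `contents_url`, `raw_headers`, and `download_raw`, the chunk size, and the timeout are illustrative assumptions; the header values are copied from the question. It streams the response so that a large file is not held in memory all at once.

```python
import requests

API_ROOT = "https://api.github.com"


def contents_url(owner, repo_name, filename):
    """Build the contents endpoint URL used in the question."""
    return f"{API_ROOT}/repos/{owner}/{repo_name}/contents/{filename}"


def raw_headers(token):
    """Headers from the question: ask for the raw media type."""
    return {
        "Accept": "application/vnd.github.v3.raw",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }


def download_raw(owner, repo_name, filename, token, dest, chunk_size=1 << 20):
    """Stream the raw file to `dest` and return the number of bytes written.

    chunk_size (1 MiB) and the 60 s timeout are arbitrary choices for this sketch.
    """
    url = contents_url(owner, repo_name, filename)
    written = 0
    with requests.get(url, headers=raw_headers(token), stream=True, timeout=60) as resp:
        resp.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error body
        with open(dest, "wb") as f_out:  # binary mode, as in the example above
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f_out.write(chunk)  # chunks are bytes; no decoding step is involved
                written += len(chunk)
    return written
```

Assuming `owner`, `repo_name`, `filename`, and `token` are defined as in the question, `download_raw(owner, repo_name, filename, token, "output.bin")` would write the file to disk and return its size in bytes. For a file that fits comfortably in memory, `requests.get(url, headers=raw_headers(token)).content` is the shorter equivalent.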

    Hope this helps.
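As for why none of the conversions in the question reproduce the pygithub output: with the raw media type the response body is the file itself rather than base64 text, so no base64 step applies, and `.text` has already decoded the body into a string, which is lossy for binary data. The snippet below is a small offline illustration of that loss, assuming the decoder falls back to UTF-8 and replaces undecodable bytes; the sample bytes are the standard 8-byte PNG signature.

```python
# The first 8 bytes of every PNG file; 0x89 is not a valid UTF-8 start byte.
original = b"\x89PNG\r\n\x1a\n"

# Roughly what happens when a binary body is turned into text:
# the undecodable byte is replaced with U+FFFD (the replacement character).
as_text = original.decode("utf-8", errors="replace")

# Encoding the string again cannot bring the original byte back.
round_trip = as_text.encode("utf-8")

print(original)                         # b'\x89PNG\r\n\x1a\n'
print(round_trip)                       # b'\xef\xbf\xbdPNG\r\n\x1a\n'
print(len(original), len(round_trip))   # 8 10
print(round_trip == original)           # False
```

Because the information is already gone once the body has been decoded, no choice of `encode` or `base64` call on the string can restore it; the bytes have to be taken from the response before any decoding happens.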