python google-drive-api google-oauth google-developers-console pydrive

Downloading files from a public Google Drive in Python: scoping issues?


Using my answer to my earlier question on how to download files from a public Google Drive, I was previously able to download images by their IDs from a Python script with the Google Drive API v3, using the following block of code:

from google_auth_oauthlib.flow import Flow, InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload
from google.auth.transport.requests import Request
import io
import re
SCOPES = ['https://www.googleapis.com/auth/drive']
CLIENT_SECRET_FILE = "myjson.json"
authorized_port = 6006 # authorize URI redirect on the console
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
cred = flow.run_local_server(port=authorized_port)
drive_service = build("drive", "v3", credentials=cred)
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    request = drive_service.files().get_media(fileId=file_id)
    fh = io.FileIO(f"file_{i}", mode='wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))

In the meantime I discovered pydrive and pydrive2, two wrappers around the Google Drive API v2 that make useful things such as listing the files in a folder easy, and that let me do the same thing with a lighter syntax:

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import io
import re
CLIENT_SECRET_FILE = "client_secrets.json"

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    file_handle = drive.CreateFile({'id': file_id})
    file_handle.GetContentFile(f"file_{i}")

However, whether I use pydrive or the raw API, I can no longer download the same files. Instead I get:

googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/drive/v3/files/fileID?alt=media returned "File not found: fileID.". Details: "[{'domain': 'global', 'reason': 'notFound', 'message': 'File not found: fileID.', 'locationType': 'parameter', 'location': 'fileId'}]">

I have tried everything I could think of, including registering 3 different apps in the Google console. It might (or might not) be a question of scoping (see for instance this answer, about apps that only have access to files in my own Google Drive or files created by the app). However, I did not have this issue before (last year).

In the Google console, explicitly adding https://www.googleapis.com/auth/drive as a scope requires filling in a ton of fields: the application's website, terms of use, privacy policy, authorized domains, and YouTube videos explaining the app. However, I will be the sole user of this script, so the only scopes I could add explicitly were:

/auth/drive.appdata
/auth/drive.file
/auth/drive.install

Is it because of scoping? Is there a solution that doesn't require creating a homepage and a YouTube video?

EDIT 1: Here is an example of links_to_download:

links_to_download = ["https://drive.google.com/file/d/fileID/view?usp=drivesdk&resourcekey=0-resourceKeyValue"]

EDIT 2: It is very unstable: sometimes it works without a hitch, sometimes it doesn't. When I relaunch the script multiple times I get different results. Retry policies help to a certain extent, but sometimes it fails repeatedly for hours.


Solution

  • This is caused by the security update Google released a few months ago. It makes link sharing stricter: for newer links you need the resource key, in addition to the fileId, to access the file.

    As per the documentation, for newer links you need to pass the resource key in the X-Goog-Drive-Resource-Keys request header, in the form fileId1/resourceKey1.

    If you apply this change in your code, it will work as before. Example edit below:

    # File IDs can also contain '_' and '-', so allow them in the character class
    regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9_-]+"
    regex_rkey = "(?<=resourcekey=)[a-zA-Z0-9-]+"
    for i, l in enumerate(links_to_download):
        url = l
        file_id = re.search(regex, url)[0]
        # None for older links, which carry no resource key
        rkey_match = re.search(regex_rkey, url)
        request = drive_service.files().get_media(fileId=file_id)
        if rkey_match:
            # Newer links: send the resource key together with the file ID
            request.headers["X-Goog-Drive-Resource-Keys"] = f"{file_id}/{rkey_match[0]}"
        fh = io.FileIO(f"file_{i}", mode='wb')
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print("Download %d%%." % int(status.progress() * 100))
    

    I put the resource-key regex together quickly, so I cannot be sure it covers every case, but it shows the fix. Keep in mind that older links carry no resource key, so your code has to handle both old and new links.
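
As a rough sketch of handling old and new links in one place, the parsing could be pulled into a small helper that reads resourcekey from the query string instead of relying on a hand-written regex. The names parse_drive_link and resource_key_headers are made up for this example, and it assumes that file IDs consist only of letters, digits, '_' and '-':

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: Drive file IDs are made of letters, digits, '_' and '-'.
FILE_ID_RE = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def parse_drive_link(url):
    """Return (file_id, resource_key); resource_key is None for older links."""
    parsed = urlparse(url)
    match = FILE_ID_RE.search(parsed.path)
    if match is None:
        raise ValueError(f"No Drive file ID found in {url!r}")
    # parse_qs maps each query parameter to a list of values
    resource_key = parse_qs(parsed.query).get("resourcekey", [None])[0]
    return match.group(1), resource_key


def resource_key_headers(url):
    """Return (file_id, headers); headers is empty when no resource key is present."""
    file_id, resource_key = parse_drive_link(url)
    if resource_key is None:
        return file_id, {}
    return file_id, {"X-Goog-Drive-Resource-Keys": f"{file_id}/{resource_key}"}
```

In the download loop you would then call file_id, headers = resource_key_headers(url), build the request with get_media(fileId=file_id), and apply request.headers.update(headers) before creating the MediaIoBaseDownload.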