python, google-app-engine, google-cloud-storage, google-cloud-data-transfer

Automatically retrieving large files via public HTTP into Google Cloud Storage


For weather processing purposes, I am looking to automatically retrieve daily weather forecast data into Google Cloud Storage.

The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The file size is the main issue.

After looking at previous Stack Overflow topics, I have tried two unsuccessful methods:

1/ First attempt via urlfetch in Google App Engine

    from google.appengine.api import urlfetch

    url = "http://dcpc-nwp.meteo.fr/servic..."
    result = urlfetch.fetch(url)

    [...] # Code to save in a Google Cloud Storage bucket

But I get the following error message on the urlfetch line:

DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
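
For reference, urlfetch.fetch also accepts a deadline parameter, but its 60-second ceiling for regular requests is still far too short for files of this size. A minimal sketch:

    from google.appengine.api import urlfetch

    # Raising the deadline from the default 5 s to the 60 s maximum
    # delays the error but cannot cover a 30-300 MB download.
    result = urlfetch.fetch(url, deadline=60)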

2/ Second attempt via the Storage Transfer Service

According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Storage Transfer Service: https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata

But it requires the size and MD5 hash of each file before the download. This option cannot work in my case because the website does not provide that information.
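
For illustration, the Transfer Service consumes a tab-separated URL list in the TsvHttpData-1.0 format, where each entry must specify the file's byte size and Base64-encoded MD5 up front. The URLs, sizes, and hashes below are placeholders:

    TsvHttpData-1.0
    http://dcpc-nwp.meteo.fr/.../forecast1.grib	52000000	BgxmLYIzhSXRq/DrLLek2g==
    http://dcpc-nwp.meteo.fr/.../forecast2.grib	310000000	wL2TqDZVVL/TYk2tFO9U8A==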

3/ Any ideas?

Do you see any solution for automatically retrieving large files over HTTP into my Cloud Storage bucket?


Solution

  • 3/ Workaround with a Compute Engine instance

    Since it was not possible to retrieve large files from an external HTTP server with App Engine or directly with Cloud Storage, I used a workaround: an always-running Compute Engine instance.

    This instance regularly checks whether new weather files are available, downloads them, and uploads them to a Cloud Storage bucket.
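
    A minimal sketch of the download-and-upload step the instance runs, assuming the requests and google-cloud-storage client libraries are installed (the URL, bucket, and object names are placeholders):

        import requests
        from google.cloud import storage

        def transfer_to_gcs(url, bucket_name, blob_name):
            # Stream the HTTP download so the file is never fully in memory.
            response = requests.get(url, stream=True)
            response.raise_for_status()

            client = storage.Client()
            bucket = client.bucket(bucket_name)
            # Setting a chunk_size forces a resumable upload that streams
            # the data to Cloud Storage in fixed-size chunks.
            blob = bucket.blob(blob_name, chunk_size=10 * 1024 * 1024)
            blob.upload_from_file(response.raw)

        # Hypothetical file URL and bucket name, for illustration only.
        transfer_to_gcs("http://dcpc-nwp.meteo.fr/...",
                        "my-weather-bucket", "forecasts/latest.grib")

    With this approach, even a 300 MB file moves through the instance in small chunks, so a modest machine type is sufficient.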

    For scalability, maintenance, and cost reasons, I would have preferred to use only serverless services, but fortunately: