python  google-cloud-platform  scrapy  google-cloud-storage  scrapinghub

"'str' object has no attribute 'get'" when using Google Cloud Storage with ScrapingHub


I'm trying to get Google Cloud Storage working with a Scrapy Cloud + Crawlera project so that I can save the text files I download. When I run my script I get the error below, which seems to be caused by my Google permissions not working properly.

Error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/media.py", line 68, in from_crawler
    pipe = cls.from_settings(crawler.settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/files.py", line 325, in from_settings
    return cls(store_uri, settings=settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/files.py", line 289, in __init__
    self.store = self._get_store(store_uri)
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/files.py", line 333, in _get_store
    return store_cls(uri)
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/files.py", line 217, in __init__
    client = storage.Client(project=self.GCS_PROJECT_ID)
  File "/app/python/lib/python3.7/site-packages/google/cloud/storage/client.py", line 82, in __init__
    project=project, credentials=credentials, _http=_http
  File "/app/python/lib/python3.7/site-packages/google/cloud/client.py", line 228, in __init__
    Client.__init__(self, credentials=credentials, _http=_http)
  File "/app/python/lib/python3.7/site-packages/google/cloud/client.py", line 133, in __init__
    credentials, _ = google.auth.default()
  File "/app/python/lib/python3.7/site-packages/google/auth/_default.py", line 305, in default
    credentials, project_id = checker()
  File "/app/python/lib/python3.7/site-packages/google/auth/_default.py", line 165, in _get_explicit_environ_credentials
    os.environ[environment_vars.CREDENTIALS])
  File "/app/python/lib/python3.7/site-packages/google/auth/_default.py", line 102, in _load_credentials_from_file
    credential_type = info.get('type')
AttributeError: 'str' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 140, in create_instance
    return objcls.from_crawler(crawler, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/pipelines/media.py", line 70, in from_crawler
    pipe = cls()
TypeError: __init__() missing 1 required positional argument: 'store_uri'

__init__.py where I create the credentials file:

# Code from https://medium.com/@rutger_93697/i-thought-this-solution-was-somewhat-complex-3e8bc91f83f8

import os
import json
import pkgutil
import logging

path = "{}/google-cloud-storage-credentials.json".format(os.getcwd())

credentials_content = '<escaped JSON data>'

with open(path, "w") as text_file:
    text_file.write(json.dumps(credentials_content))

logging.warning("Path to credentials: %s" % path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path

settings.py:

BOT_NAME = 'get_case_urls'

SPIDER_MODULES = ['get_case_urls.spiders']
NEWSPIDER_MODULE = 'get_case_urls.spiders'


# Obey robots.txt rules
ROBOTSTXT_OBEY = True


# Crawlera
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<crawlera-api-key>'
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600


ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 500
}
FILES_STORE = 'gs://<name-of-my-gcs-project>'
IMAGES_STORE = 'gs://<name-of-my-gcs-project>'
GCS_PROJECT_ID = "<id-of-my-gcs-project>" 

Solution

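  • After looking at the code for _load_credentials_from_file, I realized that I had not written the JSON to the credentials file correctly. credentials_content is already a JSON string, so calling json.dumps on it encodes it a second time, and the file ends up containing one quoted JSON string instead of a JSON object. When the file is parsed, the result is a str rather than a dict, which is why info.get('type') raises the AttributeError. In __init__.py, rather than text_file.write(json.dumps(credentials_content)), I should have had text_file.write(credentials_content) or text_file.write(json.dumps(json.loads(credentials_content))). The second traceback (the TypeError about store_uri) is raised while handling the first exception, so it should go away with the same fix.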
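
The difference between the two writes can be reproduced without any Google credentials. The snippet below is only a minimal sketch: the credentials string is a fake placeholder, write_credentials and load_info are hypothetical helper names, and it assumes the credentials file is parsed with json.load, which is consistent with info being a str in the traceback.

```python
import json
import os
import tempfile

# Fake placeholder standing in for the real service-account JSON string.
credentials_content = '{"type": "service_account", "project_id": "example-project"}'


def write_credentials(content, path, double_encode=False):
    """Write the credentials string to path.

    double_encode=True reproduces the original bug (json.dumps on a str).
    """
    with open(path, "w") as text_file:
        text_file.write(json.dumps(content) if double_encode else content)


def load_info(path):
    """Parse the file as JSON (assumed to mirror what the auth library does)."""
    with open(path) as file_obj:
        return json.load(file_obj)


tmp_dir = tempfile.mkdtemp()
bad_path = os.path.join(tmp_dir, "bad-credentials.json")
good_path = os.path.join(tmp_dir, "good-credentials.json")

write_credentials(credentials_content, bad_path, double_encode=True)
write_credentials(credentials_content, good_path)

bad_info = load_info(bad_path)    # a str: calling .get on it raises AttributeError
good_info = load_info(good_path)  # a dict: .get('type') works

print(type(bad_info).__name__)
print(type(good_info).__name__)
print(good_info.get("type"))
```

If the credentials are kept as a Python dict instead of a string, json.dumps is the right call, since it then encodes the object exactly once.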