ckandata-harvest

Harvesters using the DCAT extension get stuck


We've been using ckanext-dcat to harvest from remote JSON sources. Sometimes a harvest job didn't finish and had to be deleted along with all the datasets from that source. That is not very convenient, but afterwards everything goes back to normal. I don't know if there is a way to delete just a single job.

But now I get this in the gather consumer log:

    Traceback (most recent call last):
      File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
        load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
        invoke(command, command_name, options, args[1:])
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
        exit_code = runner.run(args)
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
        result = self.command()
      File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 129, in command
        gather_callback(consumer, method, header, body)
      File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 219, in gather_callback
        harvest_object_ids = harvester.gather_stage(job)
      File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 186, in gather_stage
        content = self._get_content(url, harvest_job, page)
      File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 66, in _get_content
        cl = r.headers['content-length']
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/requests/structures.py", line 54, in __getitem__
        return self._store[key.lower()][1]
    KeyError: 'content-length'

The job finishes but no datasets get created. If I delete the job and reharvest, the new job keeps running and never ends, and other harvest jobs don't update either.

How can I fix this?


Solution

  • @Urkonn, different things going on here:

    [1] https://github.com/ckan/ckanext-dcat/commit/ed186623d83cf3baf9dd29bdb13be7f1431b8ab8
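
    The KeyError in the traceback is raised by `r.headers['content-length']`, which fails whenever the remote server does not send a Content-Length header. The following is a minimal sketch of reading that header defensively; it is not necessarily what the linked commit does. The function names `declared_length` and `too_large` are hypothetical, and `headers` stands for any mapping with a `.get()` method, such as the `headers` attribute of a requests response.

```python
def declared_length(headers):
    """Return the declared body size in bytes, or None if absent or invalid."""
    raw = headers.get('content-length')  # .get() returns None instead of raising KeyError
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Header present but not a number: treat it as unknown
        return None


def too_large(headers, max_bytes):
    """True only when the server declares a size above max_bytes."""
    length = declared_length(headers)
    return length is not None and length > max_bytes


print(too_large({'content-length': '2048'}, 1024))  # True
print(too_large({}, 1024))                          # False: no header, no KeyError
```

    When the header is missing the size is simply unknown, so a caller that enforces a limit would still need to check the size of the body it actually downloads.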