I'm using Scrapy to crawl XML data from an archive that uses the OAI-PMH framework. I'm not very familiar with how OAI-PMH might affect Scrapy, but there seems to be a problem when I run the following command:
scrapy view http://fukushima.archive-disasters.jp/infolib/oai_repository/repository?verb=ListRecords&metadataPrefix=ndlkn
Instead of the website opening in my browser, a Notepad file opens up with the following:
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2013-12-30T00:11:45Z</responseDate>
<request>http://fukushima.archive-disasters.jp/infolib/oai_repository/repository</request>
<error code="badArgument">It is an inaccurate parameter.</error>
And the following comes up on the command line:
[default] INFO: Spider closed (finished)
'metadataPrefix' is not recognized as an internal or external command, operable program or batch file.
The only time metadataPrefix shows up in the XML is in the 3rd line:
<request metadataPrefix="ndlkn" verb="ListRecords">
Is there any way I can use this website with Scrapy's "view" command?
I'm also having trouble accessing the XML data itself through the Scrapy shell. In the previous version of Scrapy, after calling remove_namespaces() I could access all the records on the page using sel.xpath('//record'), but now that returns [] and I'm having trouble figuring out the correct XPath.
Here's what the commands and their output look like:
scrapy shell http://fukushima.archive-disasters.jp/infolib/oai_repository/repository?verb=ListRecords&metadataPrefix=ndlkn
Typical scrapy output, then:
>>> sel.remove_namespaces()
>>> sel.xpath('//record')
[]
>>> sel.xpath('//OAI-PMH')
[<Selector xpath='//OAI-PMH' data=u'<OAI-PMH xmlns="http://www.openarchives.'>]
>>> sel.xpath('//OAI-PMH/request')
[<Selector xpath='//OAI-PMH/request' data=u'<request xmlns="http://www.openarchives.'>]
>>> sel.xpath('//OAI-PMH/ListRecords')
[]
What xpaths do I need to use?
Sorry for the long question. I'm just worried that the two issues are linked and that OAI-PMH is causing problems here. Please let me know if I should break this up, or if there is any other way I can make it clearer.
EDIT: I feel super dumb, but I realized the problem. Because there's an & in the URL, the URL needs to be in quotes (or the & escaped) when it is passed to scrapy view or scrapy shell. That fixes both of my issues! Hope this helps anyone in the future.
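For anyone hitting the same empty //record result, it may help to check the response for an OAI-PMH error element before looking for records. Below is a minimal sketch using only the standard library. It assumes the error response shown above, with the root element closed so that it parses; the helper names oai_errors and count_records are made up for illustration and are not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <OAI-PMH> root element in the response above.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Error response reproduced from the output above. The closing </OAI-PMH>
# tag is an assumption, added so the snippet is a complete document.
ERROR_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<responseDate>2013-12-30T00:11:45Z</responseDate>
<request>http://fukushima.archive-disasters.jp/infolib/oai_repository/repository</request>
<error code="badArgument">It is an inaccurate parameter.</error>
</OAI-PMH>"""


def oai_errors(xml_bytes):
    """Return (code, message) pairs for each <error> child of the root."""
    root = ET.fromstring(xml_bytes)
    return [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall("oai:error", OAI_NS)
    ]


def count_records(xml_bytes):
    """Count <record> elements under <ListRecords>, namespaces included."""
    root = ET.fromstring(xml_bytes)
    return len(root.findall("oai:ListRecords/oai:record", OAI_NS))


print(oai_errors(ERROR_RESPONSE))     # the badArgument error from above
print(count_records(ERROR_RESPONSE))  # no records in an error response
```

If oai_errors returns anything, the request itself was rejected, so no XPath will find records in that response.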
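I realized my mistake. Because there's an & in the URL, the URL needs to be wrapped in double quotes, or the & escaped (with ^ in the Windows command prompt), when it is passed to scrapy view or scrapy shell. Otherwise the command prompt treats the & as a command separator: Scrapy receives only the part of the URL up to verb=ListRecords, which explains the badArgument error and the empty //record result, and metadataPrefix=ndlkn is then run as a separate command, which explains the "not recognized" message. Quoting the URL fixes both of my issues! Hope this helps anyone in the future.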
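To make the effect of the unquoted & concrete, here is a small sketch that splits the URL from the question the way the command prompt does and compares the query arguments Scrapy would see in each case. It uses only the standard library; the helper name query_args is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

FULL_URL = (
    "http://fukushima.archive-disasters.jp/infolib/oai_repository/"
    "repository?verb=ListRecords&metadataPrefix=ndlkn"
)


def query_args(url):
    """Return the query string of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


# Unquoted, the command line is cut at the first &: the left part is the
# URL argument Scrapy receives, the right part is run as its own command.
seen_by_scrapy, leftover = FULL_URL.split("&", 1)

print(query_args(seen_by_scrapy))  # only 'verb' survives
print(leftover)                    # what the shell tried to execute
print(query_args(FULL_URL))        # both arguments, as with a quoted URL
```

Without metadataPrefix the ListRecords request is incomplete, which matches the badArgument response shown in the question.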