I am trying to implement a similar script in my project, following this blog post: https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/
The code of the spider class from the source:
import re
import textract
from itertools import chain
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tempfile import NamedTemporaryFile

control_chars = ''.join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(control_chars))
TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx", ""]

class CustomLinkExtractor(LinkExtractor):
    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        # Keep the default values in "deny_extensions" *except* for those types we want.
        self.deny_extensions = [ext for ext in self.deny_extensions if ext not in TEXTRACT_EXTENSIONS]

class ItsyBitsySpider(CrawlSpider):
    name = "itsy_bitsy"
    start_urls = [
        'https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html'
    ]

    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=True, callback="parse_item"),)
        super(ItsyBitsySpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        if hasattr(response, "text"):
            # The response is text - we assume html. Normally we'd do something
            # with this, but this demo is just about binary content, so...
            pass
        else:
            # We assume the response is binary data
            # One-liner for testing if "response.url" ends with any of TEXTRACT_EXTENSIONS
            extension = list(filter(lambda x: response.url.lower().endswith(x), TEXTRACT_EXTENSIONS))[0]
            if extension:
                # This is a pdf or something else that Textract can process
                # Create a temporary file with the correct extension.
                tempfile = NamedTemporaryFile(suffix=extension)
                tempfile.write(response.body)
                tempfile.flush()
                extracted_data = textract.process(tempfile.name)
                extracted_data = extracted_data.decode('utf-8')
                extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
                tempfile.close()
                with open("scraped_content.txt", "a") as f:
                    f.write(response.url.upper())
                    f.write("\n")
                    f.write(extracted_data)
                    f.write("\n\n")
I am using Python 3.10 on Windows 10. When I run the spider with scrapy crawl, it returns the following errors:
PS C:\Users\USER\Desktop\git repo\tut> scrapy crawl itsy_bitsy
2021-12-12 22:43:10 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: tut)
2021-12-12 22:43:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.4.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 35.0.0, Platform Windows-10-10.0.19042-SP0
2021-12-12 22:43:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-12 22:43:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tut',
'NEWSPIDER_MODULE': 'tut.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['tut.spiders']}
2021-12-12 22:43:10 [scrapy.extensions.telnet] INFO: Telnet Password: ##
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-12 22:43:10 [scrapy.core.engine] INFO: Spider opened
2021-12-12 22:43:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-12 22:43:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/robots.txt> (referer: None)
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html> (referer: None)
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.docx> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.docx> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
extracted_data = textract.process(tempfile.name)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
return parser.process(filename, input_encoding, output_encoding, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\docx_parser.py", line 11, in extract
return docx2txt.process(filename)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
zipf = zipfile.ZipFile(docx)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\zipfile.py", line 1240, in __init__
self.fp = io.open(file, filemode)
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\tmpvp9upczz.docx'
2021-12-12 22:43:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.pdf> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.pdf> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 87, in run
pipe = subprocess.Popen(
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 966, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1435, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
extracted_data = textract.process(tempfile.name)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
return parser.process(filename, input_encoding, output_encoding, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 29, in extract
raise ex
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 21, in extract
return self.extract_pdftotext(filename, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 44, in extract_pdftotext
stdout, _ = self.run(args)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 95, in run
raise exceptions.ShellError(
textract.exceptions.ShellError: The command `pdftotext C:\Users\USER\AppData\Local\Temp\tmpg2cla7xb.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
2021-12-12 22:43:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.doc> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:14 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.doc> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
extracted_data = textract.process(tempfile.name)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
return parser.process(filename, input_encoding, output_encoding, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
stdout, stderr = self.run(['antiword', filename])
File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 106, in run
raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\Users\USER\AppData\Local\Temp\tmpndf_bon7.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Traceback (most recent call last):\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\runpy.py", line 196, in _run_module_as_main\r\n return _run_code(code, main_globals, None,\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\runpy.py", line 86, in _run_code\r\n exec(code, run_globals)\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\Scripts\\antiword.exe\\__main__.py", line 7, in <module>\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\antiword.py", line 21, in main\r\n r = run(cmd)\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 501, in run\r\n with Popen(*popenargs, **kwargs) as process:\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 966, in __init__\r\n self._execute_child(args, executable, preexec_fn, close_fds,\r\n File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 1435, in _execute_child\r\n hp, ht, pid, tid = _winapi.CreateProcess(executable, args,\r\nFileNotFoundError: [WinError 2] The system cannot find the file specified\r\n'
2021-12-12 22:43:14 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-12 22:43:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1649,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 46050,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 3.548882,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 12, 16, 43, 14, 330047),
'httpcompression/response_bytes': 230,
'httpcompression/response_count': 1,
'log_count/DEBUG': 5,
'log_count/ERROR': 3,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/PermissionError': 1,
'spider_exceptions/ShellError': 2,
'start_time': datetime.datetime(2021, 12, 12, 16, 43, 10, 781165)}
2021-12-12 22:43:14 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\USER\Desktop\git repo\tut>
I have installed all the pip packages mentioned in the blog post. I think the problem comes from an error in the antiword module, even though it also installed successfully as a pip package. Please help me troubleshoot.
This program was meant to be run on Linux, so there are a few steps you need to take to get it running on Windows.
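1. Install the libraries. textract runs the pdftotext command for PDF files, and the ShellError with exit code 127 in your log is that command not being found.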
Installation in Anaconda:
conda install -c conda-forge poppler
conda install -c conda-forge pdftotext
Installation in Pip:
pip install python-poppler
pip install pdftotext
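2. Download antiword, extract the folder to C:\ (important), and add it to PATH. The antiword pip package appears to be only a wrapper: the stderr in your log shows antiword.py failing with FileNotFoundError when it tries to launch the actual program.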
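3. The PermissionError happens because textract tries to open the temporary file while your code still has it open. On Windows, a NamedTemporaryFile cannot be opened a second time by name until it has been closed, so create it with delete=False and close it before processing.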
Change:
tempfile = NamedTemporaryFile(suffix=extension)
tempfile.write(response.body)
tempfile.flush()
extracted_data = textract.process(tempfile.name)
extracted_data = extracted_data.decode('utf-8')
extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
tempfile.close()
to:
tempfile = NamedTemporaryFile(suffix=extension, delete=False)
tempfile.write(response.body)
tempfile.close()
extracted_data = textract.process(tempfile.name)
extracted_data = extracted_data.decode('utf-8')
extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
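Note that with delete=False nothing removes the temporary file afterwards, so the files accumulate in the Temp folder. One way to handle that is sketched below. The helper name process_via_tempfile is hypothetical (it is not part of textract or the blog post), and it takes the processing function as a parameter so it can be tried without textract installed:

```python
import os
from tempfile import NamedTemporaryFile


def process_via_tempfile(body, suffix, process):
    """Write `body` to a closed temp file, run `process(path)`, then delete it.

    Hypothetical helper: `process` is any callable that takes a file path,
    for example textract.process.
    """
    tmp = NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(body)
        # Close before processing: on Windows the file cannot be opened a
        # second time by name while this handle is still open.
        tmp.close()
        return process(tmp.name)
    finally:
        tmp.close()          # no-op if the file is already closed
        os.remove(tmp.name)  # delete=False means cleanup is up to us
```

In the spider this would replace the block above with something like extracted_data = process_via_tempfile(response.body, extension, textract.process), followed by the same decode and CONTROL_CHAR_RE.sub lines.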
4. Open a new terminal to reload the PATH environment variables
5. Run scrapy crawl itsy_bitsy
and enjoy.
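If it still fails, it may help to check from Python whether the external commands are actually visible on PATH. This is a small sketch using the standard library's shutil.which; the function name locate_tools is made up for this example, and the two command names are the ones that appear in the ShellError messages in the log:

```python
import shutil


def locate_tools(names=("pdftotext", "antiword")):
    """Map each command name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Look at the printed path for antiword. If it points into Python's Scripts folder, that is the pip wrapper seen in the log rather than the program you extracted in step 2.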