
Exception in newsplease commoncrawl.py file


I am using the newsplease library, which I cloned from https://github.com/fhamborg/news-please. I want to use it to get news articles from the Common Crawl news datasets. I am running the commoncrawl.py file as instructed there, using the command below:

python -m newsplease.examples.commoncrawl

Executing this command produces the following errors:

my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
    rec_headers = self.arc_parser.parse(stream, statusline)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
    raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
    main()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
    continue_process=True)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
    for record in ArchiveIterator(stream):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

What is the error here, and how can I resolve it?

https://github.com/fhamborg/news-please says to adapt the config section in newsplease/examples/commoncrawl.py. What does this mean?
I copied the configuration from this file and pasted it into config.cfg, which is in the newsplease/config directory. Is this what they instructed, or have I made a mistake here?

I am using Python 3.6, and it is the only Python installed on my machine.
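
The last line of the traceback reports that the file handed to warcio starts with `<?xml`, which suggests the cached file in ./cc_download_warc/ may not be an archive at all. The following is a minimal sketch for checking that, not part of news-please; the function name `describe_download` and its return strings are made up for illustration. It only looks at the first few bytes of the file.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def describe_download(path):
    """Return a rough guess at what a downloaded file contains."""
    with open(path, "rb") as f:
        head = f.read(5)
    if head.startswith(GZIP_MAGIC):
        # Gzip container: peek at the first decompressed bytes.
        with gzip.open(path, "rb") as gz:
            inner = gz.read(5)
        return "warc.gz" if inner.startswith(b"WARC/") else "gzip, but not WARC"
    if head.startswith(b"<?xml"):
        return "xml (probably an error or listing page, not an archive)"
    if head.startswith(b"WARC/"):
        return "uncompressed warc"
    return "unknown"
```

Calling it on the path printed in the "found local file" log line shows whether that file is a real WARC archive. If it reports XML, the log indicates the existing local file is reused rather than downloaded again, so it would likely have to be deleted before the next run.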


Solution

  • This error is caused by the versions of the libraries that newsplease depends on. The mistake usually happens when every library is installed manually without paying attention to package versions. The required version of every library is given in the setup.py file, so install exactly the versions listed there. There may be problems when executing setup.py,

    so use this command:

    python3 setup.py install
    

    If you need to uninstall all previously installed package versions first, run the following (note that it removes every package installed with --user):

    pip3 freeze --user | xargs pip3 uninstall -y
    

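
To see which installed packages differ from the versions in setup.py, a small script along these lines may help. It is only a sketch: `find_mismatches` and `installed_version` are hypothetical helper names, the pins in the example are placeholders to be replaced with the real entries from setup.py, and it compares exact version strings only (it does not understand ranges such as `>=`). On Python 3.6 and 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python < 3.8: needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Compare pinned versions with what is installed.

    pins maps package name -> required version string (copied from setup.py).
    Returns {name: (required, installed)} for every package that differs.
    """
    mismatches = {}
    for name, required in pins.items():
        found = lookup(name)
        if found != required:
            mismatches[name] = (required, found)
    return mismatches


if __name__ == "__main__":
    # Placeholder pins: replace with the names and versions from setup.py.
    example_pins = {"some-package": "1.2.3"}
    for name, (required, found) in find_mismatches(example_pins).items():
        print(f"{name}: setup.py wants {required}, installed is {found}")
```

An empty result means every listed package matches its pin; a `None` in the installed column means the package is not installed at all.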