i am using newsplease library that i have cloned from https://github.com/fhamborg/news-please. i want to use newsplease to get news artices from commoncrawl news datasets. i am running commoncrawl.py file as instruct here. i have used the command below -
python -m newsplease.examples.commoncrawl
on executing the following command i am getting following errors -
my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
rec_headers = self.arc_parser.parse(stream, statusline)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
main()
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
continue_process=True)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
self.__run()
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
self.__process_warc_gz_file(local_path_name)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
for record in ArchiveIterator(stream):
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
self.record = self._next_record(self.next_line)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
self.check_digests)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
known_format))
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
what is the error here how can i resolve this.
https://github.com/fhamborg/news-please says that adopt the config section in
newsplease/examples/commoncrawl.py.
what does this mean ?
i have copied the configurations from this file and pasted in
config.cfg which is present in the newsplease/config directory.
is this what thay have instructed ? or i have made a mistake here.
i am using python 3.6. i have only one python installed in my machine.
this error is because of the libraries being used by the newsplease. mistake is made when we manually install every library, while installing focus on the versions of packages. version info of every library is given in setup.py file. install exact version given in setup.py file. now there may be problems while executing the setup.py.
so use this command -
python3 setup.py install
if you need to uninstall all the previous verions of installed packeges then run -
pip3 freeze --user | xargs pip3 uninstall -y
for more ways to do this click here