pythonwebarchivewarc

How to read a subset of records from a warc file


I'm trying to parse .warc files from Common Crawl in Python.

Since the files are huge, I want to start with a sample/subset of the first few records.

How do I truncate the file to include only the first X lines while preserving the newlines/carriage returns that are in place?

Here's what I tried already:

  1. head -n 250 oldfile > newfile. This removes some of the carriage returns that are needed to parse the file. Here's the error I get when I use this file in my Hadoop job (reading it with the warc package):

      Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
            super(MRJob, self).execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
            self.run_job()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
            runner.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
            self._run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
            self._invoke_step(step_num, 'mapper')
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
            self.per_step_runner_finish(step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
            self._wait_for_process(proc_dict, step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
            (proc_dict['args'], returncode, ''.join(tb_lines)))
        Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
        Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
            self.run_mapper(self.options.step_num)
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
            for out_key, out_value in mapper(key, value) or ():
          File "/var/cc-mrjob/mrcc.py", line 33, in mapper
            for i, record in enumerate(f):
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
            record = self.read_record()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
            header = self.read_header(fileobj)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
            raise IOError("Bad version line: %r" % version_line)
        IOError: Bad version line: 'WARC/1.0\n'
    
  2. Same as #1, but with the tail command.

  3. Same as #1, but using tr or sed afterwards to restore any lost newline or ^M (carriage return) characters. The warc package still complains that expected carriage returns or newlines are missing.
  4. unix2dos oldfile

Solution

  • Handling the newlines correctly is difficult because .warc files can contain binary data as well as text. Truncating at a line count would also most likely produce a broken .warc file: readers such as the Python warc library trust each record's Content-Length header, so a record cut off partway through its body no longer matches its declared length.

    The Python warc library reads one record at a time from the .warc file instead of loading the whole file into memory, so you can process a subset in Python alone. For example:

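    import warc
    from itertools import islice
    
    N = 10  # number of records to read from the start of the file
    warc_file = warc.open('/path/to/file.warc')
    # islice stops after N records, so the rest of the file is never read
    for record in islice(warc_file, N):
        do_stuff_with(record)  # placeholder for your own processing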
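
If a smaller file on disk is still needed, the Content-Length point above suggests copying whole records rather than lines. Below is a minimal sketch of that idea using only the standard library; it is not the warc package's own code. The names copy_first_records and make_record are hypothetical, introduced here for illustration. It assumes an uncompressed WARC stream opened in binary mode, where every record carries a Content-Length header and is followed by two CRLF pairs.

```python
import io


def copy_first_records(src, dst, n):
    """Copy up to n whole WARC records from binary stream src to dst.

    Returns the number of records copied. Hypothetical helper, not part
    of the warc package.
    """
    copied = 0
    while copied < n:
        # Skip the blank lines that separate records, then read the
        # version line (for example b"WARC/1.0\r\n").
        line = src.readline()
        while line in (b"\r\n", b"\n"):
            line = src.readline()
        if not line:
            break  # end of input
        if not line.startswith(b"WARC/"):
            raise ValueError("Bad version line: %r" % line)

        header_lines = [line]
        content_length = None
        # Header lines run up to and including the first empty line.
        while True:
            line = src.readline()
            if not line:
                raise ValueError("Truncated record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        if content_length is None:
            raise ValueError("Record has no Content-Length header")

        # Read the body by byte count, never by line, so binary payloads
        # and embedded newlines pass through untouched.
        body = src.read(content_length)
        if len(body) != content_length:
            raise ValueError("Truncated record body")

        dst.write(b"".join(header_lines))
        dst.write(body)
        dst.write(b"\r\n\r\n")  # record terminator
        copied += 1
    return copied


def make_record(payload):
    """Build a tiny illustrative WARC record around payload (bytes)."""
    length = str(len(payload)).encode("ascii")
    return (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: "
            + length + b"\r\n\r\n" + payload + b"\r\n\r\n")


# In-memory demonstration: three records, keep the first two.
sample = b"".join(make_record(p)
                  for p in (b"one\nline", b"two\r\n\r\nblank", b"three"))
out = io.BytesIO()
count = copy_first_records(io.BytesIO(sample), out, 2)
```

With real files, src and dst would be files opened with open(path, 'rb') and open(path, 'wb'). For a .warc.gz input, wrapping the source in gzip.open(path, 'rb') is one option, since Python's gzip module reads concatenated gzip members as a single stream.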