I'm trying to parse .warc files from Common Crawl in Python.
Since the files are huge, I want to start with a sample/subset of the first few records.
How do I truncate the file to only include the first X lines while preserving the newlines/carriage returns that are in place?
Here's what I tried already:
1. head -n 250 oldfile > newfile
This removes some of the returns that are needed to parse the file. Here's the error I get if I try to use this file in my Hadoop job (reading it with the warc package):
Traceback (most recent call last):
File "test.py", line 46, in <module>
TagGrabber.run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
mr_job.execute()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
super(MRJob, self).execute()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
self.run_job()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
runner.run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
self._run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
self._invoke_step(step_num, 'mapper')
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
self.per_step_runner_finish(step_num)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
self._wait_for_process(proc_dict, step_num)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
(proc_dict['args'], returncode, ''.join(tb_lines)))
Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
Traceback (most recent call last):
File "test.py", line 46, in <module>
TagGrabber.run()
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
mr_job.execute()
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
self.run_mapper(self.options.step_num)
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "/var/cc-mrjob/mrcc.py", line 33, in mapper
for i, record in enumerate(f):
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
record = self.read_record()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
header = self.read_header(fileobj)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/1.0\n'
2. Same as #1 but with the tail command.
3. Running tr or sed afterwards to replace any lost newline or ^M (carriage return) characters. The warc package still complains that expected carriage returns or newlines are not in place.
4. unix2dos oldfile
It would be difficult to handle newlines correctly because .warc files may contain binary data as well. Truncating by line count would also probably produce broken .warc files, since the Python library, for example, trusts that the Content-Length headers are valid.
The warc Python library reads only one record at a time from the .warc file (rather than loading the entire file into memory at once), so it is possible to handle subsets using Python alone. For example:
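import warc
from itertools import islice

N = 10  # number of records to process
warc_file = warc.open('/path/to/file.warc')
# islice stops after N records, so the rest of the file is never read
for record in islice(warc_file, N):
    do_stuff_with(record)  # placeholder for your own processing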
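
If you really need a smaller file on disk, one option is to cut on record boundaries instead of line boundaries. The following is only a minimal sketch using the standard library, not something the warc package provides: copy_first_records is a hypothetical helper name, and it assumes an uncompressed WARC file in which every record carries a Content-Length header and is followed by two CRLF pairs. Bytes are copied unchanged, so the newlines and carriage returns inside each record are preserved.

```python
def copy_first_records(src, dst, n):
    """Copy the first n WARC records from binary stream src to dst.

    Hypothetical helper: assumes uncompressed input and that every
    record has a Content-Length header. Returns the number copied.
    """
    copied = 0
    while copied < n:
        version = src.readline()
        if not version:
            break  # end of input before n records were found
        if not version.startswith(b"WARC/"):
            raise ValueError("Bad version line: %r" % version)

        header = [version]
        length = None
        while True:
            line = src.readline()
            if not line:
                raise ValueError("unexpected end of file in record header")
            header.append(line)
            if line == b"\r\n":
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")

        body = src.read(length)   # may contain binary data and bare newlines
        trailer = src.read(4)     # the two CRLF pairs that follow each record
        dst.write(b"".join(header) + body + trailer)
        copied += 1
    return copied
```

Open both files in binary mode, for example open('oldfile', 'rb') and open('newfile', 'wb'), so that no newline translation takes place. For gzipped input, a file object from gzip.open(path, 'rb') should work as the source, although the output is then written uncompressed.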