python mapreduce warc

MapReduce carriage return


I want to process CommonCrawl WARC files with MapReduce (Hadoop Streaming), reading the input from S3 through the s3a filesystem.

The problem is that the carriage return character at the end of each input line is removed and a tab is written in its place (tab being the default delimiter).

Why does this happen?

This is the command I use to start the MapReduce job:

time yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapred.compress.map.output=true \
  -D mapred.reduce.tasks=0 \
  -D mapred.job.name=cc \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  -files mapper.py \
  -archives wasbs://cluster@ccscsg.blob.core.windows.net/user/ubuntu/virtualenv/.venv2.zip#venv \
  -mapper mapper.py \
  -input s3a://commoncrawl/crawl-data/CC-MAIN-2018-39/segments/1537267155413.17/warc/CC-MAIN-20180918130631-20180918150631-00000.warc.gz \
  -output /output_warc

mapper.py

#!./venv/bin/python
import sys
for line in sys.stdin:
    sys.stdout.write(line)

Solution

  • You could set -D mapreduce.output.textoutputformat.separator=$'\r'. Note that this appends an \r to every output line, even if there wasn't one in the input.

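    A MapReduce job expects the mapper output to be a key/value pair, and the separator written between key and value in the output is set by mapreduce.output.textoutputformat.separator (the tab character is the default). A line that contains no tab is treated as a key with an empty value, so the separator ends up at the end of the line. The \r itself is lost on input, because the line reader treats \r\n as a line terminator and strips it.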

    By the way, WARC files are not text files: they contain binary payloads (PDFs, images), and the HTML has no fixed content encoding. Consider using a WARC parsing library (e.g., warcio), or simply use cc-mrjob or cc-pyspark to do the processing.
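
The key/value mechanism described above can be modeled in a few lines of Python. This is only a rough sketch, not Hadoop code: the function names split_key_value and write_record are made up for illustration, and it assumes that the CRLF terminator has already been dropped when the line was read and that mapper output is split at the first tab.

```python
def split_key_value(line, sep="\t"):
    """Split one mapper output line at the first separator.

    Illustrative model only: a line without a separator becomes the key
    and the value is the empty string.
    """
    key, _, value = line.partition(sep)
    return key, value


def write_record(key, value, sep="\t"):
    """Join key and value with the output separator, one record per line."""
    return key + sep + value + "\n"


# A WARC header line as stored in the file, terminated by CRLF.
raw = "WARC/1.0\r\n"

# Assumption: the line terminator is stripped before the mapper sees the line.
line = raw.rstrip("\r\n")

key, value = split_key_value(line)
with_tab = write_record(key, value)           # default separator (tab)
with_cr = write_record(key, value, sep="\r")  # separator overridden to \r

print(repr(with_tab))
print(repr(with_cr))
```

With the default separator the record ends in a tab followed by the newline; with the separator set to \r it ends in \r\n again, but so would any line that never had a carriage return in the input.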