java, hadoop, bigdata, apache-crunch

Apache Crunch unable to write output


It might be an oversight, but I can't spot why Apache Crunch won't write output to a file in a very simple program I am writing to learn Crunch.

Here's the code:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target.WriteMode;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

....
private Pipeline                  pipeline;
private Configuration             etlConf;

....
this.etlConf  = getConf();
this.pipeline = new MRPipeline(TestETL.class, etlConf);
....

// Read file
logger.info("Reading input file: " + inputFileURI.toString());
PCollection<String> input = pipeline.readTextFile(inputFileURI.toString());

System.out.println("INPUT SIZE = " + input.asCollection().getValue().size());

// Write file 
logger.info("Writing Final output to file: " + outputFileURI.toString());
input.write(
    To.textFile(outputFileURI.toString()),
    WriteMode.OVERWRITE
);

This is the logging I see when I execute this jar using hadoop:

18/12/31 09:41:51 INFO etl.TestClass: Executing Test run
18/12/31 09:41:51 INFO etl.TestETL: Reading input file: /user/sw029693/process_analyzer/input/input.txt
INPUT SIZE = 3
18/12/31 09:41:51 INFO etl.TestETL: Writing Final output to file: 
/user/sw029693/process_analyzer/output/occurences
18/12/31 09:41:51 INFO impl.FileTargetImpl: Will write output files to new path: /user/sw029693/process_analyzer/output/occurences
18/12/31 09:41:51 INFO etl.TestETL: Cleaning-up TestETL run
18/12/31 09:41:51 INFO etl.TestETL: ETL completed with status 0.

The input file is very simple and looks like this:

this is line 1
this is line 2
this is line 3

Although the logging indicates a write should have happened to the output location, I see no files being created. Any thoughts?


Solution

  • package com.hadoop.crunch;
    
    import java.io.*;
    import java.util.Collection;
    import java.util.Iterator;
    
    import org.apache.crunch.*;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.util.*;
    import org.apache.log4j.Logger;
    
    public class App extends Configured implements Tool, Serializable{
        private static final long serialVersionUID = 1L;
        private static Logger LOG = Logger.getLogger(App.class);
    
        @Override
        public int run(String[] args) throws Exception {
            final Path fileSource = new Path(args[0]);
            final Path outFileName = new Path(args[1], "event-" + System.currentTimeMillis() + ".txt");
    
            //MRPipeline translates the overall pipeline into one or more MapReduce jobs
            Pipeline pipeline = new MRPipeline(App.class, getConf());
            //Specify the input data to the pipeline. 
            //The input data is contained in PCollection
            PCollection<String> inDataPipe = pipeline.read(From.textFile(fileSource));
    
            //inject an operation into the crunch data pipeline
            PObject<Collection<String>> dataCollection = inDataPipe.asCollection();
    
            //iterate over the collection 
            Iterator<String> iterator = dataCollection.getValue().iterator();
            FileSystem fs = FileSystem.getLocal(getConf());
            BufferedWriter bufferedWriter = new BufferedWriter(new OutputStreamWriter(fs.create(outFileName, true)));
    
            while(iterator.hasNext()){
                String data = iterator.next().toString();
                bufferedWriter.write(data);
                bufferedWriter.newLine();
            }
    
            bufferedWriter.close();
    
            //Start the execution of the crunch pipeline, trigger the creation & execution of MR jobs
            PipelineResult result = pipeline.done();
    
            return result.succeeded() ? 0 : 1;
        }
    
        public static void main(String[] args) {
            if (args.length != 2) throw new RuntimeException("Usage: hadoop jar <inputPath> <outputPath>");
            try {
                ToolRunner.run(new Configuration(), new App(), args );
            } catch (Exception e) {
                LOG.error(e.getLocalizedMessage());
            }
        }
    
    }
    

    Usage: Run as a Java program with two arguments: the first is the input file name or directory, the second is the output directory. The output file name is event-<timestamp>.txt, and remember there is a single space between args[0] and args[1]: /user/sw029693/process_analyzer/input/input.txt /user/sw029693/process_analyzer/input/
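
    For what it's worth, the most likely reason the original snippet produced no output is that Crunch pipelines are lazy: `write()` only registers a target, and nothing runs until `pipeline.run()` or `pipeline.done()` is called. The snippet also calls `input.asCollection().getValue()`, which forces an immediate pipeline run *before* the write target is registered. A minimal sketch of the write-based fix (class and argument names mirror the question; this assumes a working Hadoop/Crunch setup and is not a tested drop-in):

    ```java
    package com.hadoop.crunch;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.PipelineResult;
    import org.apache.crunch.Target.WriteMode;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.To;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class TestETL extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Pipeline pipeline = new MRPipeline(TestETL.class, getConf());

            // Lazy: this only declares a source, no data is read yet.
            PCollection<String> input = pipeline.readTextFile(args[0]);

            // Lazy as well: write() only registers the output target.
            input.write(To.textFile(args[1]), WriteMode.OVERWRITE);

            // The missing step in the question: done() plans and executes
            // the MapReduce job(s) and materializes all registered targets.
            PipelineResult result = pipeline.done();
            return result.succeeded() ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new TestETL(), args));
        }
    }
    ```

    Avoid calling `asCollection().getValue()` just to inspect sizes during development; it triggers a full pipeline run on its own. If you need a count, register the write target first and check the output after `done()`.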