Tags: apache-spark, spark-streaming, access-log, log-analysis

How to get data from a regularly appended log file in Apache Spark?


I have an Apache access log file that already contains some data and keeps growing. I want to analyze that data using the Apache Spark Streaming API.

Spark is new to me. I created a program that uses the jssc.textFileStream(directory) function to read the log data, but it does not work as I need.

Please suggest some approaches for analyzing that log file with Spark.

Here is my code.

    SparkConf conf = new SparkConf()
            .setMaster("spark://192.168.1.9:7077")
            .setAppName("log streaming")
            .setSparkHome("/usr/local/spark")
            .setJars(new String[] { "target/sparkstreamingdemo-0.0.1.jar" });
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
    JavaDStream<String> logLines = jssc.textFileStream("/home/user/logs");
    logLines.print();
    jssc.start();
    jssc.awaitTermination();

This code does not return any data from existing files. It only works when I create a new file; when I then append to that new file, the program again does not return the appended data.


Solution

  • If the file is modified in real time, you can use Tailer from Apache Commons IO. Here is the simplest example:

        import java.io.File;
        import java.util.concurrent.Executor;

        import org.apache.commons.io.input.Tailer;
        import org.apache.commons.io.input.TailerListener;
        import org.apache.commons.io.input.TailerListenerAdapter;

        public void readLogs(File f, long delay) {
            TailerListener listener = new MyTailerListener();
            Tailer tailer = new Tailer(f, listener, delay);

            // naive inline executor for demo purposes: it runs the tailer on the
            // calling thread, so readLogs() blocks until the tailer is stopped
            Executor executor = new Executor() {
                public void execute(Runnable command) {
                    command.run();
                }
            };
            executor.execute(tailer);
        }

        public class MyTailerListener extends TailerListenerAdapter {
            @Override
            public void handle(String line) {
                System.out.println(line);
            }
        }


    The code above can be used as a log reader for Apache Flume and plugged in as a source. You then configure a Flume sink to forward the collected logs to a Spark stream and use Spark Streaming to analyze the data coming from Flume (http://spark.apache.org/docs/latest/streaming-flume-integration.html); both hops are sketched at the end of this answer.

    More details about the Flume setup are in this post: real time log processing using apache spark streaming
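
    For the first hop (getting the tailed lines into Flume), one possible wiring, which is an assumption on my part rather than something spelled out in the linked post, is a listener that pushes every tailed line to the Avro source of a running Flume agent via the Flume RPC client SDK (flume-ng-sdk). The class name and the agent host/port are placeholders:

        import java.nio.charset.StandardCharsets;

        import org.apache.commons.io.input.TailerListenerAdapter;
        import org.apache.flume.Event;
        import org.apache.flume.EventDeliveryException;
        import org.apache.flume.api.RpcClient;
        import org.apache.flume.api.RpcClientFactory;
        import org.apache.flume.event.EventBuilder;

        // Forwards each tailed line to a Flume agent instead of printing it.
        public class FlumeForwardingTailerListener extends TailerListenerAdapter {

            private final RpcClient client;

            public FlumeForwardingTailerListener(String agentHost, int agentPort) {
                // Connects to the Avro source of a running Flume agent
                // (host and port must match the agent configuration).
                this.client = RpcClientFactory.getDefaultInstance(agentHost, agentPort);
            }

            @Override
            public void handle(String line) {
                Event event = EventBuilder.withBody(line, StandardCharsets.UTF_8);
                try {
                    client.append(event);
                } catch (EventDeliveryException e) {
                    // A real implementation would retry or rebuild the client here.
                    e.printStackTrace();
                }
            }
        }

    For the second hop, the Spark side consumes the Flume stream with the spark-streaming-flume module described in the integration guide linked above. A minimal sketch, assuming Spark 1.x with spark-streaming-flume on the classpath and a Flume Avro sink configured to push to 192.168.1.9:41414 (both values are placeholders; FlumeUtils.createStream starts a receiver listening on exactly that address):

        import java.nio.ByteBuffer;
        import java.nio.charset.StandardCharsets;

        import org.apache.spark.SparkConf;
        import org.apache.spark.streaming.Duration;
        import org.apache.spark.streaming.api.java.JavaDStream;
        import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
        import org.apache.spark.streaming.api.java.JavaStreamingContext;
        import org.apache.spark.streaming.flume.FlumeUtils;
        import org.apache.spark.streaming.flume.SparkFlumeEvent;

        public class FlumeLogAnalyzer {
            public static void main(String[] args) throws Exception {
                // Master, jars, etc. can be set as in the question or via spark-submit.
                SparkConf conf = new SparkConf().setAppName("flume log streaming");
                JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

                // Receiver listening for events pushed by the Flume Avro sink.
                JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
                        FlumeUtils.createStream(jssc, "192.168.1.9", 41414);

                // Turn each Flume event body back into the original log line.
                JavaDStream<String> logLines = flumeStream.map(event -> {
                    ByteBuffer body = event.event().getBody();
                    byte[] bytes = new byte[body.remaining()];
                    body.get(bytes);
                    return new String(bytes, StandardCharsets.UTF_8);
                });

                // Any further analysis (filtering, counting, windowing) goes here.
                logLines.print();
                jssc.start();
                jssc.awaitTermination();
            }
        }

    Note that the first sketch needs custom code only because it reuses the Tailer; Flume's built-in exec source running tail -F can feed the same agent without any code, so the Tailer approach mainly buys you control over parsing before the lines leave the machine.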