perl

Merging log files with different formats


I have two log files with different date/time formats that I would like to merge.

The first file is a standard Apache access_log file like so:

127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
173.37.239.93 - abcdefg [29/Feb/2016:16:57:52 -0600] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781

. . . where 'abcdedfg' = userid.

The second log file is formatted like so:

2016-02-12 08:16:03,630 WARN [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
2016-02-12 08:16:03,630 WARN [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
2016-02-12 08:16:03,630 WARN [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]

My goals are:

  1. convert the date/time formats in the first log file to the date/time format of the second log file
  2. knock off the IP addresses from the first log file but keep the userids.
  3. merge the two log files together
  4. sort on date/time.

Here is what I have so far --

$LOGFILE1 = "catalina.out";
$LOGFILE2 = "access_log";

open(LOGFILE1) or die("Could not open log file.");
foreach $line (<LOGFILE1>) {
    chomp($line);
    if ($line =~ /^2016.+$/) {
         print $line . "\n";
    }
}

open(LOGFILE2) or die("Could not open log file.");
foreach $line (<LOGFILE2>) {
chomp($line);
if ($line =~ /\d{2}\/\S{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{3}/) {
print $line . "\n";
}

    # format of file 1
    # DD/MMM/YYYY:HH:MM:SS -NNNN
    # 29/Feb/2016:20:03:07 -600
    # format of file 2
    # YYYY-MM-DD HH:MM:SS,NNN
    # 2016-02-12 08:16:03,631
}

So I'm basically only interested in lines with date/time info, so the above code is discarding the other lines.

Where I'm stuck is:
1) how do I convert the date/time format in file 1 into the data/time format of file 2?
2) I'm not interested in the IP addresses but I do want to keep the userids. Since file 1 does not start with date/time info like file 2, after converting, how will I sort on date after merging the two?

Any help would be appreciated!


Solution

  • Here is a solution using Time::Piece. I used Inline::Files to simulate the 2 files. You would need to open you logfiles like

    my $logfile1 = "catalina.out";
    my $logfile2 = "access_log";
    
    
    open my $log1_fh, '<', $logfile1 or die $1;
    open my $log2_fh, '<', $logfile2 or die $1;
    

    The program would look like this one, which gave me the results I think you want.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Inline::Files;
    use Time::Piece;
    
    my %data;
    
    while (<FILE2>) {
        # get date_time
        my ($dt) = /^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d),/ or next;
        push @{ $data{$dt} }, $_;
    }
    
    my $format = '%d/%b/%Y:%H:%M:%S';
    
    while (<FILE1>) {
        /\[(\S+)/;
        my $t = Time::Piece->strptime($1, $format)
            or die "Cannot parse $1. $!";
    
        my $dt = $t->strftime('%Y-%m-%d %H:%M:%S');
    
        s/^\S+ (?:- )+//;
        s/(?<=\[)[^\]]+/$dt/;
        push @{ $data{$dt} }, $_;
    }
    
    for my $dt (sort keys %data) {
        my $aref = $data{$dt};
        print for @$aref;   
    }
    
    
    __FILE1__
    127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
    127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
    127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
    173.37.239.93 - abcdefg [29/Feb/2016:16:57:52 -0600] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
    173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
    173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781
    __FILE2__
    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
    

    I used a hash %data to store the lines. The key is the transformed date so later in the program, you can print them in sorted order.

    The output from this program is:

    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
    2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
    [2016-02-29 16:57:52] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
    [2016-02-29 16:57:52] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
    [2016-02-29 16:57:52] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
    abcdefg [2016-02-29 16:57:52] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
    abcdefg [2016-02-29 16:57:53] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
    abcdefg [2016-02-29 16:57:53] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781