perlhashlog-analysisset-unionperl5.8

combine keys of hashes for output (outer join of hashes)


I'm analysing a log file with Perl 5.8.8.[1] I'm searching for days that expose some of two trigger patterns, maybe one of it, maybe both (I changed the actual patterns in the code snippet shown below). I'm interested in the count of occurrences per day, next step will be to make a spreadsheet of it, that's why the output formatting with tabs.

Because only one of the patterns may occur in a day, I need a way to combine the keys of both hashes. I did by generating a new hash. Is there a built-in function for that? I searched the web and stack overflow without any result, the only hit I got here was Build a string from 2 hashes, but in that case the key sets were identical.

#!/usr/bin/perl -w
use strict;
use warnings;
use locale;

# input analysis: searching for two patterns:
my %pattern_a = ();
my %pattern_b = ();
foreach my $line (<>) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        if ($what =~ m/beendet/) {
            $pattern_a{$day} ++;
        } elsif ($what =~ m/ohne/) {
            $pattern_b{$day} ++;
        }
    }
}

# generate the union of hash keys:        <-- In Question
my %union = ();
$union{$_} = 1 for keys %pattern_a;
$union{$_} = 1 for keys %pattern_b;

# formatted output sorted by day:
foreach my $day (sort keys %union) {
    print join "\t", $day, 
            ($pattern_a{$day} || 0), 
            ($pattern_b{$day} || 0)."\n";
}

The expected output would look like this:

2017-02-01      0       1
2017-02-18      0       592
2017-02-19      2       0

[1] I'm aware that this Perl version is quite outdated. But I'm using Perl rarely, but when I do, it has to go fast. So figuring out Perl versions and so on gets done later. But the Perl version is not so important for the actual question, at least I hope so...


Solution

  • It's easier to structure your data first by day, then by pattern. That can be done using a hash reference.

    use strict;
    use warnings;
    
    my %matches;
    while ( my $line = <DATA> ) {
        if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
            my $day = $1;
            my $what = $2;
            if ($what =~ m/beendet/) {
                $matches{$day}->{a} ++;
            } elsif ($what =~ m/ohne/) {
                $matches{$day}->{b} ++;
            }
        }
    }
    
    # formatted output sorted by day:
    foreach my $day (sort keys %matches) {
        print join(
            "\t",
            $day,
            $matches{$day}->{a} || 0,
            $matches{$day}->{b} || 0,
        ), "\n";
    }
    
    __DATA__
    2017-02-01 einmal Pommes ohne
    2017-02-02 Wartung gestartet
    2017-02-02 Wartung beendet
    2017-02-03 ohne Moos nix los
    

    That program produces output as follows

    2017-02-01  0   1
    2017-02-02  1   0
    2017-02-03  0   1
    

    To understand the data structure, you can use Data::Dumper to output it (though I suggest using Data::Printer instead, as that's intended for human consumption and not as a serialization).

    use Data::Dumper;
    print Dumper \%matches;
    __END__
    
    $VAR1 = {
              '2017-02-03' => {
                                'b' => 1
                              },
              '2017-02-02' => {
                                'a' => 1
                              },
              '2017-02-01' => {
                                'b' => 1
                              }
            };
    

    As you can see, the data is structured first by date. Each key represents one day. Inside, there is an additional hash reference that only holds one key. That's the pattern. Later we iterate the day first. Then we get

    {
        'b' => 1
    }
    

    in the first iteration. Then we iterate all the patterns. The above program does this not by actually iterating, but by explicitly stating each possible key. If it's there it's used. If it's not defined, it's set to 0 with the || operator.


    The program can be further simplified to use arbitrary patterns. If you don't care about the order of the patterns in the output, include a header and you can easily add more patterns later.

    I used a config hash for the patterns, and Text::Table to create the output.

    use strict;
    use warnings;
    use Text::Table;
    
    my %matches;
    my %patterns = (
        beendet => qr/beendet/,
        ohne    => qr/ohne/,
        komplex => qr/foo\sbar?/, # or whatever
    );
    while ( my $line = <DATA> ) {
        if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
            my $day = $1;
            my $what = $2;
            foreach my $name ( sort keys %patterns ) {
                if ( $what =~ $patterns{$name} ) {
                    $matches{$day}->{$name}++ ;
                    last;
                }
            }
        }
    }
    
    # formatted output sorted by day:
    my @head = sort keys %patterns;
    my $tb = Text::Table->new( 'Tag', @head );
    
    foreach my $day (sort keys %matches) {
        $tb->load([ $day, map { $matches{$day}->{$_} || 0 } @head ]);
    }
    
    print $tb;
    
    __DATA__
    2017-02-01 einmal Pommes ohne
    2017-02-02 Wartung gestartet
    2017-02-02 Wartung beendet
    2017-02-03 ohne Moos nix los
    

    This prints

    Tag        beendet komplex ohne
    2017-02-01 0       0       1   
    2017-02-02 1       0       0   
    2017-02-03 0       0       1   
    

    If you don't want to install an additional module, maybe just create a CSV file. Since you're from Germany, I suggest a semicolon ; as the separator, because German Excel uses that as the default.

    Here is a verbose example of how to do this instead of Text::Table.

    my @head = sort keys %patterns;
    print join( ';', @head ), "\n";
    foreach my $day (sort keys %matches) {
        my @cols;
        push @cols, $matches{$day}->{$_} || 0 for @head;
        print join ';', $day, @cols;
        print "\n";
    }
    

    And the output is

    beendet;komplex;ohne
    2017-02-01;0;0;1
    2017-02-02;1;0;0
    2017-02-03;0;0;1
    

    But you should also look into Text::CSV if you don't want this to go to the screen.