
Merging two YAML files does not handle duplicates?


I am trying to merge two YAML files using the Hash::Merge Perl module, and then write the result to a YAML file using Dump from the YAML module.

use strict;
use warnings;
use Hash::Merge qw( merge );
Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');
use File::Slurp qw(write_file);
use YAML;
my $yaml1 = $ARGV[0];
my $yaml2 = $ARGV[1];
my $yaml_output = $ARGV[2];
my $clkgrps = &YAML::LoadFile($yaml1);
my $clkgrps1 = &YAML::LoadFile($yaml2);
my $clockgroups = merge($clkgrps1, $clkgrps);
my $out_yaml = Dump $clockgroups;
write_file($yaml_output, { binmode => ':raw' }, $out_yaml);

After merging the YAML files, I see duplicate entries: content that is identical in both files is treated as different entries by the merge. Is there a built-in way to handle duplicates?


Solution

  • The data structures obtained from YAML files generally contain keys with values being arrayrefs with hashrefs. In your test case that's the arrayref for the key test.

    Then a tool like Hash::Merge can only add the hashrefs to the arrayref belonging to the same key; it is not meant to compare array elements, as there aren't general criteria for that. So you need to do this yourself in order to prune duplicates, or apply any specific rules of your choice to data.

    One way to handle this is to serialize (stringify) the complex data structures in each arrayref that may contain duplicates, so that they can be used as keys of a hash. That is the standard way to detect duplicates, with O(1) average cost per lookup, albeit possibly with a large constant.

    There are a number of ways to serialize data in Perl. I'd recommend JSON::XS, as a very fast tool with output that can be used by any language and tool. (But please research others of course, that may suit your precise needs better.)

    A simple, complete example using your test cases:

    use strict;
    use warnings;
    use feature 'say';
    use Data::Dump qw(dd pp);
    
    use YAML;
    use JSON::XS;
    use Hash::Merge qw( merge );
    #Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');  # irrelevant here
    
    die "Usage: $0 in-file1 in-file2 output-file\n" if @ARGV != 3;
    
    my ($yaml1, $yaml2, $yaml_out) = @ARGV;
    
    my $hr1 = YAML::LoadFile($yaml1);
    my $hr2 = YAML::LoadFile($yaml2);
    my $merged = merge($hr2, $hr1);
    #say "merged: ", pp $merged;
    
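    # canonical() sorts hash keys, so equal hashrefs serialize identically
    my $json = JSON::XS->new->utf8->canonical;
    
    for my $key (keys %$merged) {
        # Only arrayrefs can hold duplicate elements
        next unless ref $merged->{$key} eq 'ARRAY';
        
        # Identical serialized elements collapse into a single hash key
        my %uniq = map { $json->encode($_) => 1 } @{$merged->{$key}};
        
        # Overwrite the arrayref with the one without dupes
        # (element order is not preserved)
        $merged->{$key} = [ map { $json->decode($_) } keys %uniq ];
    }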
    dd $merged;
    
    # Save the final structure...
    

    More complex data structures require a more judicious traversal; consider using a tool for that.

    With the files shown in the question, this prints:

    {
      test => [
        { directory => "LIB_DIR", name => "ObsSel.ktc", project => "TOT" },
        { directory => "MODEL_DIR", name => "pipe.v", project => "TOT" },
        {
          directory => "PCIE_LIB_DIR",
          name => "pciechip.ktc",
          project => "PCIE_MODE",
        },
        { directory => "NAME_DIR", name => "fame.v", project => "SINGH" },
        { directory => "TREE_PROJECT", name => "Syn.yml", project => "TOT" },
      ],
    }
    

    (I use Data::Dump to show complex data, for its simplicity and default compact output.)

    If there are issues with serializing and comparing entire structures consider using a digest (checksum, hashing) of some sort.
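
    A minimal sketch of the digest idea, under some assumptions: it uses the core modules JSON::PP (standing in for JSON::XS so that it runs without extra installs) and Digest::MD5, and the `@items` list is made-up sample data representing one merged arrayref. Each element is serialized canonically and only its fixed-length checksum is kept as the hash key.

    ```perl
    use strict;
    use warnings;
    use JSON::PP;                  # core module, used here in place of JSON::XS
    use Digest::MD5 qw(md5_hex);

    # canonical() sorts hash keys, so equal hashrefs serialize to equal strings
    my $json = JSON::PP->new->utf8->canonical;

    # Hypothetical sample data standing in for one merged arrayref
    my @items = (
        { directory => 'MODEL_DIR', name => 'pipe.v', project => 'TOT' },
        { directory => 'NAME_DIR',  name => 'fame.v', project => 'SINGH' },
        { project => 'TOT', name => 'pipe.v', directory => 'MODEL_DIR' },  # same as the first
    );

    my %seen;
    # Keep the first occurrence of each element; the key is the digest
    # of the serialized element rather than the full string
    my @unique = grep { !$seen{ md5_hex( $json->encode($_) ) }++ } @items;

    print scalar(@unique), " unique entries\n";
    ```

    Using grep with a `%seen` hash also keeps the elements in their original order, which a plain `keys %uniq` does not.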

    Another option altogether would be to compare data structures as they are in order to resolve duplicates, by hand. For comparison of complex data structures I like to use Test::More, which works very nicely for mere comparisons outside of any testing. But there are dedicated tools as well of course, like Data::Compare.
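
    As a rough illustration of comparing by hand, here is a sketch with a small recursive comparison. The `deep_equal` helper and the sample `@items` are hypothetical, written only for this example; a dedicated module such as Data::Compare could take the helper's place. Each element is compared against the ones already kept, so this is quadratic in the number of elements.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: structural equality for plain scalars,
    # arrayrefs and hashrefs (other reference types compare by address)
    sub deep_equal {
        my ($x, $y) = @_;
        return 0 if ref $x ne ref $y;

        if (ref $x eq 'HASH') {
            return 0 if keys %$x != keys %$y;
            for my $k (keys %$x) {
                return 0 unless exists $y->{$k} && deep_equal($x->{$k}, $y->{$k});
            }
            return 1;
        }
        if (ref $x eq 'ARRAY') {
            return 0 if @$x != @$y;
            for my $i (0 .. $#$x) {
                return 0 unless deep_equal($x->[$i], $y->[$i]);
            }
            return 1;
        }

        # Leaves: undef equals only undef, everything else compares as a string
        return defined $y ? 0 : 1 if !defined $x;
        return (defined $y && $x eq $y) ? 1 : 0;
    }

    # Hypothetical sample data standing in for one merged arrayref
    my @items = (
        { directory => 'LIB_DIR',  name => 'ObsSel.ktc', project => 'TOT' },
        { directory => 'NAME_DIR', name => 'fame.v',     project => 'SINGH' },
        { directory => 'LIB_DIR',  name => 'ObsSel.ktc', project => 'TOT' },  # duplicate
    );

    my @unique;
    for my $item (@items) {
        # Keep the element only if nothing equal to it was kept already
        push @unique, $item unless grep { deep_equal($item, $_) } @unique;
    }

    print scalar(@unique), " unique entries\n";
    ```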


    Finally, instead of manually processing the result of a naive merge as above, one can code the desired behavior using Hash::Merge::add_behavior_spec and then have the module do it all. For specific examples of how to use this feature see, for instance, this post, this post, and this post.

    Note that in this case you still write all the code that does the job, as above, but the module takes some of the mechanics off your hands.