I have a data stream in JSON format that my script accesses from an internal website. My script converts the JSON to a perl hash using JSON.pm (I'm using perl 5.10.1 on RHEL 6.9)
Within this hash are multiple nested hashes, and nested arrays, some of which are nested within other hashes/arrays inside of the big hash.
I need to walk the entire structure of the hash, including all of the arrays and nested hashes, and remove any keys anywhere in the entire structure, which share the same name as any other key (only for a specific key name though).
Additionally, because of how the data is structured, some nested hashes have ONLY keys that are now deleted, leaving the value for some keys as an empty hash. I also need to remove those keys which have an empty hash for its value
Here is my data after its conversion to perl:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'husky' => {
'name' => 'fred'
},
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'husky' => 'wilma',
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {
'husky' => 'dino'
},
},
],
}
We want to remove all keys named 'husky'
Here is what it should look like:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'labrador' => 'betty'
},
'c' => 'pebbles'
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
],
}
Here is what I get after I added @Shawn's code and made a tweak to it (this is very close, but we need to account for the empty hashes:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {},
},
]
}
I've tried a few variations found elsewhere on SO and on perlmonks. keys %$_ == 0
, !%$_
to name a few. But none seem to work with this hash slice.
Code:
use 5.008008;
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.
[-snipped a ton of code-]
sub _count_keys
{
my ($j, $seen) = @ARG;
my $type = ref $j;
if ($type eq "ARRAY")
{
for (@{$j})
{
_count_keys($ARG, $seen);
}
}
elsif ($type eq "HASH")
{
while (my ($key, $val) = each %{$j})
{
$seen->{$key}++;
if (ref $val)
{
_count_keys($val, $seen);
}
}
}
return $seen;
}
sub _remove_duplicate_keys
{
my ($j, $seen) = @ARG;
$seen //= _count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY")
{
return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
}
elsif ($type eq "HASH")
{
my %obj = %{$j};
delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
# delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};
while (my ($key, $val) = each %obj)
{
if (ref $val)
{
$obj{$key} = _remove_duplicate_keys($val, $seen);
}
}
return \%obj;
}
else
{
return $j;
}
}
sub _process_json
{
my $JSONOUTPUT = shift;
my $OPTIONS = shift;
# Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
foreach (@{$JSONOUTPUT})
{
s{true(,\n)}{1$1}gxms;
s{false(,\n)}{0$1}gxms;
}
my $JSONPERLOBJ = JSON->new->utf8->decode(@{$JSONOUTPUT});
# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
# $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
[-snip private code-]
my @JSONOUTPUT = $RESPONSE->decoded_content;
my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);
exit ($RC == 1)?0:1;
}
I think this does what you want:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS
my $j = <<EOJSON;
{
"foo": 1,
"bar": {
"foo": true,
"baz": false
},
"dog": "woof",
"cat": [ { "foo": 3 } ]
}
EOJSON
sub count_keys {
my ($j, $seen) = @_;
my $type = ref $j;
if ($type eq "ARRAY") {
count_keys($_, $seen) for @$j;
} elsif ($type eq "HASH") {
while (my ($key, $val) = each %$j) {
$seen->{$key}++;
count_keys($val, $seen) if ref $val;
}
}
return $seen;
}
sub remove_dups {
my ($j, $seen) = @_;
$seen //= count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY") {
return [ map { remove_dups($_, $seen) } @$j ];
} elsif ($type eq "HASH") {
my %obj = %$j;
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
while (my ($key, $val) = each %obj) {
$obj{$key} = remove_dups($val, $seen) if ref $val;
}
return \%obj;
} else {
return $j;
}
}
my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);
produces
Before:
{
"bar" : {
"baz" : false,
"foo" : true
},
"cat" : [
{
"foo" : 3
}
],
"dog" : "woof",
"foo" : 1
}
After:
{
"bar" : {
"baz" : false
},
"cat" : [
{}
],
"dog" : "woof"
}
Edit for explanation:
The first time remove_dups
is called on a perl data structure representing a json value (Which doesn't have to be a json object), it calls count_keys
to recursively walk the structure and create a hash of all the keys and the number of times each one occurs. Then it again recursively walks the structure, returning a deep copy without keys that appeared more than once in the original.
This line is the real magic:
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
It uses a hash slice to delete a bunch of keys all at once, with the grep bit returning a list of keys that appeared more than once. More information on slices.