My question is as follows:
I have to read a big XML file, 50 MB; and anonymise some tags/fields that relate to private issues, like name surname address, email, phone number, etc...
I know exactly which tags in XML are to be anonymised.
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha
and beta
refer to any characters within, which will also be hashed, using probably an algorithm like MD5.
I will only convert the tag value, not the tags themselves.
I hope, I am clear enough about my problem. How do I achieve this?
Using regexps is indeed dangerous, unless you know exactly the format of the file, it's easy to parse with regexps, and you are sure that it will not change in the future.
Otherwise you could indeed use XML::Twig,as below. An alternative would be to use XML::LibXML, although the file might be a bit big to load it entirely in memory (then again, maybe not, memory is cheap these days) so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 'md5_base64';
my @tags_to_anonymize= qw( name surname address email phone);
# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;
XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
->parsefile( "my_big_file.xml")
->flush;