regex, xml, perl, xml-parsing, cpan

Remove entries with specific keys from an XML file in Perl


I have XML files that look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- some comment here -->
<rsccat version="1.0" locale="en_US" product="some_prouduct" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../../../../product/resources/schema/msgcat.xsd">
  <message>

    <entry key="entry1" lol="false">
        <![CDATA[
            <actions>
                <action id="hmm" type="nothing">
                    <cmd>456</cmd>
                    <msg id="123"></msg>
                </action>
            </actions>
        ]]>
    </entry>

<entry key="entry2">message2 </entry>
<entry key="entry3">message3 </entry>

<entry key="entry4">
    <actions hello="yes">
    <action type="lol">
    <cmd>rolf</cmd>
    <txt>omg</txt>
    </action>
    </actions> </entry>



</message>
</rsccat>

I would like to write a Perl function that takes the path of an XML file and a list of keys, and removes the entries with those keys entirely, without leaving any whitespace or blank lines behind. In addition, the blank lines already present in the original XML file should be preserved, for instance the three blank lines after the entry with key entry4.

I have written a function which removes the entries without leaving any blank lines, but it also removes the existing blank lines in the XML file.

use File::Slurp;  
sub findReplaceFile
{
    my ($filename, @keys) = @_;  

    my $filetext = read_file($filename);

    foreach my $key (@keys) 
    {
        chomp($key);  # remove newline characters
        my $regex = qr/<entry\s+key\s*=\s*"${key}".*?>.+?<\/entry>/s;
        $filetext =~ s/$regex//gs;  # replacing with empty string
        $filetext =~ s/\n\s*\n/\n/g;  # removing extra line
    }
}

Please help me reach this goal. I am fine with either an XML parser module in Perl or plain old regex.


Solution

  • Here is an example that does not use any modules. I have never used File::Slurp, so I can only guess whether it strips line breaks while reading the file. In any case, the substitution s/\n\s*\n/\n/g in your function collapses every run of blank lines, including the ones you want to keep, so the example below leaves it out. File app.pl:

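    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my $path = "data.xml";
    findReplaceFile($path, "entry2", "entry4");
    
    # Remove every <entry> whose key attribute is one of @keys.
    sub findReplaceFile {
        my ($filename, @keys) = @_;
        my $data = readData($filename);
        foreach my $key (@keys) {
            # \Q...\E quotes the key; \1 matches the same quote that opened the value.
            # Indentation before the entry (when it starts a line) and the newline
            # after it are removed as well, so no empty line is left behind.
            $data =~ s/(?:^[ \t]*)?<entry\b[^>]*?\skey\s*=\s*(["'])\Q$key\E\1[^>]*>.*?<\/entry>[ \t]*\n?//gms;
        }
        writeData($filename, $data);
    }
    
    # Write $data to "$path.dat", leaving the file at $path untouched.
    sub writeData {
        my $path = shift || "data.txt";
        my $data = shift || die "writeData needs the data to write as its second argument";
        if (-e $path) {
            open my $fh, '>', "$path.dat" or die "Can't open file '$path.dat' for writing: $!";
            print $fh $data;
            close $fh;
        }
    }
    
    # Return the whole content of the text file at $path as one string.
    sub readData {
        my $path = shift || "data.txt";
        my $data = "";
        if (-e $path and -T $path and -r $path) {
            open my $fh, '<', $path or die "Can't open file '$path' for reading: $!";
            $data = join("", <$fh>);
            close $fh;
        } else {
            die "File '$path' doesn't exist or is not a readable text file";
        }
        return $data;
    }
    

    This code does not modify your original XML. It saves the result in a separate file whose name is the original file name with ".dat" appended, in the line:

    open my $fh, '>', "$path.dat" or die "Can't open file '$path.dat' for writing: $!";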
    

    Note also that this code reads the whole file into memory. If your file grows very large, you will need to rewrite the algorithm to read the file line by line, checking and removing entries on the fly.
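
    A line-by-line variant could look roughly like the sketch below. It is not part of the code above: the helper name remove_entries_streaming is made up for illustration, and it assumes that every opening <entry ...> tag starts on its own line and fits on one line, and that nothing follows </entry> on its line. It works on filehandles, so the demo uses in-memory handles instead of real files.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: copy $in to $out, skipping every <entry> whose key
    # is listed in @keys. Only one line is held in memory at a time.
    sub remove_entries_streaming {
        my ($in, $out, @keys) = @_;
        die "no keys given" unless @keys;

        # One alternation for all keys; quotemeta protects regex metacharacters.
        my $alternation = join '|', map { quotemeta } @keys;
        my $open_re = qr/^\s*<entry\b[^>]*?\skey\s*=\s*(["'])(?:$alternation)\1[^>]*>/;

        my $skipping = 0;
        while (my $line = <$in>) {
            $skipping = 1 if !$skipping && $line =~ $open_re;
            if ($skipping) {
                # Drop lines up to and including the one that closes the entry.
                $skipping = 0 if $line =~ m{</entry>};
                next;
            }
            print {$out} $line;
        }
        return;
    }

    # Demo on an in-memory copy of a small document.
    my $xml = join "\n",
        '<message>',
        '<entry key="entry2">message2 </entry>',
        '<entry key="entry3">message3 </entry>',
        '',
        '<entry key="entry4">',
        '    <cmd>rolf</cmd>',
        '    </actions> </entry>',
        '',
        '</message>',
        '';

    my $result = '';
    open my $in,  '<', \$xml    or die "Can't open in-memory input: $!";
    open my $out, '>', \$result or die "Can't open in-memory output: $!";
    remove_entries_streaming($in, $out, 'entry2', 'entry4');
    close $in;
    close $out;

    print $result;
    ```

    For real files, open $in on the source path and $out on a separate output path in the same way. The blank lines around the removed entries are copied through unchanged, because only the lines belonging to a matching entry are skipped.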

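    The following one-liner does the same job from the terminal. The -0400 switch makes perl read the whole file at once (any value of 0400 or above does this; 0777 is the conventional choice). Specify the key numbers in the alternation: (?:1|3) removes entry1 and entry3, (?:1|3|2) removes entry1, entry3 and entry2, and so on.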

    perl -i.dat -ps0400e "s/<entry[^>]+key=(.?)entry(?:1|3)\1[^>]*?>.*?<\/entry>\n?//gmis" data.xml
    

    The difference is that here the original file is kept as a backup with the .dat extension appended (data.xml.dat), and the result is written to the file with the original name.
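
    If the keys are only known at run time, the alternation can be built from a list instead of being typed by hand. The following is a minimal sketch under that assumption; build_entry_regex is a hypothetical helper name, not something from the code above, and it expects at least one key.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build one pattern that matches an <entry> element
    # whose key attribute equals any of the given keys.
    sub build_entry_regex {
        my @keys = @_;
        die "no keys given" unless @keys;
        # quotemeta keeps characters such as "." or "+" in a key literal.
        my $alternation = join '|', map { quotemeta } @keys;
        return qr/<entry\b[^>]*?\skey\s*=\s*(["'])(?:$alternation)\1[^>]*>.*?<\/entry>\n?/s;
    }

    my $re   = build_entry_regex('entry1', 'entry3');
    my $text = qq{<entry key="entry1">a</entry>\n}
             . qq{<entry key="entry2">b</entry>\n}
             . qq{<entry key="entry3">c</entry>\n};

    # Remove all matching entries in a single pass over the text.
    (my $cleaned = $text) =~ s/$re//g;
    print $cleaned;
    ```

    Building one pattern for all keys means the text is scanned once, rather than once per key as in the foreach loop.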