I need to use the Schema option of the Perl module XML::LibXML::Reader to validate a large (>1GB) XML file while it is being parsed.
Previously I used the xmllint command to validate an XML file against a given schema (xsd) file. However, I now have some large XML files to validate, and I run out of memory (8GB) when trying to perform the validation.
I have read on the XML::LibXML::Reader module page that there is a Schema option. However, when I use it (see code below), the code exits as soon as the first invalid element of the XML file is found.
use strict;
use warnings;
use XML::LibXML::Reader;
my $SchemaFile='schema.xsd';
my $FileToAnalyse='/tmp/file.xml';
my $reader = XML::LibXML::Reader->new(location => $FileToAnalyse,Schema=>$SchemaFile) or
die "cannot read file '$FileToAnalyse': $!\n";
while($reader->read) {
    # Process the file node by node here, even if it is not valid against the schema (reduces memory usage for large files)
}
I need to collect the invalid entries and continue rather than exiting. Is this possible?
The reason $reader->read does not recover from schema validation errors (even if recovery could be possible) can be seen at line #8815 of LibXML.xs. Notice that REPORT_ERROR() is called with a zero value. The value indicates whether LibXML_report_error_ctx() will be able to recover from errors or not: a value of zero means it will not try to recover, and it will call XML::LibXML::Error::_report_error to die.
I changed the value to 1 at line #8815 and recompiled the XS module. It then reported the schema errors as warnings (instead of dying) and continued parsing.
I guess there is a good reason why this option is not made available to the user, but I am not familiar enough with XML parsing to give an example of what could go wrong here.
Edit:
It seems that the correct approach is to catch the exceptions thrown by read() and then call read() again. If the following call to read() returns -1, the parser was not able to recover from the error; if it returns 0, end-of-file was reached; and if it returns 1, it was able to recover from the exception. I did some testing, and it seems to be able to recover from schema validation errors but not from parsing errors. So you could try the following:
use feature qw(say);
use strict;
use warnings;
use Try::Tiny qw(try catch);
use XML::LibXML::Reader;
my $SchemaFile='schema.xsd';
my $FileToAnalyse='file.xml';
my $reader = XML::LibXML::Reader->new(
    location => $FileToAnalyse, Schema => $SchemaFile
) or die "cannot read file '$FileToAnalyse': $!\n";
while (1) {
    my $result;
    try { $result = $reader->read } catch {
        say '==> ' . $_;
        $result = 1; # Try to continue after exception..
    };
    last if $result != 1;
    if ( $reader->nodeType == XML_READER_ELEMENT ) {
        say "Element node: ", $reader->name;
    }
}
$reader->finish();
$reader->close();
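
Since the question asks to collect the invalid entries rather than print them, here is a minimal sketch of the same loop that stores the exceptions in an array. The helper read_collecting_errors, its $max_failures cap, and the FakeReader stub are all hypothetical names introduced for illustration; none of them are part of XML::LibXML. The stub only imitates the read() contract described above (return 1, 0 or -1, or die), so the example runs without an XML file. It uses a plain eval instead of Try::Tiny to avoid a dependency. The cap is an assumption of mine to stop the loop if read() keeps dying, for example after a parsing error.

```perl
use strict;
use warnings;

# Hypothetical helper: drive any object with a read() method that returns
# 1, 0 or -1 (or dies) and collect the exceptions instead of printing them.
sub read_collecting_errors {
    my ($reader, $on_node, $max_failures) = @_;
    $max_failures //= 10;    # assumed safety cap, not an XML::LibXML option
    my @errors;
    my $failures_in_a_row = 0;

    while (1) {
        my $result = eval { $reader->read };
        if (defined $result) {
            last if $result != 1;    # 0: end of file, -1: could not recover
            $failures_in_a_row = 0;
        }
        else {
            # read() died: store the message, then carry on as if it returned 1
            (my $message = "$@") =~ s/\s+\z//;
            push @errors, $message;
            last if ++$failures_in_a_row >= $max_failures;
        }
        $on_node->($reader) if $on_node;
    }
    return \@errors;
}

# Stand-in for XML::LibXML::Reader: replays a script of return values and
# dies on any entry that is not an integer.
{
    package FakeReader;
    sub new {
        my ($class, @script) = @_;
        return bless { script => [@script] }, $class;
    }
    sub read {
        my ($self) = @_;
        my $step = shift @{ $self->{script} };
        return 0 unless defined $step;    # script exhausted: end of file
        die "$step\n" unless $step =~ /^-?\d+$/;
        return $step;
    }
}

my $reader  = FakeReader->new(1, 'invalid element <a>', 1, 'invalid element <b>', 1, 0);
my $visited = 0;
my $errors  = read_collecting_errors($reader, sub { $visited++ });

print scalar(@$errors), " error(s), $visited node(s) visited\n";
print "  $_\n" for @$errors;
```

With the real module, you would pass the XML::LibXML::Reader object created with the Schema option in place of FakeReader, and a callback that inspects $reader->nodeType and $reader->name as in the loop above. Whether a given error is recoverable is still decided by the reader itself.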