PHP - How to identify and count only parent elements of a very large XML efficiently


I have a very large XML file in the following format (this is a very small snippet showing two of the sections).

<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>

I would like to quickly find the instances of all the parent elements (i.e. Game and Platform in the above example) to count them but also to extract the contents.

To complicate matters, there is also a Platform "child" inside Game (which I don't want to count). I only want the parent, i.e. I do not want Game -> Platform, only the top-level Platform.

From a combination of this site and Google I came up with the following function code:

$attributeCount = 0;

$xml = new XMLReader();
$xml->open($xmlFile);
$elements = new \XMLElementIterator($xml, $sectionNameWereGetting);
// $sectionNameWereGetting is a variable that changes to Game and Platform etc

foreach ($elements as $key => $indElement) {
    if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
        $parseElement = new SimpleXMLElement($xml->readOuterXML());
        // NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
        $thisCount = $parseElement->count();
        unset($parseElement);
        if ($thisCount == 0) {
            // IF THERE'S NO CHILDREN THEN SKIP THIS ELEMENT
            continue;
        }
        // IF THERE IS CHILDREN THEN INCREMENT THE COUNT
        // - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
        // - AND PUT THEM IN THE DATABASE
        $attributeCount++;
    }
}
unset($elements);
$xml->close();
unset($xml);

return  $attributeCount;

I'm using the excellent script by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php

This does work. But I think assigning a new SimpleXMLElement is slowing the operation down.

I only need the SimpleXMLElement to check whether the element has children. I use that to work out whether the element is nested inside another parent: a top-level element will have children, so I want to count it, while a nested one won't, so I want to ignore it.

But perhaps there is a better solution than counting children? i.e. a $xml->isParent() function or something?

The current function times out before it has fully counted all the sections of the XML (there are around 8 different sections and some of them have several hundred thousand records).

How can I make this process more efficient? I'm also using similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.

Also worth noting that I'm not particularly good at programming so please feel free to point out other mistakes I may have made so that I can improve.


Solution

  • You do not need to serialize the XML to load it into DOM or SimpleXML. You can expand the current node into a DOM document with XMLReader::expand():

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    
    $document = new DOMDocument();
    
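    // navigate to the first node of interest using read()/next(),
    // storing the return value of the last call in $found
    
    while ($found) {
      // expand the current node into DOM
      $node = $reader->expand($document);
      // import the DOM node into SimpleXML
      $simpleXMLObject = simplexml_import_dom($node);
     
      // navigate to the next node of interest using read()/next(),
      // updating $found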
    }
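
A fuller sketch of the skeleton above, under stated assumptions: the inline XML is a shortened sample based on the question's file, and $names is an illustrative variable that is not part of the original code. It walks the direct children of the document element, expands each one into DOM, and reads its Name through SimpleXML.

```php
<?php
// Shortened sample data (assumption: the real file has the same shape).
$xml = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
  </Platform>
</LaunchBox>
XML;

$reader = new XMLReader();
$reader->open('data:application/xml;base64,' . base64_encode($xml));

$document = new DOMDocument();
$names = [];

// advance to the document element (depth 0)
do {
    $found = $reader->read();
} while ($found && $reader->nodeType !== XMLReader::ELEMENT);

// step to the first child node of the document element (depth 1)
if ($found) {
    $found = $reader->read();
}

while ($found && $reader->depth === 1) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        // expand only this element and its descendants into DOM
        $node = $reader->expand($document);
        // wrap the DOM node in SimpleXML to read its children
        $element = simplexml_import_dom($node);
        $names[$reader->localName][] = (string) $element->Name;
    }
    // skip the descendants and go to the next sibling node
    $found = $reader->next();
}
$reader->close();

print_r($names);
```

Only one top-level element is held in memory at a time, so this should scale to large files; the database insert would go where the name is collected.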
    

    However, counting the element children of the document element can be done with just the right calls to XMLReader::read() and XMLReader::next(). read() navigates to the following node, including descendants, while next() goes to the following sibling node, ignoring the descendants.

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    
    $found = false;
    // look for the document element
    do {
      $found = $found ? $reader->next() : $reader->read();
    } while (
      $found && 
      $reader->localName !== 'LaunchBox'
    );
    
    // go to first child of the document element
    if ($found) {
        $found = $reader->read();
    }
    
    $counts = [];
    
    // found a node at depth 1 
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            // count the element under its local name
            $counts[$reader->localName] = ($counts[$reader->localName] ?? 0) + 1;
        }
        // go to next sibling node
        $found = $reader->next();
    }
    
    var_dump($counts);
    
    
    function getXMLDataURL() {
       $xml = <<<'XML'
    <?xml version="1.0" standalone="yes"?>
    <LaunchBox>
      <Game>
        <Name>Violet</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Game>
        <Name>Wishbringer</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Platform>
        <Name>3DO Interactive Multiplayer</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
        <Developer>The 3DO Company</Developer>
      </Platform>
      <Platform>
        <Name>Commodore Amiga</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
        <Developer>Commodore International</Developer>
      </Platform>
    </LaunchBox>
    XML;
        return 'data:application/xml;base64,'.base64_encode($xml);
    }
    

    Output:

    array(2) {
      ["Game"]=>
      int(2)
      ["Platform"]=>
      int(2)
    }
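
For the question's use case of counting one named section at a time, the same loop could be wrapped in a function. This is a minimal sketch: countTopLevelElements() is a hypothetical helper (not part of XMLReader or XMLReaderIterator), and the inline XML is shortened sample data.

```php
<?php
/**
 * Count elements named $name that are direct children of the document element.
 * Hypothetical helper for illustration only.
 */
function countTopLevelElements(string $uri, string $name): int
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    // advance to the document element (depth 0)
    do {
        $found = $reader->read();
    } while ($found && $reader->nodeType !== XMLReader::ELEMENT);

    // step to its first child node (depth 1)
    if ($found) {
        $found = $reader->read();
    }

    $count = 0;
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
            $count++;
        }
        // next() skips the subtree, so Game -> Platform is never visited
        $found = $reader->next();
    }
    $reader->close();

    return $count;
}

// Shortened sample data (assumption: same shape as the real file).
$xml = '<LaunchBox>'
    . '<Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>'
    . '<Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>'
    . '<Platform><Name>Commodore Amiga</Name></Platform>'
    . '</LaunchBox>';
$uri = 'data:application/xml;base64,' . base64_encode($xml);

$gameCount = countTopLevelElements($uri, 'Game');
$platformCount = countTopLevelElements($uri, 'Platform');
echo "Game: $gameCount, Platform: $platformCount\n"; // prints "Game: 2, Platform: 1"
```

Because next() never descends into Game, the nested Platform elements are skipped without loading anything into SimpleXML. Calling the function once per section name re-reads the file each time, so counting all names in a single pass, as in the answer above, is likely the better fit when every section is needed.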