excelvbaweb-scrapingmsxml6msxml3

Why does parsing XML document using MSXML v3.0 work, but MSXML v6.0 doesn't


So, I am working on a project that scrapes and collects data from many different sources around the internet with many different methods depending on each source's characteristics.

The most recent addition is a web API call which returns the following XML as a response:

<?xml version="1.0"?>
<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0">
    <mRID>29b526a69b9445a7bb507ba446e3e8f9</mRID>
    <revisionNumber>1</revisionNumber>
    <type>A44</type>
    <sender_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</sender_MarketParticipant.mRID>
    <sender_MarketParticipant.marketRole.type>A32</sender_MarketParticipant.marketRole.type>
    <receiver_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</receiver_MarketParticipant.mRID>
    <receiver_MarketParticipant.marketRole.type>A33</receiver_MarketParticipant.marketRole.type>
    <createdDateTime>2019-09-19T11:28:51Z</createdDateTime>
    <period.timeInterval>
        <start>2019-09-18T22:00Z</start>
        <end>2019-09-19T22:00Z</end>
    </period.timeInterval>
    <TimeSeries>
        <mRID>1</mRID>
        <businessType>A62</businessType>
        <in_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</in_Domain.mRID>
        <out_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</out_Domain.mRID>
        <currency_Unit.name>EUR</currency_Unit.name>
        <price_Measure_Unit.name>MWH</price_Measure_Unit.name>
        <curveType>A01</curveType>
        <Period>
            <timeInterval>
                <start>2019-09-18T22:00Z</start>
                <end>2019-09-19T22:00Z</end>
            </timeInterval>
            <resolution>PT60M</resolution>
            <Point>
                <position>1</position>
                <price.amount>44.08</price.amount>
            </Point>
            <Point>
                <position>2</position>
                <price.amount>37.14</price.amount>
            </Point>
            <Point>
                <position>3</position>
                <price.amount>32.21</price.amount>
            </Point>
            <Point>
                <position>4</position>
                <price.amount>31.44</price.amount>
            </Point>
            <Point>
                <position>5</position>
                <price.amount>32.48</price.amount>
            </Point>
            <Point>
                <position>6</position>
                <price.amount>45.52</price.amount>
            </Point>
            <Point>
                <position>7</position>
                <price.amount>56.05</price.amount>
            </Point>
            <Point>
                <position>8</position>
                <price.amount>74.96</price.amount>
            </Point>
            <Point>
                <position>9</position>
                <price.amount>74.08</price.amount>
            </Point>
            <Point>
                <position>10</position>
                <price.amount>69.03</price.amount>
            </Point>
            <Point>
                <position>11</position>
                <price.amount>72.89</price.amount>
            </Point>
            <Point>
                <position>12</position>
                <price.amount>68.91</price.amount>
            </Point>
            <Point>
                <position>13</position>
                <price.amount>74.95</price.amount>
            </Point>
            <Point>
                <position>14</position>
                <price.amount>72.91</price.amount>
            </Point>
            <Point>
                <position>15</position>
                <price.amount>75.97</price.amount>
            </Point>
            <Point>
                <position>16</position>
                <price.amount>76.49</price.amount>
            </Point>
            <Point>
                <position>17</position>
                <price.amount>59.08</price.amount>
            </Point>
            <Point>
                <position>18</position>
                <price.amount>60.19</price.amount>
            </Point>
            <Point>
                <position>19</position>
                <price.amount>64.69</price.amount>
            </Point>
            <Point>
                <position>20</position>
                <price.amount>69.18</price.amount>
            </Point>
            <Point>
                <position>21</position>
                <price.amount>64.97</price.amount>
            </Point>
            <Point>
                <position>22</position>
                <price.amount>63.38</price.amount>
            </Point>
            <Point>
                <position>23</position>
                <price.amount>52.92</price.amount>
            </Point>
            <Point>
                <position>24</position>
                <price.amount>48.08</price.amount>
            </Point>
        </Period>
    </TimeSeries>
</Publication_MarketDocument> 

Having dealt successfully with situations like that using Microsoft XML, v6.0 I tried the following:

Dim respXML As New MSXML2.DOMDocument60
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1")) 'for the sake of the post's simplicity I'm loading the xml from excel
Debug.Print respXML.getElementsByTagName("price.amount").Length

This should be returning 24 but instead it returns 0. Indeed the following:

Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

returns True, which means that the <price.amount></price.amount> elements are not being found. However, Debug.Print respXML.XML yields the expected results.

I read somewhere that early binding could be causing problems so I tried the following as well:

Dim respXML As Object
Set respXML = CreateObject("MSXML2.DOMDocument.6.0")
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1"))
Debug.Print respXML.getElementsByTagName("price.amount").Length
Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

Still the results are the same.

Switching to Microsoft XML, v3.0 resolves the issue completely.

However, I would prefer sticking to v6.0 since it's the one that is more actively being maintained and supported.

Why does this happen? Does it have to do with the XML itself? Does it have to do with my code? Am I missing something? Is there a way to make it work with Microsoft XML, v6.0?

Any input would be appreciated.


Solution

  • To extend @CindyMeister's answer, the issue does appear to be namespace handling between the MSXML versions using getElementsByTagName(). Specifically, your XML maintains an xmlns attribute without colon identified prefix which requires DOM libraries to assign a prefix when parsing content:

    <Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0" ...
    

    However, using SelectionNamespaces + SelectNodes to define a temporary alias, such as doc, to default namespace prefix, both libraries print out expected results. And MS docs even advises the latter method (emphasis added):

    The getElementsByTagName method simulates the matching of the provided argument against the result of the tagName property of IXMLDOMElement. When executed, it does not recognize or support namespaces. Instead, you should use the selectNodes method, which is faster in some cases and can support more complex searches.

    MXSML v3.0 (prints unexpected getElementsByTagName result)

    Sub ParseXMLv3()
        Dim respXML As New MSXML2.DOMDocument30
    
        respXML.Load "C:\Path\To\Input.xml"
        respXML.setProperty "SelectionLanguage", "XPath"
        respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"
    
        Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
        Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
        Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 24
    
        Set respXML = Nothing
    End Sub
    

    MSXML v6.0

    Sub ParseXMLv6()
        Dim respXML As New MSXML2.DOMDocument60
    
        respXML.Load "C:\Path\To\Input.xml"
        respXML.setProperty "SelectionLanguage", "XPath"
        respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"
    
        Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
        Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
        Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 0
    
        Set respXML = Nothing
    End Sub