asp.net vb.net iis file-processing

Read large XML file from webserver without splitting in smaller chunks


I'm downloading a file from a 3rd party server, like so:

Try
    req = DirectCast(HttpWebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)
    req.Timeout = 100000 '100 seconds
    Resp = DirectCast(req.GetResponse(), HttpWebResponse)
    reader = New StreamReader(Resp.GetResponseStream)
    responseString = reader.ReadToEnd()
Catch ex As Exception

End Try

The file my.xml is 1.2 GB, and I'm getting the error "Exception of type 'System.OutOfMemoryException' was thrown." When I open Windows Task Manager, I see memory usage at just 70% of total available memory, and the IIS Worker Process is not growing to use the full system memory. Then I found this: https://learn.microsoft.com/en-us/archive/blogs/tom/chat-question-memory-limits-for-32-bit-and-64-bit-processes, so the failure at 70% sounds about right.

So now I'm considering splitting the file into smaller, more manageable chunks. However, how can I do this without creating separate files? Is there a way to load, for example, 100 MB into memory at a time (respecting XML node endings), or perhaps to read X number of XML nodes at a time?

When I Google "Read large XML file from webserver without splitting in smaller chunks", I get nothing but file-splitting tools.

UPDATE 1

Based on Lex Li's suggestion I searched and found this tutorial: https://learn.microsoft.com/en-us/dotnet/standard/linq/perform-streaming-transform-large-xml-documents

So I translated the code, which works as per the tutorial:

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While reader.Read()

            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Customer" Then

                While reader.Read()

                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Name" Then
                        name = TryCast(XElement.ReadFrom(reader), XElement)
                        Exit While
                    End If
                End While

                While reader.Read()
                    If reader.NodeType = XmlNodeType.EndElement Then Exit While

                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Item" Then
                        item = TryCast(XElement.ReadFrom(reader), XElement)

                        If item IsNot Nothing Then
                            Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                            tempRoot.Add(item)
                            Yield item
                        End If
                    End If
                End While
            End If
        End While
    End Using
End Function

Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement) = From el In StreamCustomerItem("https://www.example.com/source.xml") Select New XElement("Item", New XElement("Customer", CStr(el.Parent.Element("Name"))), New XElement(el.Element("Key")))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True

    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
        xw.WriteStartElement("Root")

        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next

        xw.WriteEndElement()
    End Using

End Sub

The example above transforms source.xml into Output.xml, but all I want is to read the product nodes exactly as they are (no transformation needed), one node at a time, so that I can process large XML files.

I tried to rewrite it so that it extracts values matching the structure of my own XML. As a first step, I tried just reading something from my XML file, like so:

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Id" Then
                name = TryCast(XElement.ReadFrom(reader), XElement)
                item = TryCast(XElement.ReadFrom(reader), XElement)

                If item IsNot Nothing Then
                    Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                    tempRoot.Add(item)
                    Yield item
                End If

                Exit While
            End If
        End While
    End Using
End Function

Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement)

    srcTree = From el In StreamCustomerItem("https://www.example.com/mysource.xml")
              Select New XElement("product", New XElement("product", CStr(el.Parent.Element("Id"))))


    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True

    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
        xw.WriteStartElement("Root")

        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next

        xw.WriteEndElement()
    End Using


End Sub

That just writes <Root /> to my Output.xml, though.

mysource.xml

<?xml version="1.0" encoding="UTF-8" ?>
<products>
    <product>
        <Id>
            <![CDATA[122854]]>
        </Id>
        <Type>
            <![CDATA[restaurant]]>
        </Type>
        <features>
            <wifi>
                <![CDATA[included]]>
            </wifi>
        </features>         
    </product>
</products>

So to summarize my question: how can I read individual product nodes as-is from "mysource.xml" without loading the full file into memory?

UPDATE 2

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                Dim el As XElement = TryCast(XElement.ReadFrom(reader), XElement)
                If el IsNot Nothing Then Yield el
            Else
                reader.Read()
            End If
        End While
    End Using
End Function            


Private Shared Sub Main()
    ' The iterator yields XElement (LINQ to XML), not XmlElement (DOM)
    Dim products As IEnumerable(Of XElement) = StreamCustomerItem("source.xml")

    ' Loop through each `product` element here
    For Each product As XElement In products
        Console.WriteLine(product)
    Next
End Sub

My full test file via Onion Share (use TOR browser to download):

http://jkntfybog2s5cc754sn7mujvyaawdqxd4q5imss66x3hsos34rrbjrid.onion Key: YLTDQSDHTBWGDGQ6FIADTN2K7GFOFT5R7SFKWKTDER3WETD7EMKA


Solution

  • The important thing is to never load the whole file, but to "stream" everything (in the general sense: bytes, characters, XML nodes, etc.) from end to end (i.e., from server to client here).

    For network bytes, this means you must use a raw Stream object.

    For XML nodes, this means you can use an XmlReader (not an XmlDocument, which loads a full document object model from a stream). In this case, you can use an XmlTextReader, which "Represents a reader that provides fast, non-cached, forward-only access to XML data".

    Here is a piece of C# code (easily translated to VB.NET) that does this while still building a small intermediate XML document for each product in the multi-gigabyte file, using the XmlReader methods ReadInnerXml and/or ReadOuterXml:

    var req = (HttpWebRequest)WebRequest.Create("https://www.yourserver.com/spotahome_1.xml");
    using (var resp = req.GetResponse())
    {
        using (var stream = resp.GetResponseStream())
        {
            using (var xml = new XmlTextReader(stream))
            {
                var count = 0;
                // Advance manually: ReadOuterXml already moves the reader past
                // </product>, so an extra Read() could skip an adjacent product.
                while (!xml.EOF)
                {
                    if (xml.NodeType == XmlNodeType.Element && xml.Name == "product")
                    {
                        // using XmlDocument is ok here since we know
                        // a product is not too big
                        // but we could continue with the reader too
                        var product = new XmlDocument();
                        product.LoadXml(xml.ReadOuterXml());
                        Console.WriteLine(count++);
                    }
                    else
                    {
                        xml.Read();
                    }
                }
            }
        }
    }
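
    Since the question is in VB.NET, here is a minimal sketch of how the same idea might be translated. It is not a line-by-line port: it assumes XmlReader.Create and XNode.ReadFrom (LINQ to XML) in place of XmlTextReader and XmlDocument, and the names StreamingExample and StreamProducts are made up for illustration. The URL and the <Id> child are taken from the sample in the question.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Net
    Imports System.Xml
    Imports System.Xml.Linq

    Module StreamingExample
        ' Hypothetical helper: yields each <product> as a small standalone
        ' XElement, so only one product is held in memory at a time.
        Iterator Function StreamProducts(source As Stream) As IEnumerable(Of XElement)
            Using reader As XmlReader = XmlReader.Create(source)
                reader.MoveToContent()
                While Not reader.EOF
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                        ' ReadFrom consumes the whole element and leaves the reader
                        ' on the node that follows it, so Read() is not called here.
                        Yield DirectCast(XNode.ReadFrom(reader), XElement)
                    Else
                        reader.Read()
                    End If
                End While
            End Using
        End Function

        Sub Main()
            Dim req As HttpWebRequest =
                DirectCast(WebRequest.Create("https://www.example.com/mysource.xml"), HttpWebRequest)
            Using resp As WebResponse = req.GetResponse()
                ' Pass the raw response stream on; never call ReadToEnd() on it.
                Using body As Stream = resp.GetResponseStream()
                    For Each product As XElement In StreamProducts(body)
                        ' Assumes every product has an <Id> child, as in the sample file.
                        Console.WriteLine(product.Element("Id").Value.Trim())
                    Next
                End Using
            End Using
        End Sub
    End Module
    ```

    The loop only calls Read() when the current node is not a product, because reading a whole element already advances the reader past its end tag. Each yielded XElement is a complete copy of one product node, so it can be written out as-is or queried with LINQ to XML.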
    

    PS: Ideally, you could use async / await code with the async counterpart methods ReadInnerXmlAsync / ReadOuterXmlAsync, but that is another story and easy to set up.
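
    For completeness, the async variant mentioned in the PS might look roughly like the sketch below. It assumes a reader created through XmlReader.Create with XmlReaderSettings.Async set to True (the *Async reader methods require this) and uses HttpClient for the download. The names AsyncStreamingExample, CountProductsAsync and RunAsync are hypothetical, and the blocking call in Main is only suitable for a console test, not for ASP.NET request code.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Net.Http
    Imports System.Threading.Tasks
    Imports System.Xml

    Module AsyncStreamingExample
        ' Hypothetical helper: counts <product> elements without buffering
        ' the whole document.
        Async Function CountProductsAsync(source As Stream) As Task(Of Integer)
            Dim settings As New XmlReaderSettings()
            settings.Async = True ' required before calling any *Async reader method

            Dim count As Integer = 0
            Using reader As XmlReader = XmlReader.Create(source, settings)
                Await reader.MoveToContentAsync()
                While Not reader.EOF
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                        ' Holds the XML of a single product only; process it here.
                        Dim productXml As String = Await reader.ReadOuterXmlAsync()
                        count += 1
                    Else
                        Await reader.ReadAsync()
                    End If
                End While
            End Using
            Return count
        End Function

        Async Function RunAsync() As Task
            Using client As New HttpClient()
                ' GetStreamAsync hands back the response body as a stream
                ' instead of loading it into a string.
                Using body As Stream = Await client.GetStreamAsync("https://www.example.com/mysource.xml")
                    Console.WriteLine(Await CountProductsAsync(body))
                End Using
            End Using
        End Function

        Sub Main()
            ' Console-only shortcut for running the async code to completion.
            RunAsync().GetAwaiter().GetResult()
        End Sub
    End Module
    ```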