Tags: azure-data-lake, u-sql, extractor

OutOfMemory on custom extractor


I have stitched a lot of small XML files into one file, and then made a custom extractor that returns one row per stitched file, each holding a byte array with that file's contents.

  1. Run on remote/master
    • Run it on one file (gzipped, 11 MB) and it works fine.
    • Run it on more than one file and I get a System.OutOfMemoryException.
  2. Run on local/master
    • Run it on one or more files (gzipped, 500+ MB) and it works fine.

The extractor looks like this:

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();

        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);

        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;

        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml));
            yield return output.AsReadOnly();
        }

        yield break;
    }
}

and the error message looks like this:

==== Caught exception System.OutOfMemoryException

at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924

So what am I doing wrong? And how do I debug this on remote?

Thanks!


Solution

  • Unfortunately, local runs do not enforce the memory allocations, so you would have to check memory usage during local vertex debugging yourself.

    Looking at your code above, I see that you are loading the XML documents into a DOM. Note that an XML DOM can explode the data size relative to the string representation by a factor of 10 or more (I have seen anywhere from 2x to 12x in my time as the resident SQL XML guru).

    Each UDO today gets only 1/2 GB of RAM to play with, so my assumption is that your XML DOM document(s) grow beyond that.

    The normal recommendation is to use the XmlReader interface (there is a reader-based extractor in the samples on http://usql.io as well) and scan through the document(s) to find the information you are looking for; a minimal streaming sketch follows at the end of this answer.

    If your documents are always small enough (e.g., <20 MB), you may want to make sure that you release the memory of the other documents and operate on one document at a time; the second sketch below shows that per-document variant.

    We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.
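    Here is a minimal sketch of what such a streaming extractor could look like, assuming the stitched file parses as a single well-formed document with one root element (whatever your UtilsXml.CleanXml step was fixing would still have to be fixed, ideally as a streaming pass, before this point). The column index 1 and the ASCII encoding are simply carried over from the question; this is an illustration, not the sample extractor from http://usql.io.

    // A minimal streaming sketch: walk the input with XmlReader and materialize
    // only one stitched file at a time, never the whole input as a DOM.
    using System.Collections.Generic;
    using System.Text;
    using System.Xml;
    using Microsoft.Analytics.Interfaces;

    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]   // gzipped input cannot be split
    public class StreamingXmlExtractor : IExtractor
    {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            var settings = new XmlReaderSettings { IgnoreWhitespace = true };

            using (var reader = XmlReader.Create(input.BaseStream, settings))
            {
                reader.MoveToContent();      // position on the root element
                reader.ReadStartElement();   // step inside it, to the first stitched file

                while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                {
                    if (reader.NodeType != XmlNodeType.Element)
                    {
                        reader.Read();       // skip whitespace, comments, etc.
                        continue;
                    }

                    // ReadOuterXml materializes only the current child element (one
                    // stitched file) and advances the reader past it.
                    var oneDocument = reader.ReadOuterXml();
                    output.Set<byte[]>(1, Encoding.ASCII.GetBytes(oneDocument));
                    yield return output.AsReadOnly();
                }
            }
        }
    }

    With this shape, peak memory per vertex is roughly bounded by the largest single stitched file rather than by the whole input.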
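    And the per-document variant: if you still need DOM features (XPath etc.) for each stitched file, the same reader loop can build a short-lived XmlDocument per file, so each small DOM becomes collectible before the next one is parsed. The helper class and method names below are made up for illustration.

    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using System.Xml;

    // Hypothetical helper (not from the U-SQL samples): splits a stitched stream
    // into per-file byte arrays while holding at most one small DOM at a time.
    public static class StitchedXmlSplitter
    {
        public static IEnumerable<byte[]> SplitIntoDocuments(Stream stitched)
        {
            var settings = new XmlReaderSettings { IgnoreWhitespace = true };

            using (var reader = XmlReader.Create(stitched, settings))
            {
                reader.MoveToContent();
                reader.ReadStartElement();   // step inside the root element

                while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                {
                    if (reader.NodeType != XmlNodeType.Element)
                    {
                        reader.Read();
                        continue;
                    }

                    // DOM for one stitched file only - use it here if you need
                    // XPath or other DOM features per file.
                    var doc = new XmlDocument();
                    doc.LoadXml(reader.ReadOuterXml());

                    yield return Encoding.ASCII.GetBytes(doc.DocumentElement.OuterXml);
                    // doc is unreferenced past this point, so the GC can reclaim it
                    // before the next file is parsed.
                }
            }
        }
    }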