ssasmdxadomd.net

MS Olap Mdx query Execution is hanging intermittently


Recently one issue is really challenging me that I already wasted almost 15 days to figure out root cause. But unfortunately, so far there is no luck.

Here are the details,

As part of the feature we provide in our application to our customers, customers can queue up list of jobs and let it run in background and let customer know once the job they queued finished.

Each job supposed to execute ~100K mdx queries to complete with success result. But behind the scene, our engine divide that 100K queries into smaller chunks and create jobs for each chunk with less amount of queries. In this case small jobs are dealing with 1000 queries. With this rough numbers, I can tell that engine is creating 100 additional jobs. Our engine then starts executing those small chunks one by one.

And in each job execution it runs RunAndParseQueryResult method in the following code.

class Snippet
{
    public static void RunAndParseQueryResult()
    {
        DataTable result = new DataTable();
        IDbConnection conn = ConnectionFactory.CreateConnection();
        conn.Open();
        foreach (string mdxQuery in queryList)
        {
              ExecuteMdxQuery(IDbConnection connection, DataTable result, string mdxQuery)
        }
        conn.Close();
        conn.Dispose();
    }

    private void ExecuteMdxQuery(IDbConnection connection, DataTable result, string mdxQuery)
    {
        var conn = connection as AdomdConnection;
        Trace.TraceInformation("Log 1");
        if (conn != null)
        {
            Trace.TraceInformation("Log 2");
            using (AdomdCommand command = new AdomdCommand(mdxFile.Mdx, conn) { CommandTimeout = 5000 })
            {
                Trace.TraceInformation("Log 3");
                using (AdomdDataAdapter adapter = new AdomdDataAdapter(command))
                {
                    Trace.TraceInformation("Log 4");
                    try
                    {
                        DataTable dt = new DataTable();
                        Trace.TraceInformation("Log 5");
                        dt.BeginLoadData();
                        Trace.TraceInformation("Log 6");
                        adapter.Fill(dt);
                        Trace.TraceInformation("Log 7");
                        dt.EndLoadData();
                        Trace.TraceInformation("Log 8");

                        if (dt.Rows.Any())
                        {
                            Trace.TraceInformation("Log 9");
                            ParseQueryResult(result, mdxFile, dt);
                            Trace.TraceInformation("Log 10");
                        }
                    }
                    catch (Exception exception)
                    {
                        Trace.TraceInformation("Log 11");
                        throw;
                    }
                }
            }
        }
        else
        {
            throw new Exception("Given cube connection can not be casted to AdoMdConnection");
        }
    }
}

As you can see, RunAndParseQueryResult method opens connection and pass it to ExecuteMdxQuery method along with mdxQuery loop variable.

In the ExecuteMdxQuery method, almost after every line , I have put a log using Trace.TraceInformation method.

What happens is at certain iteration ExecuteMdxQuery method stops at

 adapter.Fill(dt);

method. I am figuring out this by looking at the logs. Because if it was executed successfully the I would have seen log like "Log 7" or if it failed executing it I should be able to see "Log 11". But none of those lines seems to be run.

When I run the query manually it is working fine. The query is definitely not long running query, and even it was, we have specified the timeout 5000 second in

AdomdCommand command = new AdomdCommand(mdxFile.Mdx, conn) { CommandTimeout = 5000 }

code and it suppose to throw an TimeOutException normally. But it is not.

Any opinion why this could be ?

Thank you in advance.


Solution

  • After a lot of effort and being patience for long time, we were able to understand the root cause of this issue with help of Microsoft support.

    In the end, issue turned out to be related to CommandTimeout property behavior in AdoMD.Net.

    CommandTimeout setting does not work as expected in case of network failure.

    To identify the issue, we had to create the dump of client application at the time application hangs and no more executing query.

    With WinDbg tool, we found out that, at the time when hang is happening application is waiting on some methods which is under System.Net namespace. Following is the stack trace we got out of dump file.

    a15e18d548 7ffa07f306fa [InlinedCallFrame: 000000a15e18d548] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
    a15e18d548 7ff99b7757af [InlinedCallFrame: 000000a15e18d548] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
    a15e18d520 7ff99b7757af DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
    a15e18d5f0 7ff99b7761be System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
    a15e18d690 7ff99b775d35 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)
    a15e18d710 7ff99b775be9 System.Net.FixedSizeReader.ReadPacket(Byte[], Int32, Int32)
    a15e18d760 7ff99b782839 System.Net.Security._SslStream.StartFrameHeader(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
    a15e18d7d0 7ff99b78237a System.Net.Security._SslStream.StartReading(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
    a15e18d850 7ff99b781fab System.Net.Security._SslStream.ProcessRead(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
    a15e18d8d0 7ff99b781d6e System.Net.TlsStream.Read(Byte[], Int32, Int32)
    a15e18d960 7ff99b781c93 System.Net.PooledStream.Read(Byte[], Int32, Int32)
    a15e18d990 7ff99b781499 System.Net.Connection.SyncRead(System.Net.HttpWebRequest, Boolean, Boolean)
    a15e18da20 7ff99b76d694 System.Net.ConnectStream.WriteHeaders(Boolean)
    a15e18dad0 7ff99b76bb44 System.Net.HttpWebRequest.EndSubmitRequest()
    a15e18db20 7ff99b762e58 System.Net.Connection.SubmitRequest(System.Net.HttpWebRequest, Boolean)
    a15e18dbb0 7ff99b76120e System.Net.ServicePoint.SubmitRequest(System.Net.HttpWebRequest, System.String)
    a15e18dc20 7ff99b60eda4 System.Net.HttpWebRequest.SubmitRequest(System.Net.ServicePoint)
    a15e18dc80 7ff99b60ded1 System.Net.HttpWebRequest.GetResponse()
    a15e18dd30 7ff99d474f7c Microsoft.AnalysisServices.AdomdClient.HttpStream.GetResponseStream()
    a15e18de20 7ff99d474ac0 Microsoft.AnalysisServices.AdomdClient.HttpStream.GetResponseDataType()
    a15e18de90 7ff99d4743cc Microsoft.AnalysisServices.AdomdClient.CompressedStream.GetResponseDataType()
    a15e18def0 7ff99d470745 Microsoft.AnalysisServices.AdomdClient.XmlaClient.EndRequest()
    a15e18df70 7ff99d4702f7 Microsoft.AnalysisServices.AdomdClient.XmlaClient.SendMessage(Boolean, Boolean, Boolean)
    a15e18dfe0 7ff99d558f0a Microsoft.AnalysisServices.AdomdClient.XmlaClient.ExecuteStatement(System.String, System.Collections.IDictionary, System.Collections.IDictionary, System.Data.IDataParameterCollection, Boolean)
    a15e18e040 7ff99d558037 Microsoft.AnalysisServices.AdomdClient.AdomdConnection+XmlaClientProvider.Microsoft.AnalysisServices.AdomdClient.IExecuteProvider.ExecuteTabular(System.Data.CommandBehavior, Microsoft.AnalysisServices.AdomdClient.ICommandContentProvider, Microsoft.AnalysisServices.AdomdClient.AdomdPropertyCollection, System.Data.IDataParameterCollection)
    a15e18e0e0 7ff99d557b28 Microsoft.AnalysisServices.AdomdClient.AdomdCommand.ExecuteReader(System.Data.CommandBehavior)
    

    As you can see from the stack traces hang is happening in Receive methods in

    System.Net.Sockets.Socket
    

    namespace.

    After this we decided to capture network traces in both Analysis Service and client application side. When we capture traces on both end, interestingly, hang issue is no more happening. Then we only captured the network traces on SSAS server.

    NetMon Traces ScreenShot

    As you can see from above screenshot, SSAS tried re-transmitting the response 9 times and finally timed-out and disconnected the TCP communication.

    This lead us to think that client loosing the network connection for sometimes. While client is trying to re-establish the network connection, result has been prepared already on SSAS side and SSAS is trying to transfer results to client. As seen from network traces, it tried 9 times in my example. In the end it gave up more trial. What is happening is that, During that 9 times trial, if client establishes the network connections back, then everything is moving forward as expected, but if client establishes the connection after retry ends, then client continue to wait forever although server has already attempted to respond to client. At this point, as a client, we would expect it to at least respect to CommandTimeout property. And fail the execution when CommandTimeout is reached. But it does not do that. This issue has been submitted to product team already.

    As a workaround we realized that, there is another connection string property

    Timeout : Specifies how long (in seconds) the client library waits for a command to complete before generating an error. link

    When we set the Timeout in case of network failure, connection timeout is happening and hence query execution is failing. Client is never hanging. Although, connection loss is not something we would like to have in case of command timeout, but this is what we have until CommandTimout property behavior is fixed by Product team.

    Thank you.