Tags: jvm, websphere, zos, ibm-was, ibm-jvm

WebSphere Application Hang


If a WebSphere application is hung on z/OS, what steps should be taken to find the cause?

So far, I have taken a heap dump, a javacore, and a system dump.

None of the threads are deadlocked, there are no memory issues, and there does not seem to be an excessive number of threads (only ~50, which is fairly normal).

The entire application is inaccessible. By that I mean, any attempt to connect to its web pages hangs and times out.

What could be causing this? I am considering a high-CPU event, but I am not sure how to check for that retroactively.
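For reference, one way I think I could spot-check CPU the next time this happens (the job name WASSRV1S is a placeholder for my servant region) is to display the address space twice and compare the accumulated CPU time:

    D A,WASSRV1S
    (wait ~30 seconds, then issue it again)
    D A,WASSRV1S

The DISPLAY ACTIVE output includes CT=, the accumulated CPU time; if it climbs rapidly between the two displays, something is looping rather than waiting. For a truly retroactive check, I assume the SMF type 30 records or RMF reports covering the interval would show the CPU consumption.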

I get an error message similar to this one 30 times:

BBOO0221W: WSVR0605W: Thread "WebSphere WLM Dispatch Thread t=008b74f8" (00000075) has been active for 720962 milliseconds and may be hung.  There is/are 30 thread(s) in total in the server that may be hung.
    at sun.reflect.GeneratedMethodAccessor617.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:611)
    at com.sun.faces.el.MethodBindingImpl.invoke(MethodBindingImpl.java:126)
    at com.sun.faces.application.ActionListenerImpl.processAction(ActionListenerImpl.java:72)
    at javax.faces.component.UICommand.broadcast(UICommand.java:312)
    at javax.faces.component.UIViewRoot.broadcastEvents(UIViewRoot.java:267)
    at javax.faces.component.UIViewRoot.processApplication(UIViewRoot.java:381)
    at com.sun.faces.lifecycle.InvokeApplicationPhase.execute(InvokeApplicationPhase.java:75)
    at com.sun.faces.lifecycle.LifecycleImpl.phase(LifecycleImpl.java:200)
    at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:90)
    at javax.faces.webapp.FacesServlet.service(FacesServlet.java:197)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1230)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:779)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:478)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:97)
    at org.apache.myfaces.webapp.filter.ExtensionsFilter.doFilter(ExtensionsFilter.java:97)
    at com.ibm.ws.webcontainer.filter.FilterInstanceWrapper.doFilter(FilterInstanceWrapper.java:195)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:91)
(truncated)

The "hung" thread themselves don't seem to have any real pattern, they are just normal activity, that should not hang.


Solution

  • One of the best features of z/OS is its diagnostic capabilities - you never have to guess; it's pretty much always possible to find out EXACTLY what's going on.

    Personally, I'd start with your system dump and IPCS. Of course, this is a pretty rare skill these days, so if this isn't your thing, the first step might be to find someone with good dump-reading skills. If you're totally stuck, there's a good introduction here.

    Start by making sure that your dump has what you think it has...a good percentage of system dumps end up including the wrong address spaces or the wrong data areas, and these are pretty much useless. If you're in this situation, it's time to design exactly the dump options you want so you capture what you need next time the problem occurs.
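    As a sketch of what that might look like (the job name, dump title, and SDATA options here are illustrative - tailor them to your configuration), a console-initiated dump that explicitly names the servant region and requests region storage plus trace data:

        DUMP COMM=(WAS SERVANT HANG)
        R nn,JOBNAME=(WASSRV1S),SDATA=(RGN,TRT,CSA,SQA,GRSQ,SUM),END

    And once you have the dump, a quick sanity check in IPCS that it contains what you asked for:

        SETDEF DSNAME('SYS1.DUMP01')   <-- point IPCS at the dump data set
        STATUS SYSTEM                  <-- dump title, time, and system data
        SELECT ALL                     <-- list the ASIDs/job names captured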

    You'll get a good picture of what's going on inside WebSphere by using the WebSphere IPCS dump formatter - an overview is here.

    In the WebSphere address space(s), there will be many threads, and these will have z/OS TCBs (task control blocks). Go through every last TCB (in IPCS, the SUMM FORMAT command or equivalent, as sketched below) and work out whether it's running (potentially looping) or waiting. My bet would be that the threads are waiting on something or other...a lock, an outside signal, a call to DB2, some vendor software, etc. A good goal is to make a list of all the threads and exactly what each one is waiting for.
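    For example (the ASID is a placeholder - take the real one from the SELECT ALL output):

        SUMMARY FORMAT ASID(X'00C3')

    For each TCB this formats the RB chain with the PSW and registers, which is usually enough to classify the task as running or waiting.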

    Largely, finding a wait reason is about going through the TCB/RB structures to find the PSW and registers at the time of the wait...this tells you the module that's waiting, and from there you can most likely figure out what's going on.
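    As a small sketch of that step: given a PSW instruction address (the value below is made up), IPCS can name the load module that contains it:

        WHERE 0A1B2C3D.   <-- identifies the module/CSECT owning this address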

    If the system hadn't been hung for long before you took your dump, you might also check the system trace table. It gives you a history of what the address space has been doing, but the trace wraps quickly, so after a long hang there may not be much relevant data left.
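    A sketch for pulling one address space's entries with readable timestamps (the ASID is again a placeholder):

        SYSTRACE ASID(X'00C3') TIME(LOCAL)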

    Also, since WebSphere is a giant UNIX System Services application, don't forget to look at the OMVSDATA report if you have the relevant data in your dump.
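    For instance, assuming the OMVS data areas made it into your dump:

        OMVSDATA PROCESS SUMMARY   <-- one line per UNIX process
        OMVSDATA PROCESS DETAIL    <-- threads, signals, and wait state per process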

    Don't forget that you can always reach out to IBM support - you spend a ton of money on software like WebSphere, so having them explain what's going on is certainly one of the better ideas.

    Good luck!