linux performance ubuntu debugging xorg

How to better diagnose which client is causing high Xorg CPU usage?


I'm having problems where the Xorg server suddenly gets very slow and its CPU usage goes to 100%. Googling suggests that the "state of the art" debugging method for this issue is to start randomly killing X clients until the problem goes away, at which point you know which client was the problematic one (good luck trying to debug the process after killing it). None of the X clients is using much CPU (at most about 1% each), and the system still has 2 GB of available RAM.

Does anybody know a better method to diagnose the problem? Basically, I'm looking for a top replacement for Xorg that would point me directly to the client process causing the most CPU usage in Xorg. I already know about xrestop, which does the same thing for Xorg memory usage, but this question is specifically about CPU usage.

Another cause for this slowdown might be running out of GPU memory, which might force bitmaps to be pushed over the PCI Express bus instead of being rendered directly from GPU memory to the display. If this shows up as CPU usage at the kernel level (e.g. in top), it might be the problem I'm seeing, but I would like to understand the cause better. As I'm writing this from the problematic system, it seems that most of the slowdown occurs in input handling or font rendering, but I don't know how to diagnose it any further.

I know it's possible to connect from another computer over ssh, attach gdb to the problematic Xorg server, and start digging, but I'm hoping for something a bit easier.

If I had the PID or even the window handle of the problematic client, figuring out the root cause would be easier (e.g. https://unix.stackexchange.com/q/5478). And I wouldn't need to kill well-behaving processes as a side effect.


Solution

  • Here's a script to figure out if one of the clients is causing the problem (did not help with my problem, though):

    #!/bin/bash
    
    WINDOW_IDS=$(xwininfo -tree -root | grep -o -P '\b0x[0-9a-f]+' | sort -u)
    
    PIDS=""
    for ID in $WINDOW_IDS
    do
        if [ "$ID" = "0x0" ]
        then
            continue
        fi
        #printf "Window %s PID=" "$ID" 
        PID=$(LC_ALL=C xprop -id "$ID" _NET_WM_PID | cut -d' ' -f3-)
        if [ "$PID" = "not found." ]
        then
        #   printf "%s\n" "(unknown)"
        #   See also: https://unix.stackexchange.com/a/84981
            true
        else
        #   printf "%s\n" "$PID"
            PIDS="$PIDS $PID"
        fi
    done
    
    PIDS=$(printf "%s\n" $PIDS | sort -u)
    
    # go through the list of processes connected to Xorg:
    
    for PID in $PIDS
    do
        # /proc/PID/cmdline is NUL-separated; turn the NULs into spaces for display
        printf "%s: %s\n" "$PID" "$(tr '\0' ' ' < "/proc/$PID/cmdline")"
        sleep 0.1s # wait for the previous line to get on the screen before stopping e.g. compositing manager 
        # Stop the process for 5 secs and let the process continue after that.
        kill -STOP "$PID" && sleep 5s && kill -CONT "$PID"
    done
    

    The idea is to stop each client in turn for 5 seconds; if that cures the problem for those 5 seconds, you've found the culprit. The script sends SIGSTOP to the target process. SIGSTOP cannot be caught or ignored, and a stopped process gets no CPU time, so it cannot send any requests to Xorg either. Note that if you kill this script in the middle, you may end up with one of your processes left in the stopped state; send it SIGCONT to resume it. If you wait for the script to finish, everything should be okay. (See also: https://unix.stackexchange.com/a/298650)
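
    If the script does get interrupted, something like the following could help find the process that was left stopped. This is only a minimal sketch: the function name list_stopped is made up for illustration, and it assumes a Linux /proc where the "State:" line of /proc/PID/status reads "T (stopped)" for a stopped process.

    ```shell
    #!/bin/bash
    # Print "PID NAME" for every process currently in the stopped state ("T").
    # list_stopped is a hypothetical helper name, not part of any existing tool.
    list_stopped() {
        local status pid state name
        for status in /proc/[0-9]*/status
        do
            pid=${status#/proc/}
            pid=${pid%/status}
            # A process may exit while we scan, so ignore read errors.
            state=$(awk '/^State:/ { print $2; exit }' "$status" 2>/dev/null)
            if [ "$state" = "T" ]
            then
                # Only the first word of the name is printed.
                name=$(awk '/^Name:/ { print $2; exit }' "$status" 2>/dev/null)
                printf "%s %s\n" "$pid" "$name"
            fi
        done
    }
    ```

    Run list_stopped and then send kill -CONT to the PID you recognize. The list also includes jobs you suspended yourself (e.g. with Ctrl-Z), so it is probably better to resume processes one by one than to resume everything it prints.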

    In my case, Xorg stayed slow no matter which client was stopped, so I guess the problem I'm seeing is an internal Xorg issue and I would need to use FlameGraphs (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) to figure out the real cause.
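
    I have not done this for Xorg yet, but a CPU flame graph of a running process could be recorded roughly like this. Treat it as a sketch under assumptions: record_flamegraph is a made-up function name, FLAMEGRAPH_DIR is an assumed location of a checkout of the FlameGraph scripts (stackcollapse-perf.pl and flamegraph.pl), and perf must be installed. Sampling Xorg will most likely require root.

    ```shell
    #!/bin/bash
    # Assumed location of the FlameGraph scripts; adjust to your checkout.
    FLAMEGRAPH_DIR="${FLAMEGRAPH_DIR:-$HOME/FlameGraph}"

    # record_flamegraph PID [SECONDS] [OUTPUT.svg]  (hypothetical helper)
    record_flamegraph() {
        local pid="$1" seconds="${2:-30}" out="${3:-flamegraph.svg}" data
        if [ -z "$pid" ] || [ ! -d "/proc/$pid" ]
        then
            echo "usage: record_flamegraph PID [SECONDS] [OUTPUT.svg]" >&2
            return 2
        fi
        if ! command -v perf >/dev/null || [ ! -x "$FLAMEGRAPH_DIR/flamegraph.pl" ]
        then
            echo "perf or the FlameGraph scripts were not found" >&2
            return 3
        fi
        data=$(mktemp) || return 1
        # Sample call stacks of the target process at 99 Hz for the given time.
        perf record -F 99 -g -p "$pid" -o "$data" -- sleep "$seconds" || return 1
        # Fold the sampled stacks and render them as an SVG flame graph.
        perf script -i "$data" \
            | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" \
            | "$FLAMEGRAPH_DIR/flamegraph.pl" > "$out"
        rm -f "$data"
    }
    ```

    For example, record_flamegraph "$(pgrep -o -x Xorg)" 30 xorg.svg, run while the slowdown is happening (the process may be called X instead of Xorg on some systems). Without debug symbols for Xorg and the graphics driver, the stacks will probably show mostly raw addresses.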