c++linuxlinux-kernelsystemtapapache-traffic-server

why does copy_to_user spend several hundreds of milliseconds?


I have some linux servers running apache trafficserver and have found about 1/20's requests take several milliseconds which is much longer than expected(should be 1-2ms).

After tracing by systemtap, I'v found the time is spent in linux kernel copy_to_user() call. (accept->inet_csk_accept->move_addr_to_user->copy_to_user)

The server load is quit low (<100qps) and copy_to_user() just copy 16 bytes' data(struct sockaddr) but spend hundreds of millisecond.

As I'm quit new to systemtap and kernel trace technique, I can't investigate further reason. I've checked cpu usage, swap usage

Thanks for any advice.

Hardware:

Software:


Update:

@employee of the month thanks for you advise, after add some probe on __do_page_fault, I found the time was spent on __do_page_fault -> down_read(&mm->mmap_sem);

    [1495677639202]copy_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639202] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639314]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677639314] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677641329]do page fault done:24145 [ACCEPT 0:8080] 622
    [1495677641329]copy_to_user done:24145 [ACCEPT 0:8080] 622
    [1495677641329] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 622

@Ricardo Biehl Pasquali thanks for your advise. Apache trafficserver read/write to hdd in seperate threads. trafficserver have 1 thread to accept connections, 88(8 threads per hdd)threads to do blocking read/write(cache content), but i don't understand why blocking read/write in other threads could lead to high latency of __do_page_fault().

    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda               0.00    16.00    0.00    4.00     0.00   160.00    40.00     0.05   13.50    0.00   13.50   7.75   3.10
    sdi               0.00     0.00    0.00    0.00     0.00     0.00      0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdj               0.00     0.00    6.00    0.00  4326.00     0.00   721.00     0.06   10.50   10.50    0.00   6.83   4.10
    sdc               0.00     0.00    2.00    0.00  1472.00     0.00   736.00     0.04   18.50   18.50    0.00   9.50   1.90
    sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdg               0.00     0.00   14.00    0.00 10464.00     0.00   747.43     0.04    2.71    2.71    0.00   1.79   2.50

Update:

Problem solved! The root cause is trafficserver periodly call system() to backup file in an ancient plugin we wrote.


Solution

  • Well, the reasonable explanation is that you are taking a page fault. One possible reason would be that the server forks/exits depending on how the load changes and the new child has to fault the page, but as a minor fault it should not take long unless there is memory shortage.

    Another one is there is memory shortage and you are in fact swapping.

    Either way, given you have come this far I don't see what's the problem going further. You want to put the probe on __do_page_fault and continue from there.