I'm having trouble debugging random hangs of exactly 11ms in my polling code. This is happening across multiple servers with different hardware configs and operating systems (and no shared resource for the code in question), so I'm inclined to believe it's in my code.
Essentially I have a bunch of in-flight TCP requests that I have as futures in a list. My polling function is checking through them and sending results over a channel. It works fine most of the time, but occasionally will pause for exactly 11ms, and then continue.
Normally the Waker is scheduled by the underlying TCP connection, as well as the channel that pushes requests into the queue. But for debugging purposes I've added a busy-loop that always schedules the Waker immediately.
Using the busy loop, I'd expect a constant stream of poll() calls, and normally that is the case. However, when one of these hiccups happens, I see what looks like the thread getting paused/preempted and then restarted 11ms later.
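For reference, the busy-loop hack looks roughly like the sketch below. This is a minimal reconstruction rather than the real code (BusyPoll is just an illustrative name); it only shows the idea of re-scheduling the Waker on every Pending so the executor never parks:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Debug-only wrapper: wake immediately on every Pending so the executor
// keeps calling poll() in a tight loop instead of parking the thread.
struct BusyPoll<F>(F);

impl<F: Future + Unpin> Future for BusyPoll<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match Pin::new(&mut self.get_mut().0).poll(cx) {
            Poll::Ready(out) => Poll::Ready(out),
            Poll::Pending => {
                // Re-schedule ourselves right away -- this is the
                // "always schedule the Waker immediately" hack.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
}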
A few examples of places where the pause has happened that I wouldn't expect it to:
2023-09-26T23:29:54.354627Z WARN database::backends: checking pending: 1
2023-09-26T23:29:54.354637Z WARN database::backends: poll future... Pending
2023-09-26T23:29:54.365740Z WARN database::backends: done (not ready!) future
Which comes from the following code:
let fut_polled = fut.poll_unpin(cx);
warn!("poll future... {:?}", fut_polled);
if let Poll::Ready((x, y, z)) = fut_polled {
    warn!("future ready");
    // ... snipped ...
}
warn!("done (not ready!) future");
You can see it successfully polled the underlying future, and then, while doing a simple if-let on the Poll result, the thread gets suspended for 11ms. I would expect "future ready" or "done (not ready!) future" to be logged immediately.
Another example is:
2023-09-26T23:25:07.669470Z WARN database::backends: poll
2023-09-26T23:25:07.680708Z WARN database::backends: poll loop!
Which comes from:
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
    warn!("poll");
    let pin = self.get_mut();
    loop {
        warn!("poll loop!");
Which, again, I wouldn't expect any delay on: self.get_mut() just returns the underlying pointer out of the Pin, which shouldn't take any time.
This happens often enough to be easily reproducible (every few seconds under light load), but rare enough that strace generates a LOT of data that's difficult to sift through. This is also part of a much larger application, and I haven't had any luck making a small reproducible case.
The thread running the poll loop is started as such:
let _ = std::thread::Builder::new()
    .name("database-thread".to_string())
    .spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to create tokio runtime");
        rt.block_on(async move { handler.await });
    })
    .expect("failed to spawn database thread");
The server itself isn't heavily loaded (plenty of spare RAM, low CPU usage, plenty of cores)...
How do I go about debugging what's pausing my thread, and why it's always 11ms, even across different servers?
Edit: Adding some strace data...
[pid 4056498] 20:34:14.995277 write(1, "\33[2m2023-09-27T00:34:14.995272Z\33"..., 1402023-09-27T00:34:14.995272Z WARN database::backends: poll future... Pending
<unfinished ...>
[pid 4056497] 20:34:14.995283 write(11, "{\"id\":[snipped]"..., 192 <unfinished ...>
[pid 4056498] 20:34:14.995289 <... write resumed>) = 140 <0.000007>
[pid 4056497] 20:34:14.995293 <... write resumed>) = 192 <0.000006>
[pid 4056498] 20:34:14.995298 write(1, "\33[2m2023-09-27T00:34:14.995294Z\33"..., 1422023-09-27T00:34:14.995294Z WARN database::backends: done (not ready!) future
<unfinished ...>
[pid 4056501] 20:34:14.995304 <... epoll_wait resumed>[{events=EPOLLOUT, data={u32=2752585088, u64=2699992046976}}], 1024, 3) = 1 <0.000574>
[pid 4056497] 20:34:14.995309 epoll_wait(20, <unfinished ...>
[pid 4056501] 20:34:14.995315 epoll_wait(3, <unfinished ...>
[pid 4056498] 20:34:15.006650 <... write resumed>) = 142 <0.000016>
[pid 4056501] 20:34:15.006668 <... epoll_wait resumed>[{events=EPOLLIN|EPOLLOUT, data={u32=2752585088, u64=2699992046976}}], 1024, 2) = 1 <0.011350>
[pid 4056501] 20:34:15.006689 write(22, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 4056498] 20:34:15.006696 write(1, "\33[2m2023-09-27T00:34:15.006681Z\33"..., 135 <unfinished ...>
2023-09-27T00:34:15.006681Z WARN database::backends: poll loop finished
You can see the same lines as above ("done (not ready!) future"); the next line after it is just on the other side of a loop end. In pseudocode:
loop {
    log("poll future");
    if (ready) {
        ...
    } else {
        log("done (not ready!) future");
    }
}
log("poll loop finished");
So in between that else log and the end-of-loop log, there are two epoll_wait() calls and a pause of 11ms.
Edit: When the epoll_wait() happens, it's specifically waiting on anon_inode:[eventpoll], for example:
[pid 4057391] 20:53:44.862936 epoll_wait(19<anon_inode:[eventpoll]>, <unfinished ...>
[pid 4057360] 20:53:44.862949 <... epoll_wait resumed>[{events=EPOLLOUT, data={u32=1511071104, u64=2660095827328}}], 1024, 2) = 1 <0.011111>
What's the next step in debugging this?
You can see it successfully polled the underlying future, and then, while doing a simple if-let on the Poll result, the thread gets suspended for 11ms. I would expect "future ready" or "done (not ready!) future" to be logged immediately.
The if let is extremely unlikely to be causing the delay here, especially because your warn! statement already contains the result returned by poll_unpin. Therefore, the delay most likely occurs in the logging statement, which in all likelihood takes far longer than anything else there.
If you want to see how your program behaves during normal execution, it might be better to collect some statistics instead. For example, you could track the time between iterations, count how often it exceeds 10ms, and output the count once in a while (see the sketch below). This also rules out the possibility that the problems you're observing are caused by the logging statements themselves, and that isn't a stretch: logging on every iteration of a busy loop is asking for trouble. When is that output actually committed to the logs? Which thread does it? Does that thread need to acquire a lock before it can commit that much data?
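A minimal sketch of that idea, assuming a single-threaded poll loop (the LoopStats name, the 10ms threshold, and the reporting interval are just illustrative choices):

use std::time::{Duration, Instant};

// Cheap per-iteration bookkeeping instead of per-iteration logging.
struct LoopStats {
    last: Instant,
    iterations: u64,
    slow: u64,
    worst: Duration,
}

impl LoopStats {
    fn new() -> Self {
        Self { last: Instant::now(), iterations: 0, slow: 0, worst: Duration::ZERO }
    }

    // Call this once at the top of every poll iteration.
    fn tick(&mut self) {
        let now = Instant::now();
        let gap = now - self.last;
        self.last = now;
        self.iterations += 1;
        if gap > Duration::from_millis(10) {
            self.slow += 1;
        }
        if gap > self.worst {
            self.worst = gap;
        }
        // Report rarely so the reporting itself stays off the hot path.
        if self.iterations % 1_000_000 == 0 {
            eprintln!(
                "iterations: {}, gaps > 10ms: {}, worst gap: {:?}",
                self.iterations, self.slow, self.worst
            );
        }
    }
}

Calling tick() at the top of each iteration in place of the per-iteration warn! calls keeps the measurement itself cheap, so it can't be the source of the very stall you're measuring.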
Obviously it could be, but 11ms seems like a long time to be off-core. These are (reasonably) beefy boxes, 12-16 cores, and this happens plenty often when they're under low load. I'd expect a core to be available at almost all times...
Generally speaking, 11ms is not that long if there are other threads that want to run on the same core; see this question. Especially if you care about latency, you might want to tweak the scheduler settings to avoid off-core time as much as possible.
Your server might seem idle, but there could be outside events that trigger lots of things at the same time, making it busier for a brief moment. For example, multiple requests could come in at once, or those requests could spawn multiple threads. Besides, moving to a different core carries the cost of a lot of cache misses, so the scheduler often delays the migration.
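One concrete option along those lines, purely as a sketch: pin the polling thread to a dedicated core so the scheduler never migrates it. This mirrors the spawn snippet from the question and assumes the third-party core_affinity crate (not something the original code uses); which core to dedicate, and whether to combine this with host-level isolation such as isolcpus, is up to you.

// Sketch: pin the database thread to one core so the OS scheduler never
// migrates it. Requires the `core_affinity` crate in Cargo.toml.
let _ = std::thread::Builder::new()
    .name("database-thread".to_string())
    .spawn(move || {
        // Pick a core to dedicate to this thread; here simply the last one.
        if let Some(core) = core_affinity::get_core_ids().and_then(|ids| ids.into_iter().last()) {
            core_affinity::set_for_current(core);
        }

        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to create tokio runtime");
        rt.block_on(async move { handler.await });
    })
    .expect("failed to spawn database thread");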
In short, here are some good next steps: