I think this is going to be one of those simple-when-you-see-it problems, but it has got me baffled.
[STOP PRESS: I was right. Solution was found. See the answers.]
I am using Python's unittest framework to test a multi-threaded app. Nice and straight forward - I have 5 or so worker threads monitoring a common queue, and a single producer thread making work-items for them. The producer thread is being triggered by a test-case.
In this test, only one task is being put on the queue. The processing it does is in the test is just a stub for the real processing, so the worker thread does a 5 second-sleep to simulate the elapsed time before the task will really be done, and the thread will be ready to get another task.
To the snippet of code is:
logging.info("Sleep starting")
time.sleep(5)
logging.info("Waking up")
Now the weird part. I see the "Sleep starting" log message, but not the Waking up message. The program locks up and doesn't respond to Keyboard Interrupt (CTRL+C). CPU load is very low.
I see the same problem in Windows and Ubuntu (Python 2.6.2).
I have pondered if an exception is occurring and being hidden, so I add "print 1/0" between the first and second line - I see the Division By Zero error being raised. I move it to after the sleep, and I never see the message.
I figured "Okay, maybe the other thread is trying to log something very very large at the same time, and it is still buffering. What is it doing?"
Well, by this time, the test has returned to the unittest, where it is pausing waiting for the thread to get going before testing the system's state.
logging.info("Test sleep starting")
time.sleep(0.25)
logging.info("Test waking up")
Wow, that looks familiar. It is freezing in exactly the same way! The first log message is appearing, the second isn't.
I have recently done a significant rewrite of the unit so I can't claim "I didn't touch anything", but I can't see anything untoward in my changes.
Suspicious areas:
I am including using Threading.Lock (because I don't know how to reason about GIL's safety, so I stick to what I know. I see nothing "deadlocky" about my code.
I am new to Python's unittest framework. Is there something it does with redirecting logging or similar that might simulate these symptoms?
No, I haven't substituted a non-standard time module!
What would prevent a thread from waking up? What else have I missed?
Sigh.
Worker Thread #1 is sleeping, and waking up afterwards. It is then going to log the wake message, and is blocked. Only one thread can be logging at a time.
UnitTest Thread is sleeping, and waking up afterwards. It is then going to log the wake message, and is blocked. Only one thread can be logging at a time.
Worker-Thread-Not-Previously-Mentioned-In-The-Question #2 was quietly finishing the processing the PREVIOUS item in the queue, while the first Worker Thread was sleeping. It got to a log statement. One of the parameters was an object, and str() was implicitly called. The str() function on that object had a bug; it deadlocked when it accessed some of its data members. The deadlock occured while being processed by the logging function, thus keeping the logging thread-lock, and making it appear like the other threads never woke up.
The division by zero test didn't make a difference, because the result of it was an attempt to log.