How come `readIORef` is a blocking operation

This was a complete surprise for me. Can someone explain what is the reason behind readIORef blocking, when there is an atomicModifyIORef in flight? I understand that the assumption is that the modifying function supplied to the latter function is suppose to be very quick, but that is besides the point.

Here is a sample piece of code that reproduces what I am talking about:

{-# LANGUAGE NumericUnderscores #-}
module Main where

import Control.Concurrent
import Control.Concurrent.Async
import Control.Monad
import Data.IORef
import Say (sayString)
import Data.Time.Clock
import System.IO.Unsafe

main :: IO ()
main = do
  ref <- newIORef (10 :: Int)
  before <- getCurrentTime
  race_ (threadBusy ref 10_000_000) (threadBlock ref)
  after <- getCurrentTime
  sayString $ "Elapsed: " ++ show (diffUTCTime after before)


threadBlock :: IORef Int -> IO ()
threadBlock ref = do
  sayString "Below threads are totally blocked on a busy IORef"
  race_ (forever $ sayString "readIORef: Wating ..." >> threadDelay 500_000) $ do
    -- need to give a bit of time to ensure ref is set to busy by another thread
    threadDelay 100_000
    x <- readIORef ref
    sayString $ "Unblocked with value: " ++ show x


threadBusy :: IORef Int -> Int -> IO ()
threadBusy ref n = do
  sayString $ "Setting IORef to busy for " ++ show n ++ " μs"
  y <- atomicModifyIORef' ref (\x -> unsafePerformIO (threadDelay n) `seq` (x * 10000, x))
  -- threadDelay is not required above, a simple busy loop that takes a while works just as well
  sayString $ "Finished blocking the IORef, returned with value: " ++ show y

Running this piece of code produces:

$ stack exec --package time --package async --package say --force-dirty --resolver nightly -- ghc -O2 -threaded atomic-ref.hs && ./atomic-ref
Setting IORef to busy for 10000000 μs
Below threads are totally blocked on a busy IORef
readIORef: Wating ...
Unblocked with value: 100000
readIORef: Wating ...
Finished blocking the IORef, returned with value: 10
Elapsed: 10.003357215s

Note that readIORef: Wating ... is printed only twice, once before blocking and one more time afterwards. This is very unexpected, since it is an action that runs in a totally separate thread. This means that blocking on IORef affects other threads than the one that invoked readIORef, which is even more surprising.

Are those semantics expected, or is it a bug? I fit is not a bug, why is this expected? I'll open a ghc bug later, unless someone has an explanation for this behavior that I can't think of. I won't be surprised that this is some limitation of ghc runtime, in which case I will provide an answer here later. Regardless of the outcome it is very useful to know about this behavior.

Edit 1

The busy loop I tried that does not require unsafePerformIO was requested in comments, so here it is

threadBusy :: IORef Int -> Int -> IO ()
threadBusy ref n = do
  sayString $ "Setting IORef to busy for " ++ show n ++ " μs"
  y <- atomicModifyIORef ref (\x -> busyLoop 10000000000 `seq` (x * 10000, x))
  sayString $ "Finished blocking the IORef, returned with value: " ++ show y

busyLoop :: Int -> Int
busyLoop n = go 1 0
  where
    go acc i
      | i < n = go (i `xor` acc) (i + 1)
      | otherwise = acc

The outcome is exactly the same, except the runtime is slightly different.

Setting IORef to busy for 10000000 μs
Below threads are totally blocked on a busy IORef
readIORef: Wating ...
Unblocked with value: 100000
readIORef: Wating ...
Finished blocking the IORef, returned with value: 10
Elapsed: 8.545412986s

Edit 2

It turns out that sayString was the reason for no output not appearing. Here is what the out is when sayString is swapped for putStrLn:

Below threads are totally blocked on a busy IORef
Setting IORef to busy for 10000000 μs
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
readIORef: Wating ...
Finished blocking the IORef, returned with value: 10
Unblocked with value: 100000
Elapsed: 10.002272691s

That still does not answer the question, why readIORef block. In fact I just stumbled upon a quote from a book "Haskell High Performance" By Samuli Thomasson that tells us that blocking should not happen:

Solution

I think I understand what happens now. TLDR, readIORef is not a blocking operation! Big thanks to everyone who commented on the question.

The way I break down the logic mentally is (same as in question, but with added Thread names):


threadBlock :: IORef Int -> IO ()
threadBlock ref = do
  race_ ({- Thread C -} forever $ sayString "readIORef: Wating ..." >> threadDelay 500_000) $ do
    {- Thread B -}
    threadDelay 100_000
    x <- readIORef ref
    sayString $ "Unblocked with value: " ++ show x

threadBusy :: IORef Int -> Int -> IO ()
threadBusy ref n = do {- Thread A -}
  sayString $ "Setting IORef to busy for " ++ show n ++ " μs"
  y <- atomicModifyIORef' ref (\x -> unsafePerformIO (threadDelay n) `seq` (x * 10000, x))
  sayString $ "Finished blocking the IORef, returned with value: " ++ show y

Thread A updates the content of a ref with a thunk that will be filled when this computation is done unsafePerformIO (threadDelay n) `seq` (x * 10000, x). The important part is that because atomicModifyIORef' is most likely implemented with CAS (compare-and-swap) and the swap was successful, since expected value matched and the new value was updated with the thunk that has not been evaluated yet. Because atomicModifyIORef' is a strict it has to wait until the value is computed, which will take 10 sec before returning. So thread A blocks.
Thread B reads the thunk from ref with readIORef WITHOUT blocking. Now once an attempt is made to print the new content of a thunk x it has to stop and wait until it is filled with a value, which still is in a process of being computed. Because of that it has to wait thus it looks like it is blocked.
Thread C was suppose to be printing a message every 0.5 sec with sayString, but it fails to do so and therefore behaved as it was blocked as well. From a quick look say package and GHC.IO.Handle it looks like a Handle for stdout gets blocked by thread B, because printing in say package suppose to happen without interleaving and for that reason thread C could not do any printing either, thus it looked like it was blocked as well. That is why switching to putStrLn unblocked Thread C and allowed it to print a message every 0.5 sec.

This definitely convinces me, but if anyone has a better explanation I'll be happy to accept another answer.