I want to add Haskell to my toolbox so I'm working my way through Real World Haskell.
In the chapter in Input and Output, in the section on hGetContents
, I came across this example:
import System.IO
import Data.Char(toUpper)
main :: IO ()
main = do
inh <- openFile "input.txt" ReadMode
outh <- openFile "output.txt" WriteMode
inpStr <- hGetContents inh
let result = processData inpStr
hPutStr outh result
hClose inh
hClose outh
processData :: String -> String
processData = map toUpper
Following this code sample, the authors go on to say:
Notice that
hGetContents
handled all of the reading for us. Also, take a look atprocessData
. It's a pure function since it has no side effects and always returns the same result each time it is called. It has no need to know—and no way to tell—that its input is being read lazily from a file in this case. It can work perfectly well with a 20-character literal or a 500GB data dump on disk. (N.B. Emphasis is mine)
My question is: how does hGetContents
or its resultant values achieve this memory efficiency without – in this example – processData
"being able to tell", and still maintain all benefits that accrue to pure code (i.e. processData
), specifically memoization?
<- hGetContents inh
returns a string so inpStr
is bound to a value of type String
, which is exactly the type that processData
accepts. But if I understand the authors of Real World Haskell correctly, then this string isn't quite like other strings, in that it's not fully loaded into memory (or fully evaluated, if such a things as not-fully-evaluated strings exists...) by the time of the call to processData
.
Therefore, another way to ask my question is: if inpStr
is not fully evaluated or loaded into memory at the time of the call to processData
, then how can it be used to lookup if a memoized call to processData
exists, without first fully evaluating inpStr
?
Are there instances of type String
that each behave differently but cannot be told apart at this level of abstraction?
The String
returned by hGetContents
is no different from any other Haskell string. In general, Haskell data is not fully evaluated unless the programmer has taken extra steps to ensure that it is (e.g. seq
, deepseq
, bang patterns).
Strings are defined as (essentially)
data List a = Nil | Cons a (List a) -- Nil === [], Cons === :
type String = List Char
This means that a string is either empty, or a single character (the head) and another string (the tail). Due to laziness, the tail may not exist in memory, and may even be infinite. Upon processing a String
, a Haskell program will typically check if it's Nil
or Cons
, then branch and proceed as necessary. If the function doesn't need to evaluate the tail, it won't. This function, for example, only needs to check the initial constructor:
safeHead :: String -> Maybe Char
safeHead [] = Nothing
safeHead (x:_) = Just x
This is a perfectly legitimate string
allA's = repeat 'a' :: String
that is infinite. You can manipulate this string sensibly, however if you try to print all of it, or calculate the length, or any sort of unbounded traversal your program won't terminate. But you can use functions like safeHead
without any problem whatsoever, and even consume some finite initial substring.
Your intuition that something strange is happening is correct, however. hGetContents
is implemented using the special function unsafeInterleaveIO, which is essentially a compiler hook into IO
behavior. The less said about this, the better.
You're correct that it would be difficult to check if a memoized call to a function exists without having the argument fully evaluated. However, most compilers don't perform this optimization. The problem is that it's very difficult for a compiler to determine when it's worthwhile to memoize calls, and very easy to consume all of your memory by doing so. Fortunately there are several memoizing libraries you can use to add memoization when appropriate.