haskell, data-structures, heap-memory

Do I need to understand how Haskell represents data to be able to write good Haskell programs?


I'm learning Haskell from a Java background. When I program in Java, I feel like I have a strong understanding of how objects are laid out in memory and of the consequences of this. For example, I know exactly how java.lang.String and java.util.LinkedList work, and therefore I know how I should use them. With Haskell I'm a bit lost. For example, how does (:) work? Should I care? Is it specified somewhere?


Solution

  • The short answer is no. When programming in Haskell you should think of your data structures as pure mathematical objects and not worry about how they're represented in memory. The reason for this is that, in the absence of side-effects, there's really nothing to data except the functions that create it and the functions you can use to extract the simpler parts out of which it was constructed.

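    To make that idea concrete, here is a minimal sketch (the Point type and the function are hypothetical, not from the question): the only way to make a value is to apply its constructor, and the only way to get the pieces back out is to pattern match on that same constructor.

    data Point = Point Double Double

    -- The constructor builds a Point; pattern matching takes it apart again.
    distanceFromOrigin :: Point -> Double
    distanceFromOrigin (Point x y) = sqrt (x * x + y * y)
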
    To see information about data constructors like (:), or any other terms, use the :type (or just :t for short) command inside GHCi:

    Prelude> :type (:)
    (:) :: a -> [a] -> [a]
    

    That tells you that the (:) constructor (pronounced "cons") takes a value of any type and a list of that same type, and returns a list of the same type. You can get a bit more information by using the :info command, which will also show you what the data definition looks like:

    Prelude> :info (:)
    data [] a = ... | a : [a]   -- Defined in GHC.Types
    infixr 5 :
    

    This tells you that (:) is the constructor which prepends an element to an existing list.

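    A quick GHCi session of my own, just to illustrate: (:) prepends an element, and the infixr 5 declaration above means a chain of (:) groups to the right.

    Prelude> 1 : [2,3]
    [1,2,3]
    Prelude> 1 : 2 : 3 : []    -- parsed as 1 : (2 : (3 : []))
    [1,2,3]
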
    I also highly recommend Hoogle, not only for looking up things by name but also for the reverse kind of search, where you know the signature of the function you're looking for and want to find out whether someone has already written it for you. Hoogle is nice because it gives descriptions and example usages.

    Shapes of Inductive Data

    I said above that it's not important to know how your data is represented in memory; you should, however, understand the shape of the data you're dealing with, so that you can avoid poor performance decisions. All data in Haskell is inductively defined, meaning it has a tree-like shape that unfolds outwards recursively. You can tell the shape of data by looking at its definition; there's really nothing hidden about its performance characteristics once you know how to read a definition like this:

    data MyList a = Nil | Cons a (MyList a)
    

    As you can see from the definition, the only way to get a new MyList is via the Cons constructor. If you use this constructor multiple times, you end up with something of roughly this shape:

    (Cons a5 (Cons a4 (Cons a3 (Cons a2 (Cons a1 Nil)))))
    

    It's just a tree with no branches; that's the definition of a list! And the only way to get at a1 is by popping off each Cons in turn, so access to the last element is O(n), whereas access to the head is constant time. Once you can do this kind of reasoning about data structures based on their definitions, you're all set.
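
    To make that last point concrete, here is a small sketch (the helper functions are hypothetical, assuming the MyList definition above): getting the first element only inspects the outermost constructor, while getting the last element has to walk past every Cons.

    -- Constant time: only the outermost constructor is examined.
    myHead :: MyList a -> Maybe a
    myHead Nil        = Nothing
    myHead (Cons x _) = Just x

    -- Linear time: we must recurse through every Cons to reach the end.
    myLast :: MyList a -> Maybe a
    myLast Nil           = Nothing
    myLast (Cons x Nil)  = Just x
    myLast (Cons _ rest) = myLast rest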