performancehaskelloptimizationhaskell-criterion

Cross module optimizations in GHC


I have a non-recursive function to calculate longest common subsequence that seems to perform well (ghc 7.6.1, compiled with -O2 -fllvm flags) if I measure it with Criterion in the same module. On the other hand, if I convert the function into a module, export just that function (as recommended here), and then measure again with Criterion, I get ~2x slowdown (which goes away if I move the criterion test back to the module where function is defined). I tried marking the function with INLINE pragma which didn't make any difference in cross-module performance measurements.

It seems to me that GHC might be doing a strictness analysis that works well when the function and the main (from which the function is reachable) are in the same module, but not when they are split across. I would appreciate pointers on how to modularize the function so that it performs well when called from other modules. The code in question is too big to paste here - you can see it here if you want to try it out. A small example of what I am trying to do is below (with snippets of code):

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

Solution

  • hammar is right, the important issue is that the compiler can see the type that lcs is used at at the same time as it can see the code, so it can specialise the code to that particular type.

    If the compiler doesn't know the type at which the code shall be used, it cannot but only produce polymorphic code. And that is bad for performance - I'm rather surprised it's only a ~2× difference here. Polymorphic code means that for many operations a type-class lookup is needed, and that at least makes it impossible to inline the looked-up function or constant-fold sizes [e.g. for unboxed array/vector access].

    You cannot obtain comparable performance to the one-module case with implementation and use in separate modules without making the code that needs specialising visible at the use site (or, if you know the needed types at the implementation site, specialising there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).

    Making the code visible at the use-site is usually done by exposing the unfolding in the interface file via marking the function {-# INLINABLE #-}.

    I tried marking the function with INLINE pragma which didn't make any difference in cross-module performance measurements.

    Marking only

    lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
    lcs a b | (U.length a > U.length b) = lcsh b a True
            | otherwise = lcsh a b False
    

    INLINE or INLINABLE doesn't make a difference of course, that function is trivial, and the compiler exposes its unfolding anyway, since it's so small. Even if its unfolding wasn't exposed, the difference would not be measurable.

    You need to expose the unfoldings of the functions doing the actual work too, at least that of the polymorphic ones, lcsh, findSnakes, gridWalk and cmp (cmp is the one that's crucial here, but the others are necessary to 1. see that cmp is needed, 2. call the specialised cmp from them).

    Making those INLINABLE, the difference between the separate-module case

    $ ./diffBench 
    warming up
    estimating clock resolution...
    mean is 1.573571 us (320001 iterations)
    found 2846 outliers among 319999 samples (0.9%)
      2182 (0.7%) high severe
    estimating cost of a clock call...
    mean is 40.54233 ns (12 iterations)
    
    benchmarking lcs 10
    mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
    std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
    variance introduced by outliers: 26.787%
    variance is moderately inflated by outliers
    

    and the single-module case

    $ ./oneModule 
    warming up
    estimating clock resolution...
    mean is 1.726459 us (320001 iterations)
    found 2092 outliers among 319999 samples (0.7%)
      1608 (0.5%) high severe
    estimating cost of a clock call...
    mean is 39.98567 ns (14 iterations)
    
    benchmarking lcs 10
    mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
    std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
    variance introduced by outliers: 26.791%
    variance is moderately inflated by outliers
    

    is bearably small.