performancehaskelllongest-substring

Performance of Longest Substring Without Repeating Characters in Haskell


Upon reading this Python question and proposing a solution, I tried to solve the same challenge in Haskell.

I've come up with the code below, which seems to work. However, since I'm pretty new to this language, I'd like some help in understand whether the code is good performancewise.

lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
  where
    step ("","") c = ([c],[c])
    step (maxSubstr,current) c
      | c `elem` current = step (maxSubstr,init current) c
      | otherwise = let candidate = (c:current)
                        longerThan = (>) `on` length
                        newMaxSubstr = if maxSubstr `longerThan` candidate
                                       then maxSubstr
                                       else candidate
                    in (newMaxSubstr, candidate)

Some points I think could be better than they are


Solution

  • Running elem on every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running length on, in the worst case, every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running init a lot makes your algorithm Ω(n*sqrt(n)) (for strings that are sqrt(n) repetitions of a sqrt(n)-long string, with every other one reversed, and assuming an O(1) elem replacement).

    A better way is to pay one O(n) cost up front to copy into a data structure with constant-time indexing, and to keep a set (or similar data structure) of seen elements rather than a flat list. Like this:

    import Data.Set (Set)
    import Data.Vector (Vector)
    import qualified Data.Set as S
    import qualified Data.Vector as V
    
    lswrc2 :: String -> String
    lswrc2 "" = ""
    lswrc2 s_ = go S.empty 0 0 0 0 where
        s = V.fromList s_
        n = V.length s
        at = V.unsafeIndex s
        go seen lo hi bestLo bestHi
            | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
            -- it is probably faster (possibly asymptotically so?) to use findIndex
            -- to immediately pick the correct next value of lo
            | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
            | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
                if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi
    

    This should have O(n*log(n)) worst-case performance (achieving that worst case on strings with no repeats). There may be ways that are better still; I haven't thought super hard about it.

    On my machine, lswrc2 consistently outperforms lswrc on random strings. On the string ['\0' .. '\100000'], lswrc takes about 40s and lswrc2 takes 0.03s. lswrc2 can handle [minBound .. maxBound] in about 0.4s; I gave up after more than 20 minutes of letting lswrc chew on that list.