parsing haskell jpeg attoparsec

How to parse an entropy-coded JPEG block efficiently?


I'm just trying to skip over a SOS_MT block in a .JPEG file. I don't want to use the data for anything; I just want to know where it ends. According to what I understand from the JPEG article on Wikipedia, while all other blocks in a JPEG file start with a few bytes that indicate the block's length, a SOS_MT block is ... well, an evil swamp that you have no option but to parse byte by byte until you get to the end of it.

So I came up with the following code to do just that:

entropyCoded :: Parser Int
entropyCoded = do
    list_of_lengths <-  many' $
         (
           do
             _ <- notWord8 0xFF
             return 1
         )
         <|>
         (
           do
             _ <- word8 0xFF
             _ <- word8 0
             return 2
         )
         <|>
         (
           do
             l <- many1 (word8 0xFF)
             _ <- satisfy (\x -> ( x >= 0xD0 && x <= 0xD7 ))
             return $ 1 + length l
         )
         <|>
         (
           do
             _ <- word8 0xFF
             maybe_ff <- peekWord8'
             if maybe_ff == 0xFF
               then
                 return 1
               else
                 fail "notthere"
         )
    foldM (\ nn n -> nn `seq` return (nn + n) ) 0 list_of_lengths

This code uses Attoparsec and, as far as I have been able to verify, it is correct. It is just slow. Any tips on how to improve this parser's performance?


Solution

  • If you want to skip over the data that follows an SOS marker, just look for the next marker that is not a restart marker.

    Read bytes until you find an FF. If the next value is 00, it is a compressed FF value, so skip over it. If it's a restart marker (FF D0 through FF D7), skip over it. Otherwise, the FF should start the next block.
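
The steps above could be sketched roughly as follows. This is only an illustration, not a drop-in replacement for the parser in the question: it works on a plain strict ByteString instead of Attoparsec, it assumes the input starts at the first byte of the entropy-coded data, and the names scanEnd and isRestart are made up for this example. It also treats a run of FF bytes as padding before a marker, as the question's code does.

```haskell
import qualified Data.ByteString as B
import           Data.Word       (Word8)

-- | Restart markers are FF D0 .. FF D7 (inclusive).
isRestart :: Word8 -> Bool
isRestart c = c >= 0xD0 && c <= 0xD7

-- | Offset of the FF that starts the first real marker after the
--   entropy-coded data, or Nothing if the input ends before one is found.
--   (Hypothetical helper, named for this sketch only.)
scanEnd :: B.ByteString -> Maybe Int
scanEnd = go 0
  where
    go off bs = do
      i <- B.elemIndex 0xFF bs              -- jump straight to the next FF
      let rest = B.drop (i + 1) bs
      (code, _) <- B.uncons rest            -- byte following the FF, if any
      case code of
        _ | code == 0x00 || isRestart code  -- stuffed FF or restart marker
              -> go (off + i + 2) (B.drop 1 rest)
          | code == 0xFF                    -- padding FF: look again from the next FF
              -> go (off + i + 1) rest
          | otherwise                       -- any other marker ends the scan
              -> Just (off + i)
```

The point of the sketch is that ordinary bytes are never examined one alternative at a time: each step searches for the next FF in bulk and only then looks at a single following byte. Within Attoparsec, the same idea would be to consume the ordinary bytes with skipWhile (/= 0xFF) rather than one notWord8 per iteration of many'.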