
Download Wikipedia markup using Haskell


Using http-conduit, I want to download the raw MediaWiki markup for any page, for example the Wikipedia page Stack Overflow.

The solution should also work for MediaWiki sites other than en.wikipedia.org, for example de.wikibooks.org.

Note: This question was immediately answered in Q&A form and therefore intentionally does not show research effort!


Solution

  • This solution sets query parameters in http-conduit as described in this previous SO answer.

    We will use the method described here on SO to download the markup content of a page.

    Although this task should also be possible using the MediaWiki API, it seems significantly simpler to request the page with ?action=raw, without explicitly using the API.

    To support different wikis (e.g. en.wikimedia.org), I wrote two functions, getWikipediaPageMarkup and getEnwikiPageMarkup. The former is more general and accepts a custom domain (any domain should work, assuming MediaWiki is installed under /wiki).

    {-# LANGUAGE OverloadedStrings #-}
    import Network.HTTP.Conduit
    import Data.ByteString (ByteString)
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy.Char8 as LB
    import Network.HTTP.Types (urlEncode)
    import Data.Monoid ((<>))
    
    -- | Get the MediaWiki markup of a page
    getWikipediaPageMarkup :: ByteString -- ^ The wikipedia domain, e.g. "en.wikipedia.org"
                           -> ByteString -- ^ The wikipedia page title to download
                           -> IO LB.ByteString -- ^ The wikipedia page markup
    getWikipediaPageMarkup domain page = do
        let url = "https://" <> domain <> "/wiki/" <> urlEncode True page
        request <- parseUrl $ B.unpack url
        let request' = setQueryString [("action", Just "raw")] request
        fmap responseBody $ withManager $ httpLbs request'
    
    -- | Like @getWikipediaPageMarkup@, but hardcoded to 'en.wikipedia.org'
    getEnwikiPageMarkup :: ByteString -> IO LB.ByteString
    getEnwikiPageMarkup = getWikipediaPageMarkup "en.wikipedia.org"
    

    Note that http-conduit 2.1 or later is required to compile this code (tested with 2.1.4).
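
For readers on a newer http-conduit, here is a hedged sketch of the same approach. As far as I can tell, parseUrl is deprecated in later releases and withManager is no longer available, so this variant uses parseRequest and an explicit Manager instead. The names pageUrl and getPageMarkup and the User-Agent string are hypothetical and not part of the original answer. The User-Agent header is an assumption based on Wikimedia asking clients to identify themselves. Treat this as a starting point under those assumptions, not as the definitive port.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString            (ByteString)
import qualified Data.ByteString.Char8      as B
import qualified Data.ByteString.Lazy.Char8 as LB
import           Network.HTTP.Conduit       (httpLbs, newManager, parseRequest,
                                             requestHeaders, responseBody,
                                             setQueryString, tlsManagerSettings)
import           Network.HTTP.Types         (urlEncode)

-- | Build the page URL without the query string (hypothetical helper).
--   Assumes MediaWiki is installed under /wiki, as in the original code.
pageUrl :: ByteString -- ^ Domain, e.g. "en.wikipedia.org"
        -> ByteString -- ^ Page title
        -> ByteString
pageUrl domain page = "https://" <> domain <> "/wiki/" <> urlEncode True page

-- | Download the raw markup of a page via ?action=raw (hypothetical name).
getPageMarkup :: ByteString -> ByteString -> IO LB.ByteString
getPageMarkup domain page = do
    -- A TLS-capable manager is needed because the URL uses https.
    manager <- newManager tlsManagerSettings
    request <- parseRequest (B.unpack (pageUrl domain page))
    let request' = setQueryString [("action", Just "raw")] request
            -- Assumption: identify the client; replace with your own value.
            { requestHeaders = [("User-Agent", "markup-example/0.1")] }
    responseBody <$> httpLbs request' manager

main :: IO ()
main = getPageMarkup "en.wikipedia.org" "Stack Overflow" >>= LB.putStrLn
```

Unlike parseUrl, parseRequest does not throw on non-2xx status codes, so a missing page yields the server's error body rather than an exception. If you want the old behaviour, parseUrlThrow should be the closer match. In a longer-running program, create the Manager once and reuse it rather than creating one per request.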