Using http-conduit
I want to download the raw MediaWiki markup of any page, for example the Wikipedia page Stack Overflow. I'd also like the solution to be applicable to MediaWiki sites other than en.wikipedia.org, for example de.wikibooks.org.
Note: This question was immediately answered in Q&A form and therefore intentionally does not show research effort!
This question uses query parameters in http-conduit, as described in a previous SO answer.
We will use the method described in that answer to download the markup content of a page.
Although this task could be accomplished using the MediaWiki API, it seems significantly simpler to use the ?action=raw method without going through the API explicitly.
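Concretely, appending ?action=raw to the regular article URL should make MediaWiki return the raw wikitext instead of rendered HTML; for the article from the question, that would be https://en.wikipedia.org/wiki/Stack_Overflow?action=raw.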
In order to support sites other than en.wikipedia.org (e.g. de.wikibooks.org), I wrote two functions, getWikipediaPageMarkup and getEnwikiPageMarkup. The former is more general and allows custom domains to be used (any domain should work, assuming MediaWiki is installed under /wiki).
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as LB
import Network.HTTP.Types (urlEncode)
import Data.Monoid ((<>))

-- | Get the MediaWiki markup
getWikipediaPageMarkup :: ByteString        -- ^ The wiki domain, e.g. "en.wikipedia.org"
                       -> ByteString        -- ^ The title of the page to download
                       -> IO LB.ByteString  -- ^ The raw page markup
getWikipediaPageMarkup domain page = do
    let url = "https://" <> domain <> "/wiki/" <> urlEncode True page
    request <- parseUrl $ B.unpack url
    -- ?action=raw makes MediaWiki return the wikitext instead of rendered HTML
    let request' = setQueryString [("action", Just "raw")] request
    fmap responseBody $ withManager $ httpLbs request'

-- | Like @getWikipediaPageMarkup@, but hardcoded to "en.wikipedia.org"
getEnwikiPageMarkup :: ByteString -> IO LB.ByteString
getEnwikiPageMarkup = getWikipediaPageMarkup "en.wikipedia.org"
Note that a recent http-conduit version (minimum: 2.1, tested with 2.1.4) is required in order to compile the code.
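For completeness, here is a minimal usage sketch, assuming the definitions above are in the same module (so the OverloadedStrings pragma and the LB import are already in place). The page titles are only illustrative: "Stack_Overflow" is the article from the question, and "Hauptseite" is assumed to be the main page of de.wikibooks.org.

main :: IO ()
main = do
    -- Fetch the raw markup of the English Wikipedia article from the question
    -- and print the first 500 bytes
    enMarkup <- getEnwikiPageMarkup "Stack_Overflow"
    LB.putStrLn $ LB.take 500 enMarkup

    -- The general function also works for other MediaWiki installations
    deMarkup <- getWikipediaPageMarkup "de.wikibooks.org" "Hauptseite"
    LB.putStrLn $ LB.take 500 deMarkup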