Tags: python, wikipedia-api

How can I extract all sections of a Wikipedia page in plain text?


I have the following code in Python, which extracts only the introduction of the article on "Artificial intelligence", but I would instead like to extract all sub-sections (History, Goals, ...):

import requests

def get_wikipedia_page(page_title):
    endpoint = "https://en.wikipedia.org/w/api.php"
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",      # limits the extract to the introduction
        "explaintext": "",  # return plain text instead of HTML
        "titles": page_title
    }
    response = requests.get(endpoint, params=params)
    data = response.json()
    pages = data["query"]["pages"]
    page_id = list(pages.keys())[0]
    return pages[page_id]["extract"]

page_title = "Artificial intelligence"
wikipedia_page = get_wikipedia_page(page_title)

Someone proposed another approach that fetches the HTML and uses BeautifulSoup to convert it to text:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

This is not a good enough solution, as it includes all text that appears on the page (such as image captions), and it keeps citation markers in the text (e.g. [1]), which the first script removes.

I suspect that the Wikipedia API should offer a more elegant solution; it would be rather odd if one could only get the first section.


Solution

  • Retrieving Wikipedia pages as HTML

As in a web browser, we can retrieve the complete Wikipedia page by URL and parse the HTML response with Beautiful Soup.

Wikipedia's API

As an alternative, we can use the API; see Wikipedia's API documentation.

Extract plain text

When using action=query with format=json, the TextExtracts parameters control the output. The ones relevant here are prop=extracts, explaintext (return plain text instead of HTML) and exintro (restrict the extract to the introduction, so omit it to get the whole page):

    Example: https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Artificial%20intelligence&prop=extracts&explaintext
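
    For example, dropping the exintro parameter from the question's script makes prop=extracts with explaintext return the plain text of the whole page, section headings included. A minimal sketch (get_wikipedia_plaintext is just a name chosen here):

    import requests

    def get_wikipedia_plaintext(page_title):
        endpoint = "https://en.wikipedia.org/w/api.php"
        params = {
            "format": "json",
            "action": "query",
            "prop": "extracts",
            "explaintext": "",  # plain text; no "exintro", so all sections are returned
            "titles": page_title
        }
        response = requests.get(endpoint, params=params)
        response.raise_for_status()
        pages = response.json()["query"]["pages"]
        return next(iter(pages.values()))["extract"]

    print(get_wikipedia_plaintext("Artificial intelligence"))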

    Get each section separately

To retrieve the list of sections, use action=parse with format=json and prop=sections:

    There is also an API sandbox where you can try several parameters. The resulting GET request will retrieve all the sections of example page "Artificial intelligence": https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial%20intelligence&prop=sections&formatversion=2

This will respond with JSON containing all sections:

    {
        "parse": {
            "title": "Artificial intelligence",
            "pageid": 1164,
            "sections": [
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "History",
                    "number": "1",
                    "index": "1",
                    "fromtitle": "Artificial_intelligence",
                    "byteoffset": 5987,
                    "anchor": "History",
                    "linkAnchor": "History"
                }
            ]
        }
    }
    

    (simplified, only first section kept)
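
    The same request can be scripted. A small sketch (list_sections is a hypothetical helper name; formatversion=2 is assumed so the response matches the JSON above):

    import requests

    def list_sections(page_title):
        endpoint = "https://en.wikipedia.org/w/api.php"
        params = {
            "action": "parse",
            "format": "json",
            "page": page_title,
            "prop": "sections",
            "formatversion": 2
        }
        data = requests.get(endpoint, params=params).json()
        return data["parse"]["sections"]

    # Prints each section's index and heading, e.g. "1 History"
    for section in list_sections("Artificial intelligence"):
        print(section["index"], section["line"])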

To get the text of one of those sections, specify the section as a query parameter (by index; its title can be passed along as well), e.g. section=1&sectiontitle=History: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial_intelligence&section=1&sectiontitle=History&formatversion=2

    This retrieves the text (in HTML format):

    {
        "parse": {
            "title": "Artificial intelligence",
            "pageid": 1164,
            "revid": 1126677096,
            "text": "<div class=\"mw-parser-output\"><h2><span class=\"mw-headline\" id=\"History\">History</span><span class=\"mw-editsection\"><span class=\"mw-editsection-bracket\">[</span><a href=\"/w/index.php?title=Artificial_intelligence&amp;action=edit&amp;section=1\" title=\"Edit section: History\">edit</a><span class=\"mw-editsection-bracket\">]</span></span></h2>\n<style data-mw-deduplicate=\"TemplateStyles:r1033289096\">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div role=\"note\" class=\"hatnote navigation-not-searchable\">Main articles: <a href=\"/wiki/History_of_artificial_intelligence\" title=\"History of artificial intelligence\">History of artificial intelligence</a> and <a href=\"/wiki/Timeline_of_artificial_intelligence\" title=\"Timeline of artificial intelligence\">Timeline of artificial intelligence</a>
    

Note: the above response was cut off to show only a sample of the text.

Although the text content above is formatted as HTML, there may be options to get it as plain text.
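
    Failing that, the two approaches can be combined: fetch a single section's HTML via action=parse&prop=text and strip the markup with Beautiful Soup, as in the question. A sketch (get_section_text is a hypothetical helper; the CSS classes are those used in current Wikipedia markup):

    import requests
    from bs4 import BeautifulSoup

    def get_section_text(page_title, section_index):
        endpoint = "https://en.wikipedia.org/w/api.php"
        params = {
            "action": "parse",
            "format": "json",
            "page": page_title,
            "prop": "text",
            "section": section_index,
            "formatversion": 2  # "text" is then a plain string
        }
        html = requests.get(endpoint, params=params).json()["parse"]["text"]
        soup = BeautifulSoup(html, "html.parser")
        # Drop inline styles/scripts, the "[edit]" links next to headings,
        # and the citation markers like [1] (<sup class="reference">).
        for tag in soup(["style", "script"]):
            tag.extract()
        for tag in soup.select("span.mw-editsection, sup.reference"):
            tag.extract()
        return soup.get_text().strip()

    print(get_section_text("Artificial intelligence", 1))  # "History" section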


    Python code

You can also use Python packages, for example:

    1. the wikipedia package (see the per-section sketch after this list):
    import wikipedia

    wikipedia.set_lang('en')
    page = wikipedia.page('Artificial intelligence')
    print(page.content)
    
    2. a Gist from Sai Kumar Yava (scionoftech) using requests: A small Python Code to get Wikipedia page content in plain text
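
    For per-section access with the wikipedia package, page.sections lists the section titles and page.section(title) returns the plain text of one section. A short sketch (note that page.section() may return None or an empty string for some nested sub-sections, a known quirk of the package):

    import wikipedia

    wikipedia.set_lang('en')
    page = wikipedia.page('Artificial intelligence')

    # List the section titles, then print one section's plain text.
    for title in page.sections:
        print(title)
    print(page.section('History'))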