mediawiki wikipedia wikipedia-api

How to get the text of a specific section via the Wikipedia API


I would like to extract only a specific section from a Wikipedia page.

Example: I would like to extract the text of the "Parts" section of the Wikipedia article "House".

https://en.wikipedia.org/wiki/House

The resulting text would be:

Many houses have several large rooms  .....  sections of the home (including in more recent eras a garage). 

We can get the whole text of an article like this:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=house&rvprop=content&format=json

But how do I get the text of a specific section?


Solution

  • Do you need the plain wikitext or the HTML produced by the parser?

    The examples below return the section "Layout" (section index 3 of the House article); any other section index works the same way.

    To retrieve the parsed HTML of a specific section, use the parse API: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=text&section=3&disabletoc=1 or, as an API request outside the sandbox: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=text&section=3&disabletoc=1

    If you want the wikitext of a specific section, use the wikitext prop instead of the text prop: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1

    To find out which section has which index, query the "sections" prop without a section index: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1
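
    The three request URLs above differ only in the prop and section parameters, so they can be generated from one helper. This is a minimal sketch using only the Python standard library; the function name build_parse_url is made up for illustration and is not part of any MediaWiki client.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, prop, section=None):
    """Build an action=parse request URL (hypothetical helper).

    `section` is the section index; leave it out to address the whole page.
    """
    params = {"action": "parse", "format": "json", "page": page, "prop": prop}
    if section is not None:
        params["section"] = str(section)
    params["disabletoc"] = "1"  # kept only to mirror the URLs in this answer
    return API_ENDPOINT + "?" + urlencode(params)


html_url = build_parse_url("house", "text", section=3)          # parsed HTML
wikitext_url = build_parse_url("house", "wikitext", section=3)  # raw wikitext
sections_url = build_parse_url("house", "sections")             # section list
print(html_url)
```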

    So, as a full example of retrieving the Layout section text using only the API, you would:

    1. Retrieve the sections of the article: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1

    Response:

    {
        "parse": {
            "title": "House",
            "pageid": 13590,
            "sections": [
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Etymology",
                    "number": "1",
                    "index": "1",
                    "fromtitle": "House",
                    "byteoffset": 3549,
                    "anchor": "Etymology"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Elements",
                    "number": "2",
                    "index": "2",
                    "fromtitle": "House",
                    "byteoffset": 4960,
                    "anchor": "Elements"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "Layout",
                    "number": "2.1",
                    "index": "3",
                    "fromtitle": "House",
                    "byteoffset": 4976,
                    "anchor": "Layout"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "Parts",
                    "number": "2.2",
                    "index": "4",
                    "fromtitle": "House",
                    "byteoffset": 6432,
                    "anchor": "Parts"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "History of the interior",
                    "number": "2.3",
                    "index": "5",
                    "fromtitle": "House",
                    "byteoffset": 7539,
                    "anchor": "History_of_the_interior"
                },
                {
                    "toclevel": 3,
                    "level": "4",
                    "line": "Communal rooms",
                    "number": "2.3.1",
                    "index": "6",
                    "fromtitle": "House",
                    "byteoffset": 8786,
                    "anchor": "Communal_rooms"
                },
                {
                    "toclevel": 3,
                    "level": "4",
                    "line": "Interconnecting rooms",
                    "number": "2.3.2",
                    "index": "7",
                    "fromtitle": "House",
                    "byteoffset": 9736,
                    "anchor": "Interconnecting_rooms"
                },
                {
                    "toclevel": 3,
                    "level": "4",
                    "line": "Corridor",
                    "number": "2.3.3",
                    "index": "8",
                    "fromtitle": "House",
                    "byteoffset": 11126,
                    "anchor": "Corridor"
                },
                {
                    "toclevel": 3,
                    "level": "4",
                    "line": "Employment-free house",
                    "number": "2.3.4",
                    "index": "9",
                    "fromtitle": "House",
                    "byteoffset": 13092,
                    "anchor": "Employment-free_house"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "Work location, technology and doctors",
                    "number": "2.4",
                    "index": "10",
                    "fromtitle": "House",
                    "byteoffset": 15969,
                    "anchor": "Work_location,_technology_and_doctors"
                },
                {
                    "toclevel": 3,
                    "level": "4",
                    "line": "Technology and privacy",
                    "number": "2.4.1",
                    "index": "11",
                    "fromtitle": "House",
                    "byteoffset": 17291,
                    "anchor": "Technology_and_privacy"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Construction",
                    "number": "3",
                    "index": "12",
                    "fromtitle": "House",
                    "byteoffset": 18782,
                    "anchor": "Construction"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "Energy efficiency",
                    "number": "3.1",
                    "index": "13",
                    "fromtitle": "House",
                    "byteoffset": 21899,
                    "anchor": "Energy_efficiency"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "Earthquake protection",
                    "number": "3.2",
                    "index": "14",
                    "fromtitle": "House",
                    "byteoffset": 23057,
                    "anchor": "Earthquake_protection"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Found materials",
                    "number": "4",
                    "index": "15",
                    "fromtitle": "House",
                    "byteoffset": 25172,
                    "anchor": "Found_materials"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Legal issues",
                    "number": "5",
                    "index": "16",
                    "fromtitle": "House",
                    "byteoffset": 26235,
                    "anchor": "Legal_issues"
                },
                {
                    "toclevel": 2,
                    "level": "3",
                    "line": "United Kingdom",
                    "number": "5.1",
                    "index": "17",
                    "fromtitle": "House",
                    "byteoffset": 26644,
                    "anchor": "United_Kingdom"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Identifying houses",
                    "number": "6",
                    "index": "18",
                    "fromtitle": "House",
                    "byteoffset": 26922,
                    "anchor": "Identifying_houses"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Animal houses",
                    "number": "7",
                    "index": "19",
                    "fromtitle": "House",
                    "byteoffset": 27397,
                    "anchor": "Animal_houses"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "Houses and symbolism",
                    "number": "8",
                    "index": "20",
                    "fromtitle": "House",
                    "byteoffset": 27826,
                    "anchor": "Houses_and_symbolism"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "See also",
                    "number": "9",
                    "index": "21",
                    "fromtitle": "House",
                    "byteoffset": 28620,
                    "anchor": "See_also"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "References",
                    "number": "10",
                    "index": "22",
                    "fromtitle": "House",
                    "byteoffset": 29690,
                    "anchor": "References"
                },
                {
                    "toclevel": 1,
                    "level": "2",
                    "line": "External links",
                    "number": "11",
                    "index": "23",
                    "fromtitle": "House",
                    "byteoffset": 29720,
                    "anchor": "External_links"
                }
            ]
        }
    }
    
    2. Iterate over the result, find the section you want, and read its index (here "3" for Layout).
    3. Use that index in the next API request to get the section content: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1

    Response:

    {
        "parse": {
            "title": "House",
            "pageid": 13590,
            "wikitext": {
                "*": "=== Layout ===\n[[File:Gingerbread House Essex CT.jpg|thumb|Example of an early [[Victorian architecture|Victorian]] \"Gingerbread House\" in [[Connecticut]], United States, built in 1855]]\n\nIdeally, [[architect]]s of houses design [[room]]s to meet the needs of the people who will live in the house. [[Feng shui]], originally a [[China|Chinese]] method of moving houses according to such factors as rain and micro-climates, has recently expanded its scope to address the design of interior spaces, with a view to promoting harmonious effects on the people living inside the house, although no actual effect has ever been demonstrated. Feng shui can also mean the \"aura\" in or around a dwelling, making it comparable to the [[real estate|real-estate]] sales concept of \"indoor-outdoor flow\".\n\nThe [[square footage]] of a house in the United States reports the area of \"living space\", excluding the garage and other non-living spaces. The \"square metres\" figure of a house in Europe <!-- including Malta ? --> reports the area of the walls enclosing the home, and thus includes any attached garage and non-living spaces.<ref>{{Cite book|title=Land Management: Challenges and Strategies (First Edition)|last=Iyyer|first=Chaitanya|publisher=Global India Publications Pvt Ltd|year=2009|isbn=978-9380228488|location=|pages=}}</ref>{{Citation needed|date=February 2007}} The number of floors or levels making up the house can affect the square footage of a home."
            }
        }
    }
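
    The three steps can be sketched in Python roughly as follows. This assumes the response shape shown above (a "sections" list whose entries carry "line" and "index", and the wikitext under the "*" key). The function names and the User-Agent string are placeholders chosen for this example, and matching on the exact heading text is a simplification.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Placeholder value: replace with something that identifies your own client.
USER_AGENT = "section-example/0.1 (placeholder contact)"


def call_parse_api(page, prop, section=None):
    """Send one action=parse request and return the "parse" object."""
    params = {"action": "parse", "format": "json", "page": page, "prop": prop}
    if section is not None:
        params["section"] = section
    request = Request(API_ENDPOINT + "?" + urlencode(params),
                      headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return json.load(response)["parse"]


def find_section_index(sections, heading):
    """Step 2: return the index of the first section titled `heading`."""
    for section in sections:
        if section["line"] == heading:
            return section["index"]
    return None


def get_section_wikitext(page, heading):
    """Steps 1 to 3: list sections, look up the index, fetch the wikitext."""
    sections = call_parse_api(page, "sections")["sections"]
    index = find_section_index(sections, heading)
    if index is None:
        raise ValueError(f"no section titled {heading!r} on page {page!r}")
    return call_parse_api(page, "wikitext", index)["wikitext"]["*"]


# Offline check against a trimmed copy of the response shown above.
sample_sections = [
    {"line": "Layout", "index": "3"},
    {"line": "Parts", "index": "4"},
]
print(find_section_index(sample_sections, "Parts"))  # prints 4
```

    With network access, get_section_wikitext("House", "Parts") would then return the wikitext of the section asked about in the question, as long as the article still has a section with that exact heading.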
    

    Background: The concept of sections is not (yet) part of revisions. A revision is "just" the content of the whole page plus additional metadata (possibly in several other slots), and the sections are part of the content, which lives in a single slot of the revision. That is why the revision query API can only return the whole text. Sections are a wikitext concept, so the page has to be parsed before the API can tell where they begin and end.
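
    To illustrate the point: if you only have the whole wikitext from the revisions API, you have to locate the headings yourself. The sketch below is a deliberately naive approximation that treats every line of the form "== Title ==" as a heading. It is not how MediaWiki's parser works, and it will be wrong for headings produced by templates or placed inside comments or nowiki blocks, which is why the parse API is the safer route. The function name extract_section is made up for this example.

```python
import re

# A line made of 2 to 6 "=" signs, a title, and the same number of "=" signs.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def extract_section(wikitext, heading):
    """Return the first section titled `heading`, subsections included.

    Naive approximation: only literal heading lines are recognized.
    """
    matches = list(HEADING_RE.finditer(wikitext))
    for i, match in enumerate(matches):
        if match.group(2) != heading:
            continue
        level = len(match.group(1))
        end = len(wikitext)
        # The section ends at the next heading of the same or a higher level.
        for later in matches[i + 1:]:
            if len(later.group(1)) <= level:
                end = later.start()
                break
        return wikitext[match.start():end].strip()
    return None


sample = (
    "Intro.\n"
    "== Elements ==\nText A.\n"
    "=== Layout ===\nText B.\n"
    "=== Parts ===\nText C.\n"
    "== Construction ==\nText D.\n"
)
print(extract_section(sample, "Parts"))
```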