I would like to extract only a specific section from a Wikipedia page.
Example: I would like to extract the text of the section "Parts" from the Wikipedia article "House":
https://en.wikipedia.org/wiki/House
The resulting text would be:
Many houses have several large rooms ..... sections of the home (including in more recent eras a garage).
We can get the whole text from an article like the following:
But how do I get the text of a specific section?
Do you need the plain wikitext or the HTML produced by the parser?
The examples below give you the section "Layout" (the 3rd section of the House article; you can use any other section index as well).
When you want to retrieve the parsed HTML of a specific section, you should use the parse API: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=text&section=3&disabletoc=1 or, as an API request outside of the sandbox: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=text&section=3&disabletoc=1
If you want the wikitext of a specific section, just use the wikitext prop instead of the text prop: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1
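The request URLs above differ only in the prop and section parameters, so they can be assembled programmatically. A minimal sketch, assuming Python's standard library; the helper name build_parse_url is made up for illustration and is not part of any MediaWiki client:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_parse_url(page, prop, section=None):
    """Build an action=parse URL, optionally limited to one section index.

    Hypothetical helper: it only mirrors the parameters used in the URLs above.
    """
    params = {"action": "parse", "format": "json", "page": page, "prop": prop}
    if section is not None:
        params["section"] = section
    params["disabletoc"] = 1
    return API_ENDPOINT + "?" + urlencode(params)

print(build_parse_url("house", "wikitext", 3))
# https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1
```

Leaving out the section argument gives a request for the whole page, which is what the "sections" lookup needs.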
To find out which section has which index, you can query this information with the "sections" prop, without passing any section index: https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1
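So, as a full example of retrieving the text of the Layout section using only the API, you would first request the list of sections (the "sections" request above) and look up the index of "Layout", which is 3: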
Response:
{
"parse": {
"title": "House",
"pageid": 13590,
"sections": [
{
"toclevel": 1,
"level": "2",
"line": "Etymology",
"number": "1",
"index": "1",
"fromtitle": "House",
"byteoffset": 3549,
"anchor": "Etymology"
},
{
"toclevel": 1,
"level": "2",
"line": "Elements",
"number": "2",
"index": "2",
"fromtitle": "House",
"byteoffset": 4960,
"anchor": "Elements"
},
{
"toclevel": 2,
"level": "3",
"line": "Layout",
"number": "2.1",
"index": "3",
"fromtitle": "House",
"byteoffset": 4976,
"anchor": "Layout"
},
{
"toclevel": 2,
"level": "3",
"line": "Parts",
"number": "2.2",
"index": "4",
"fromtitle": "House",
"byteoffset": 6432,
"anchor": "Parts"
},
{
"toclevel": 2,
"level": "3",
"line": "History of the interior",
"number": "2.3",
"index": "5",
"fromtitle": "House",
"byteoffset": 7539,
"anchor": "History_of_the_interior"
},
{
"toclevel": 3,
"level": "4",
"line": "Communal rooms",
"number": "2.3.1",
"index": "6",
"fromtitle": "House",
"byteoffset": 8786,
"anchor": "Communal_rooms"
},
{
"toclevel": 3,
"level": "4",
"line": "Interconnecting rooms",
"number": "2.3.2",
"index": "7",
"fromtitle": "House",
"byteoffset": 9736,
"anchor": "Interconnecting_rooms"
},
{
"toclevel": 3,
"level": "4",
"line": "Corridor",
"number": "2.3.3",
"index": "8",
"fromtitle": "House",
"byteoffset": 11126,
"anchor": "Corridor"
},
{
"toclevel": 3,
"level": "4",
"line": "Employment-free house",
"number": "2.3.4",
"index": "9",
"fromtitle": "House",
"byteoffset": 13092,
"anchor": "Employment-free_house"
},
{
"toclevel": 2,
"level": "3",
"line": "Work location, technology and doctors",
"number": "2.4",
"index": "10",
"fromtitle": "House",
"byteoffset": 15969,
"anchor": "Work_location,_technology_and_doctors"
},
{
"toclevel": 3,
"level": "4",
"line": "Technology and privacy",
"number": "2.4.1",
"index": "11",
"fromtitle": "House",
"byteoffset": 17291,
"anchor": "Technology_and_privacy"
},
{
"toclevel": 1,
"level": "2",
"line": "Construction",
"number": "3",
"index": "12",
"fromtitle": "House",
"byteoffset": 18782,
"anchor": "Construction"
},
{
"toclevel": 2,
"level": "3",
"line": "Energy efficiency",
"number": "3.1",
"index": "13",
"fromtitle": "House",
"byteoffset": 21899,
"anchor": "Energy_efficiency"
},
{
"toclevel": 2,
"level": "3",
"line": "Earthquake protection",
"number": "3.2",
"index": "14",
"fromtitle": "House",
"byteoffset": 23057,
"anchor": "Earthquake_protection"
},
{
"toclevel": 1,
"level": "2",
"line": "Found materials",
"number": "4",
"index": "15",
"fromtitle": "House",
"byteoffset": 25172,
"anchor": "Found_materials"
},
{
"toclevel": 1,
"level": "2",
"line": "Legal issues",
"number": "5",
"index": "16",
"fromtitle": "House",
"byteoffset": 26235,
"anchor": "Legal_issues"
},
{
"toclevel": 2,
"level": "3",
"line": "United Kingdom",
"number": "5.1",
"index": "17",
"fromtitle": "House",
"byteoffset": 26644,
"anchor": "United_Kingdom"
},
{
"toclevel": 1,
"level": "2",
"line": "Identifying houses",
"number": "6",
"index": "18",
"fromtitle": "House",
"byteoffset": 26922,
"anchor": "Identifying_houses"
},
{
"toclevel": 1,
"level": "2",
"line": "Animal houses",
"number": "7",
"index": "19",
"fromtitle": "House",
"byteoffset": 27397,
"anchor": "Animal_houses"
},
{
"toclevel": 1,
"level": "2",
"line": "Houses and symbolism",
"number": "8",
"index": "20",
"fromtitle": "House",
"byteoffset": 27826,
"anchor": "Houses_and_symbolism"
},
{
"toclevel": 1,
"level": "2",
"line": "See also",
"number": "9",
"index": "21",
"fromtitle": "House",
"byteoffset": 28620,
"anchor": "See_also"
},
{
"toclevel": 1,
"level": "2",
"line": "References",
"number": "10",
"index": "22",
"fromtitle": "House",
"byteoffset": 29690,
"anchor": "References"
},
{
"toclevel": 1,
"level": "2",
"line": "External links",
"number": "11",
"index": "23",
"fromtitle": "House",
"byteoffset": 29720,
"anchor": "External_links"
}
]
}
}
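Looking up the index for a given heading in a response like this one is a short loop over parse.sections. A sketch, assuming the response has already been decoded from JSON into a Python dict; find_section_index is an illustrative name, and the sample data is a trimmed copy of the response above:

```python
def find_section_index(response, heading):
    """Return the "index" of the section whose heading ("line") matches, or None."""
    for section in response["parse"]["sections"]:
        if section["line"] == heading:
            return section["index"]
    return None

# Trimmed copy of the response above, keeping only two entries and three fields.
sample = {"parse": {"title": "House", "sections": [
    {"line": "Layout", "number": "2.1", "index": "3"},
    {"line": "Parts", "number": "2.2", "index": "4"},
]}}

print(find_section_index(sample, "Parts"))  # prints 4
```

Note that the API returns the index as a string, and that it is the "index" field, not "number", that goes into the section parameter.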
Then request the wikitext of the section with index 3 (the "wikitext" request above). Response:
{
"parse": {
"title": "House",
"pageid": 13590,
"wikitext": {
"*": "=== Layout ===\n[[File:Gingerbread House Essex CT.jpg|thumb|Example of an early [[Victorian architecture|Victorian]] \"Gingerbread House\" in [[Connecticut]], United States, built in 1855]]\n\nIdeally, [[architect]]s of houses design [[room]]s to meet the needs of the people who will live in the house. [[Feng shui]], originally a [[China|Chinese]] method of moving houses according to such factors as rain and micro-climates, has recently expanded its scope to address the design of interior spaces, with a view to promoting harmonious effects on the people living inside the house, although no actual effect has ever been demonstrated. Feng shui can also mean the \"aura\" in or around a dwelling, making it comparable to the [[real estate|real-estate]] sales concept of \"indoor-outdoor flow\".\n\nThe [[square footage]] of a house in the United States reports the area of \"living space\", excluding the garage and other non-living spaces. The \"square metres\" figure of a house in Europe <!-- including Malta ? --> reports the area of the walls enclosing the home, and thus includes any attached garage and non-living spaces.<ref>{{Cite book|title=Land Management: Challenges and Strategies (First Edition)|last=Iyyer|first=Chaitanya|publisher=Global India Publications Pvt Ltd|year=2009|isbn=978-9380228488|location=|pages=}}</ref>{{Citation needed|date=February 2007}} The number of floors or levels making up the house can affect the square footage of a home."
}
}
}
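Both requests can be chained so that the caller only supplies a page title and a section heading. This is a sketch rather than a finished client: it assumes Python's standard library, the function names are made up, the User-Agent string is a placeholder you should replace with your own, and there is no error handling for missing pages or API errors.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def call_parse_api(**params):
    """Send an action=parse request and return the decoded JSON (needs network access)."""
    query = urlencode({"action": "parse", "format": "json", **params})
    # Placeholder User-Agent; replace it with one that identifies your own tool.
    request = Request(API_ENDPOINT + "?" + query,
                      headers={"User-Agent": "section-example/0.1"})
    with urlopen(request) as reply:
        return json.load(reply)

def get_section_wikitext(page, heading, call_api=call_parse_api):
    """Look up the section index by heading, then fetch that section's wikitext."""
    sections = call_api(page=page, prop="sections")["parse"]["sections"]
    index = next((s["index"] for s in sections if s["line"] == heading), None)
    if index is None:
        raise KeyError(f"no section titled {heading!r} in {page!r}")
    reply = call_api(page=page, prop="wikitext", section=index)
    return reply["parse"]["wikitext"]["*"]
```

Calling get_section_wikitext("House", "Parts") should then return the wikitext of the "Parts" section from the question, provided the article still has a section with that heading. The call_api parameter exists only so the lookup logic can be exercised without network access.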
Background: The idea of sections in a page is not integrated into revisions (yet). A revision is "just" the content of the whole page plus additional metadata (e.g. in multiple other slots), and the sections are part of that content (which is only one slot of the revision). That's why the revision query API can only give you the whole text. The page needs to be parsed in order to know what the sections are, because sections are a concept of wikitext and therefore involve the parser.