wikipedia-api wikimedia-dumps

Are the abstracts in enwiki-latest-abstract.xml.gz corrupted?


I've been looking at the Wikimedia abstracts dump file (enwiki-latest-abstract.xml.gz) for the last week and noticed that the abstracts for many items appear to be corrupted.

For example, the dump contains the following abstract for the Wikipedia page on Alabama:

<title>Wikipedia: Alabama</title>
<url>https://en.wikipedia.org/wiki/Alabama</url>
<abstract>(We dare defend our rights)</abstract>

Similarly, the abstract for the Abraham Lincoln item is:

<title>Wikipedia: Abraham Lincoln</title>
<url>https://en.wikipedia.org/wiki/Abraham_Lincoln</url>
<abstract>| term_start1 = March 4, 1847</abstract>

This appears to be a partial snippet from the infobox.

This kind of corruption seems to be present for a majority of items in enwiki-latest-abstract.xml.gz.
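For what it's worth, here is a rough way to put a number on "a majority". It is only a sketch: it assumes each record in the dump is wrapped in a <doc> element (the enclosing element is not shown in the excerpts above), and the function names and the looks_like_markup_residue heuristic are my own guesses at what "corrupted" looks like, not anything official.

```python
import gzip
import xml.etree.ElementTree as ET


def looks_like_markup_residue(abstract):
    """Heuristic (an assumption, not an official rule): flag abstracts that
    are empty, start with infobox/template syntax, or are a lone parenthetical."""
    text = (abstract or "").strip()
    if not text:
        return True
    if text.startswith(("|", "{{", "}}")):
        return True
    return text.startswith("(") and text.endswith(")")


def count_suspicious(stream):
    """Stream records from an open XML file object.

    Assumes each record is a <doc> element with an <abstract> child.
    Returns a (suspicious, total) tuple.
    """
    suspicious = total = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "doc":
            continue
        total += 1
        if looks_like_markup_residue(elem.findtext("abstract")):
            suspicious += 1
        elem.clear()  # drop this record's children once it has been counted
    return suspicious, total


def count_suspicious_in_dump(path):
    """Convenience wrapper: open the gzipped dump and count in one pass."""
    with gzip.open(path, "rb") as stream:
        return count_suspicious(stream)
```

Calling count_suspicious_in_dump("enwiki-latest-abstract.xml.gz") would then give the two counts for the whole file. The heuristic will misfire on some legitimate abstracts, so the result is an estimate at best.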

I'd appreciate any advice anyone has on whether this is a bug or whether I have a misunderstanding about this dump file.

Thanks!


Solution

  • This is probably just the extraction code behaving badly; it's not very sophisticated.

    FWIW Wikipedia has two different extract/summary APIs, which both seem to behave reasonably here (the older, api.php-based one is a bit broken but not completely broken):

    https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=extracts&titles=Alabama%7CAbraham%20Lincoln&exsentences=1&exintro=1&explaintext=1

    Alabama () is a state in the southeastern region of the United States.

    Abraham Lincoln (; February 12, 1809 – April 15, 1865) was an American statesman and lawyer who served as the 16th president of the United States (1861–1865).

    https://en.wikipedia.org/api/rest_v1/page/summary/Alabama

    Alabama is a state in the southeastern region of the United States. It is bordered by Tennessee to the north, Georgia to the east, Florida and the Gulf of Mexico to the south, and Mississippi to the west. Alabama is the 30th largest by area and the 24th-most populous of the U.S. states. With a total of 1,500 miles (2,400 km) of inland waterways, Alabama has among the most of any state.

    https://en.wikipedia.org/api/rest_v1/page/summary/Abraham_Lincoln

    Abraham Lincoln was an American statesman and lawyer who served as the 16th president of the United States (1861–1865). Lincoln led the nation through its greatest moral, constitutional, and political crisis in the American Civil War. He preserved the Union, abolished slavery, strengthened the federal government, and modernized the U.S. economy.

    Neither of those has a dump, though.
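Since there is no dump, the summaries have to be fetched per page. A minimal sketch of doing that against the REST endpoint above, assuming the response JSON carries the text in an "extract" field; the helper names and the User-Agent string are placeholders I made up, so swap in something that identifies your own script:

```python
import json
import urllib.parse
import urllib.request

REST_SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title):
    """Build the REST summary URL for a page title."""
    # Page URLs use underscores for spaces; percent-encode everything else.
    return REST_SUMMARY + urllib.parse.quote(title.replace(" ", "_"), safe="")


def extract_from_summary(payload):
    """Pull the plain-text summary out of a decoded response (assumed field: "extract")."""
    return payload.get("extract", "")


def fetch_summary(title, user_agent="abstract-check/0.1 (placeholder)"):
    """Fetch one page summary over HTTP; performs a network request."""
    request = urllib.request.Request(
        summary_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_from_summary(json.load(response))
```

For example, fetch_summary("Abraham Lincoln") requests the same URL as the one shown above. Doing this for every article would mean millions of requests, so it is only practical for a limited set of titles.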