mediawiki wikimedia-dumps

Understanding wikimedia dumps


I'm trying to parse the latest Wikisource dump. More specifically, I would like to get all the pages in Category:Ballads. For this purpose I downloaded the https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-articles.xml.bz2 dump. In this dump, the entry for the category page contains everything except the list of pages that belong to the category:

<page>
    <title>Category:Ballads</title>
    <ns>14</ns>
    <id>115796</id>
    <revision>
      <id>4753508</id>
      <parentid>4003780</parentid>
      <timestamp>2014-01-25T16:21:08Z</timestamp>
      <contributor>
        <username>EmausBot</username>
        <id>983607</id>
      </contributor>
      <minor />
      <comment>Bot: Migrating 2 interwiki links, now provided by [[Wikipedia:Wikidata|Wikidata]] on [[d:Q8286819]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="51" xml:space="preserve">[[Category:Song lyrics]]
[[Category:Poems by form]]</text>
      <sha1>43eusqpjj6kaqcp6nl1tcmo4ass36ia</sha1>
    </revision>
</page>

My question is, how do I get the actual page content and all the links in this page?

Thank you!


Solution

  • You downloaded the wrong dump. Category membership is stored in the categorylinks table, not in the category page itself, so if you're interested in categorylinks you need to download https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-categorylinks.sql.gz, for instance.

    If you want XML format, you would need to parse this information yourself, from raw wikitext. For that, you can use https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-meta-current.xml.bz2.

    EDIT per comments:

    enwikisource-latest-pages-meta-current.xml doesn't contain machine-readable information about categories; it only contains the current content of each page. You would need to look at the text XML element, which holds the raw wikitext stored in the page. Usually, at the end of the content, it has something like this:

    [[Category:American Civil War]]
    [[category:American speeches]]
    

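    This indicates that the page is in the categories "American Civil War" and "American speeches". The "Category:" prefix is case-insensitive, which is why the lowercase "category:" form also works.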

    If you want already-parsed information, you would need to work with the .sql file, as far as I know.
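
The XML route described above can be sketched roughly as follows. This is a minimal illustration, not a complete parser: the names `categories_in`, `pages_in_category` and `local_name` are made up for this example, the regular expression only recognises literal [[Category:...]] links in the wikitext, and it assumes category names start with an uppercase letter. Categories that a page receives through a template will not be found this way, which is one more reason the categorylinks .sql dump is likely to be more complete.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# The namespace prefix is matched case-insensitively.
CATEGORY_LINK = re.compile(
    r"\[\[\s*category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_in(wikitext):
    """Return the category names linked from a page's raw wikitext."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext or ""):
        name = match.group(1).replace("_", " ").strip()
        # Assumption: the wiki capitalises the first letter of titles.
        names.append(name[:1].upper() + name[1:])
    return names


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_in_category(dump_path, category):
    """Yield titles of pages whose wikitext links to `category`."""
    with bz2.open(dump_path, "rb") as stream:
        # iterparse streams the file instead of loading it all at once.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                tag = local_name(child.tag)
                if tag == "title":
                    title = child.text
                elif tag == "text":
                    text = child.text or ""
            if category in categories_in(text):
                yield title
            # Drop the children of the finished page to limit memory use.
            elem.clear()
```

With the dump from the answer saved locally, something like `for title in pages_in_category("enwikisource-latest-pages-meta-current.xml.bz2", "Ballads"): print(title)` would print the matching titles. Scanning the whole dump this way can take a while.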