python wikipedia

How to extract data from a wikilink?


I want to extract data from the wikilinks returned by the mwparserfromhell library. For instance, I want to parse the following string:

[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]

Splitting the string on the | character doesn't work, because the image description contains a link that also uses |: [[Maria Skłodowska-Curie Museum|Birthplace]].

I'm using a regex to first replace all links in the string with their labels before splitting it. It works (in this case), but it doesn't feel clean (see code below). Is there a better way to extract information from such a string?

import re

wiki_code = "[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]"

# Remove [[File: at the beginning of the string
prefix = "[[File:"
if wiki_code.startswith(prefix):
    wiki_code = wiki_code[len(prefix):]

# Remove ]] at the end of the string
suffix = "]]"
if wiki_code.endswith(suffix):
    wiki_code = wiki_code[:-len(suffix)]

# Replace links with their labels
link_pattern = re.compile(r'\[\[.*?\]\]')
matches = link_pattern.findall(wiki_code)
for match in matches:
    content = match[2:-2]
    arr = content.split("|")
    label = arr[-1]
    wiki_code = wiki_code.replace(match, label)

print(wiki_code.split("|"))

Solution

  • The links returned by .filter_wikilinks() are instances of the Wikilink class, which has title and text properties.

    Both properties are returned as Wikicode objects.

    Since the caption is usually the last fragment of the text, first match the fragments that precede it with the following regex:

    ([^\[\]|]*\|)+

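    The regex matches one or more pipe-terminated fragments that contain no brackets, so it stops at the first fragment that contains a link. Everything from the end of that match to the end of the string is the last fragment, i.e. the caption.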

    >>> import mwparserfromhell
    >>> import re
    >>> wikitext = mwparserfromhell.parse('[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]')
    >>> image_link = wikitext.filter_wikilinks()[0]
    >>> image_link
    '[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]'
    >>> image_link.title
    'File:Warszawa, ul. Freta 16 20170516 002.jpg'
    >>> text = str(image_link.text)
    >>> text
    'thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
    >>> other_fragments = re.match(r'([^\[\]|]*\|)+', text)
    >>> other_fragments
    <re.Match object; span=(0, 19), match='thumb|upright=1.18|'>
    >>> other_fragments.span(0)[1]
    19
    >>> text[19:]
    '[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
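
The steps above can be wrapped in a small helper. This is only a minimal sketch: it uses the standard re module and takes the string form of image_link.text as input, the name split_options_and_caption is made up for illustration, and it assumes the caption really is the last fragment.

```python
import re

# Regex from the answer: one or more pipe-terminated fragments that
# contain neither brackets nor pipes.
LEADING_OPTIONS = re.compile(r'([^\[\]|]*\|)+')


def split_options_and_caption(text):
    """Return (options, caption), assuming the caption is the last fragment."""
    match = LEADING_OPTIONS.match(text)
    if match is None:
        # No pipe-terminated fragment at the start: treat everything as caption.
        return [], text
    end = match.end()
    # The matched prefix ends with a pipe, so the last split item is empty.
    options = text[:end].split('|')[:-1]
    return options, text[end:]


text = ('thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] '
        'of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].')
options, caption = split_options_and_caption(text)
print(options)
print(caption)
```

With mwparserfromhell you would call it as split_options_and_caption(str(image_link.text)).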
    

    When the caption is not the last fragment

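    For such edge cases, take the top-level nodes of the text property, split only the Text nodes on |, and regroup the pieces with itertools functions, so that pipes inside templates or nested links are left untouched: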

    >>> import mwparserfromhell
    >>> import re
    >>> from itertools import chain, groupby
    >>> wikitext = mwparserfromhell.parse('[[File:Marie Curie - Mobile X-Ray-Unit.jpg|thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt=]]')
    >>> image_link = wikitext.filter_wikilinks()[0]
    >>> image_link.text
    'thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt='
    >>> child_nodes = image_link.text.filter(recursive=False)
    >>> child_nodes
    ['thumb|Curie in a mobile X-ray vehicle, ', '{{circa|1915}}', '|alt=']
    >>> isinstance(child_nodes[0], mwparserfromhell.nodes.Text)
    True
    >>> isinstance(child_nodes[1], mwparserfromhell.nodes.Template)
    True
    >>> tokens = list(chain.from_iterable(re.split(r'(\|)', str(node)) if isinstance(node, mwparserfromhell.nodes.Text) else [node] for node in child_nodes))
    >>> tokens
    ['thumb', '|', 'Curie in a mobile X-ray vehicle, ', '{{circa|1915}}', '', '|', 'alt=']
    >>> fragments = []
    >>> for is_not_pipe, group in groupby(tokens, key=lambda token: token != '|'):
    ...   if is_not_pipe:
    ...     fragments.append(''.join(map(str, group)))
    ...
    >>> fragments
    ['thumb', 'Curie in a mobile X-ray vehicle, {{circa|1915}}', 'alt=']
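
For comparison, here is a rough standard-library-only alternative to the node-based approach above. It is not part of mwparserfromhell and not the method shown in the session: it simply splits on pipes while tracking [[...]] and {{...}} nesting by hand. The function name split_top_level_pipes is hypothetical, and the sketch assumes balanced brackets and ignores constructs such as <nowiki> or HTML comments, so the parser-based version is likely to be more robust on real wikitext.

```python
def split_top_level_pipes(text):
    """Split text on '|' characters that are outside [[...]] and {{...}}."""
    fragments = []
    current = []
    depth = 0  # nesting depth of [[...]] and {{...}} combined
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ('[[', '{{'):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in (']]', '}}') and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == '|' and depth == 0:
            # Top-level pipe: close the current fragment.
            fragments.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    fragments.append(''.join(current))
    return fragments


fragments = split_top_level_pipes(
    'thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt=')
print(fragments)
```

It takes a plain string, so with mwparserfromhell you would pass str(image_link.text).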