python mediawiki wikipedia mediawiki-api pywikibot

How do I use pywikibot.Page(site, title).text when the title has an unescaped apostrophe (')?


I have a list of strings called cities, where each string is a city name that is also the title of a Wikivoyage page. For each city, I fetch the page and then look at its text content:

import pywikibot

cities = [(n["name"]) for n in graph.nodes.match("City")]
# The site object does not depend on the city, so create it once.
site = pywikibot.Site(code="en", fam="wikivoyage")
for city in cities:
    page = pywikibot.Page(site, city)
    text = page.text

One of the cities in my list is L'Aquila, and page.text was not returning anything for it (whereas other entries worked). I figured that was because of the ' in the name, so I used re.sub to escape the ' and passed in that result instead. This gives me what I expected:

cities = [(n["name"]) for n in graph.nodes.match("City")]
city = "L'Aquila"
altered_city = re.sub("'",  "\'", city)
print(altered_city)
site = pywikibot.Site(code="en", fam="wikivoyage")
page = pywikibot.Page(site, altered_city)
print(page)
print(page.text)

Result:

[[wikivoyage:en:L'Aquila]]
{{pagebanner|Pagebanner default.jpg}}
'''L'Aquila''' is the capital of the province of the same name in the region of [[Abruzzo]] in [[Italy]] and is located in the northern part of the..

The problem is that I don't want to hard-code the city name; I want to use the strings from my list. When I pass one of those in, I get no output for page.text:

cities = [(n["name"]) for n in graph.nodes.match("City")]
city_from_list = cities[0]
print(city_from_list)
print(type(city_from_list))
altered_city = re.sub("'",  "\'", city_from_list)
site = pywikibot.Site(code="en", fam="wikivoyage")
page = pywikibot.Page(site, altered_city)
print(page)
print(page.text)

Result:

L'Aquila
<class 'str'>
[[wikivoyage:en:L'Aquila]]

I printed out the value and type of the city element I'm getting from the list, and it is a str, so I have no idea why it worked above but not here. How are these different?


Solution

  • re.sub("'", "\'", city) does not do anything:

    >>> city = "L'Aquila"
    >>> re.sub("'",  "\'", city)
    "L'Aquila"
    >>> city == re.sub("'",  "\'", city)
    True
    

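    In a string literal, Python treats "\'" as "'": the backslash only escapes the quote and is not part of the resulting string. See the escape-sequence table under "String and Bytes literals" in the Lexical analysis chapter of the Python documentation.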

    I don't know why the second snippet is not working for you; it should behave exactly like the first. Maybe the last line was simply not executed. Even if page.text had returned None, print would have printed None. Try print(type(page.text)), or print(repr(page.text)) so that an empty string becomes visible.
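
Since the substitution is a no-op, any difference between the two runs has to come from the string itself. One way to check is to look at its individual characters, as in the minimal sketch below. The helper name describe and the from_db value are made up for illustration: the typographic apostrophe and trailing space are only a guess at what could differ, not something the question confirms.

```python
import unicodedata


def describe(title):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in title
    ]


# "\'" is just an apostrophe: the backslash is consumed by the string literal.
assert "\'" == "'" and len("\'") == 1

literal = "L'Aquila"
# Hypothetical value as it might come back from the database:
# a typographic apostrophe (U+2019) and a trailing space.
from_db = "L\u2019Aquila "

print(literal == from_db)  # False
print(repr(from_db))
for row in describe(from_db):
    print(row)
```

Running describe(cities[0]) on the real list element and comparing it with describe("L'Aquila") would show whether the two strings are in fact identical. If they are, page.exists() is a quick way to see whether pywikibot finds the page at all.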