python encoding geocode

Geocoding returns wrong town when applied to dataframe - returns correct one when applied to text


I have a silly problem: I am trying to geocode several towns in Germany whose names are stored in a column of a dataframe. The geocoding works for all towns except one: Münster.

My dataframe looks something like this (pasting the interesting part here):

68  1034291939  Lüneburg        2010    330       (Lüneburg, Niedersachsen, Deutschland, (53.248...
69  1003394345  Kassel          2010    330       (Kassel, Niestetal, Hessen, Deutschland, (51.3...
70  100975873X  Magdeburg       2010    330.157   (Magdeburg, Sachsen-Anhalt, Deutschland, (52.1...
71  1191280594  Ludwigsburg     2010    340       (Ludwigsburg, Landkreis Ludwigsburg, Baden-Wür...
72  1010526499  Potsdam         2010    338.947   (Potsdam, Brandenburg, Deutschland, (52.400930...
73  1008154156  Duisburg        2010    658.72    (Duisburg, Nordrhein-Westfalen, Deutschland, (...
74  1011028336  Münster         2010    330       (Munster, Éire / Ireland, (52.307621600000004,...
75  1008507016  Jena            2010    338.04    (Jena, Thüringen, Deutschland, (50.9281717, 11...

As you can see, other town names in the dataframe contain special characters such as "ü" ("Lüneburg") without causing any problems, but Münster - for whatever reason - gets geocoded to Munster in Ireland. For the record, I use the following code to geocode the town names:

df_geo['geo_code'] = df_short.Ort.apply(geocode)

I have tried checking whether this is a general problem with the database, but if I run location = geocode('Münster') it returns Münster, Nordrhein-Westfalen, Deutschland just fine. I am at a total loss as to why this doesn't work when I apply it to the dataframe. I thought it could be a problem with the ü, but the table I read the data from is encoded in UTF-8 and displays the ü correctly when I open it (e.g. in Excel). Does anyone have an idea what the problem might be and how I can fix it? Do I need to encode or decode the data differently before running the geocoding on the dataframe?

Edit: Edited the dataframe so that the different columns are easier to spot

Edit2: To fix this problem, I have now tried extracting the location names into a list and looping over that list with the geocoder, but I still have the same issue: all names convert correctly apart from Münster, which always gets coded as Munster, Ireland. I tried the following code:

for entry in places:
    location = geocode(entry)  # geocode each name, not the whole list
    print(location)

In addition, I then tried geocoding the string again and all of a sudden Münster turned into Munster as well. Interestingly enough, I have now found that there seems to be a difference between using " " and ' ' for the string:

location = geocode("Münster")
print(location)
location2 = geocode('Münster')
print(location2)

returns:

Munster, Éire / Ireland
Münster, Nordrhein-Westfalen, Deutschland

Why is that? I am now thinking that the problem is that when I use a variable (places), it somehow treats the value as if it were in double quotes, while I need it to be treated as if it were in single quotes. How can I change that? I suppose converting all the names to strings isn't really going to solve the problem (as they're already strings) and would be unnecessarily complicated...


Solution

  • So, after a lot of trial and error I finally discovered that there was a problem with the underlying encoding (the strings were apparently not in Unicode NFC form). I added the following bit of code and now the geocoding runs through smoothly and translates "Münster" to Münster in Germany.

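    import unicodedata
    
    def normalize(text):
        # Convert to NFC so that "ü" becomes a single precomposed code point
        return unicodedata.normalize('NFC', text)
    
    # Normalize the town names before geocoding
    df_geo['Ort'] = df_geo['Ort'].apply(normalize)
    
    places = df_geo['Ort'].tolist()
    
    for entry in places:
        location = geocode(entry)
        print(location)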
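
For anyone wondering why normalizing helps, here is a minimal sketch of what was probably going on. It is an assumption on my part: I did not inspect the original data byte by byte. Two strings can both display as "Münster" and still differ in their code points, one using the precomposed "ü" (NFC) and the other a plain "u" followed by a combining diaeresis (NFD). That would also explain the apparent double-quote vs. single-quote difference: Python treats both quote styles identically, so the two literals most likely contained different code points (e.g. one typed, one pasted). The names composed, decomposed and describe are made up for this illustration.

```python
import unicodedata

# Two strings that render identically as "Münster" but differ in code points.
composed = "M\u00fcnster"     # precomposed U+00FC (NFC form)
decomposed = "Mu\u0308nster"  # "u" followed by U+0308 COMBINING DIAERESIS (NFD form)

def describe(text):
    """Return the Unicode name of each code point to expose hidden differences."""
    return [unicodedata.name(ch) for ch in text]

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 7 8
print(describe(decomposed)[1:3])       # ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

# After NFC normalization the two strings compare equal.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True
```

If the names live in a pandas column, df_geo['Ort'].str.normalize('NFC') should do the same thing as the apply call above without a helper function.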