I have a silly problem: I am trying to geocode several towns in Germany whose names are stored in a column of a dataframe. The geocoding works for all towns except one: Münster.
My dataframe looks something like this (pasting the interesting part here):
68 1034291939 Lüneburg 2010 330 (Lüneburg, Niedersachsen, Deutschland, (53.248...
69 1003394345 Kassel 2010 330 (Kassel, Niestetal, Hessen, Deutschland, (51.3...
70 100975873X Magdeburg 2010 330.157 (Magdeburg, Sachsen-Anhalt, Deutschland, (52.1...
71 1191280594 Ludwigsburg 2010 340 (Ludwigsburg, Landkreis Ludwigsburg, Baden-Wür...
72 1010526499 Potsdam 2010 338.947 (Potsdam, Brandenburg, Deutschland, (52.400930...
73 1008154156 Duisburg 2010 658.72 (Duisburg, Nordrhein-Westfalen, Deutschland, (...
74 1011028336 Münster 2010 330 (Munster, Éire / Ireland, (52.307621600000004,...
75 1008507016 Jena 2010 338.04 (Jena, Thüringen, Deutschland, (50.9281717, 11...
As you can see, other town names in the dataframe contain special characters such as "ü" ("Lüneburg") without causing any problems, but Münster, for whatever reason, gets geocoded to Munster in Ireland. For the record, I use the following code to geocode the town names:
df_geo['geo_code'] = df_short.Ort.apply(geocode)
I have tried checking if this is a general problem with the database, but if I run
location = geocode('Münster')
it returns Münster, Nordrhein-Westfalen, Deutschland
just fine. I am at a total loss as to why this doesn't work when I apply it to the dataframe. I thought it could be a problem with the ü, but the table I read the data from is encoded in UTF-8 and displays the ü correctly when I open it (e.g. in Excel). Does anyone have an idea what the problem might be and how I can fix it? Do I need to encode or decode the data differently before running the geocoding on the dataframe?
Edit: Edited the dataframe so that the different columns are easier to spot
Edit2: To fix this problem, I have now tried extracting the location names and dumping them to a list to then run that list with a loop through the geocoder, but I still have the same issue: all names convert correctly apart from Münster, which always gets coded as Munster, Ireland. I tried the following code:
for entry in places:
    location = geocode(entry)
    print(location)
In addition, I then tried geocoding the string again, and all of a sudden Münster turned into Munster as well. Interestingly enough, I have now found out that there seems to be a difference between using " " and ' ' for the string:
location = geocode("Münster")
print(location)
location2 = geocode('Münster')
print(location2)
returns:
Munster, Éire / Ireland
Münster, Nordrhein-Westfalen, Deutschland
Why is that? So I am now thinking that the problem is that when I use a variable (places), it somehow treats the string as if it were in double quotes, while I need it to be treated as if it were in single quotes. How can I change that? I suppose converting all the names to strings isn't really going to solve the problem (as they're already strings) and would be unnecessarily complicated...
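For what it's worth, Python treats single- and double-quoted literals identically, so a difference like the one above usually means the two strings contain different code points even though they look the same on screen. Below is a minimal sketch for checking that. It assumes that one of the strings holds the decomposed spelling (a plain "u" followed by a combining diaeresis), which I could not confirm from the output alone; the helper name code_points is made up for illustration.

```python
import unicodedata

def code_points(text):
    """Return (hex code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Assumption: these two spellings stand in for the strings that behaved differently.
composed = "M\u00fcnster"      # ü as one precomposed character
decomposed = "Mu\u0308nster"   # plain u followed by a combining diaeresis

print(composed == decomposed)            # the strings are not equal
print(len(composed), len(decomposed))    # they differ in length by one
print(code_points(composed)[1])          # the single precomposed ü
print(code_points(decomposed)[1:3])      # u, then the combining mark
```

Running code_points on a value taken straight from the dataframe column shows which of the two spellings is actually stored there.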
So, after a lot of trial and error, I finally discovered that there was a problem with the underlying encoding. I added the following bit of code, and now the geocoding runs through smoothly and resolves "Münster" to Münster in Germany.
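import unicodedata
def normalize(text):
    # Compose characters into NFC form (e.g. "u" + combining diaeresis becomes a single "ü")
    return unicodedata.normalize('NFC', text)
# Normalize the town names before geocoding them
df_geo['Ort'] = df_geo['Ort'].apply(normalize)
places = df_geo['Ort'].tolist()
for entry in places:
    location = geocode(entry)
    print(location)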
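My best guess as to why only Münster was affected is that this one entry was stored in decomposed form while the other names were already composed. The sketch below shows one way to find such entries before geocoding. The list of names is made-up sample data, not my real column, and find_non_nfc is a hypothetical helper name.

```python
import unicodedata

def find_non_nfc(names):
    """Return the entries whose NFC form differs from the stored string."""
    return [name for name in names if unicodedata.normalize("NFC", name) != name]

# Hypothetical sample: only the third entry uses u + combining diaeresis.
places = ["L\u00fcneburg", "Kassel", "Mu\u0308nster", "Jena"]

suspect = find_non_nfc(places)
print([ascii(name) for name in suspect])   # ascii() makes the hidden combining mark visible

normalized = [unicodedata.normalize("NFC", name) for name in places]
print(find_non_nfc(normalized))            # empty list once everything is in NFC
```

If the names live in a pandas column, Series.str.normalize('NFC') should do the same thing as the apply call above, although I have only tested the apply version.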