I'm looking at writing a mashup app that will take submission titles from a subreddit and attempt to plot them on a map based on where they are likely to be relevant. I'd also like to add on things like Twitter later on.
What I'm having difficulty planning is how to detect the most likely relevant country from the title. My first guess is to keep a list of countries along with their matching variants (e.g. "English" matches "England") and check for occurrences of those items in the text. However, this will probably be quite slow and will require me to list the possessive* name for each country.
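A minimal sketch of that lookup idea, assuming a small hand-written alias table. The table entries and the function name `detect_countries` are hypothetical and only for illustration, not taken from any library. Compiling all aliases into one regular expression means each title is scanned in a single pass, which may ease the speed concern, although the alias list still has to be written by hand.

```python
import re

# Hypothetical alias table (an assumption for illustration): each lowercase
# alias, including adjectival forms, maps to the country it should count for.
ALIASES = {
    "england": "England",
    "english": "England",
    "france": "France",
    "french": "France",
    "germany": "Germany",
    "german": "Germany",
    "germans": "Germany",
}

# Build one alternation of all aliases, longest first, bounded by \b so that
# "french" does not match inside "Frenchman".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def detect_countries(title):
    """Return the countries mentioned in title, most frequently mentioned first."""
    counts = {}
    for match in _PATTERN.finditer(title):
        country = ALIASES[match.group(1).lower()]
        counts[country] = counts.get(country, 0) + 1
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(counts, key=lambda c: (-counts[c], c))


if __name__ == "__main__":
    print(detect_countries("French and German leaders meet English PM in France"))
```

The first element of the returned list would be the "most likely relevant" country under this simple counting rule; an empty list means no alias was found.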
I'm planning on doing this in Python (so as to learn it), so I'm wondering whether there is a) a library that does this (and that I can learn from) or b) a more obvious way to do it.
To give an idea of the types of input I'm working with, here are some samples and what I'm trying to get out of them:
* This is probably the wrong word to use
You can look into the Yahoo! Placemaker API.
Placemaker provides geo-enrichment for the hugely significant proportion of Web content that is geographically relevant but not geographically discoverable. Provided with free-form text, the service identifies places mentioned in text, disambiguates those places, and returns unique identifiers (WOEIDs) for each, as well as information about how many times the place was found in the text, and where in the text it was found. The WOEIDs returned by the service can be passed to Yahoo!'s GeoPlanet™ API for further geographic enrichment and discovery.