[SOLVED] Extracting a country name from a text string

I'm looking at writing a mashup app that will take submission titles from a subreddit and attempt to plot them on a map based on where they are likely to be relevant. I'd also like to add on things like Twitter later on.

What I'm having difficulty with is how to detect the most relevant country from the title. My first guess is to keep a list of countries along with their variant forms (e.g. "English" matches "England") and check for occurrences of those items in the text. However, this is probably going to be quite slow, and it will require me to list the possessive* name for each country.

I'm planning on doing this in Python (so as to learn it), so I'm wondering: is there a) a library that does this (and that I can learn from), or b) a more obvious way to do this?

To give an idea of the types of input I'm working with, here are some samples and what I'm trying to get out of them:

"Well they can't arrest all of us - Giving the middle finger to the British legal system (pic)"
- Keyword: British (Great Britain)
"Poll: Wikileaks Assange leading Time 'Person of the Year' - Assange, an Australian who has become a thorn in the side of the Pentagon with his releases of secret US military documents about the wars in Iraq and Afghanistan, had received 21,736 votes as of Friday."
- Keywords: Afghanistan, Iraq, [Australian] (Afghanistan, Iraq, [Australia]) - Australia is mostly irrelevant here and would be difficult to rule out, but this is acceptable for my purposes
"Cyber attack on Nobel peace prize website launched. Stay classy, China."
- Keyword: China (China)
"A Jewish surgeon refuses to operate on a patient and walks out of the operating room after discovering a nazi tattoo on the patient's arm."
- Keywords: none - acceptable for my purposes

* This is probably the wrong word to use

Solution

You can look into the Yahoo! Placemaker API:

Placemaker provides geo-enrichment for the hugely significant proportion of Web content that is geographically relevant but not geographically discoverable. Provided with free-form text, the service identifies places mentioned in text, disambiguates those places, and returns unique identifiers (WOEIDs) for each, as well as information about how many times the place was found in the text, and where in the text it was found. The WOEIDs returned by the service can be passed to Yahoo!'s GeoPlanet™ API for further geographic enrichment and discovery.
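
If calling an external service is not an option, the list-based idea from the question can also be tried locally. The following is only a minimal sketch of that idea, not a replacement for a real geoparsing service. The ALIASES table and the find_countries function are hypothetical names made up for this illustration, and the table holds just enough entries to cover the sample titles. Compiling every alias into one regular expression means each title is scanned once, which may ease the speed concern, and the word boundaries stop "china" from matching inside a word like "machinations".

```python
import re

# Hypothetical alias table (an assumption for this sketch): each surface form,
# whether a country name or its adjective, maps to one canonical country name.
# A real table would need an entry for every country and every variant.
ALIASES = {
    "britain": "Great Britain",
    "british": "Great Britain",
    "england": "England",
    "english": "England",
    "australia": "Australia",
    "australian": "Australia",
    "iraq": "Iraq",
    "iraqi": "Iraq",
    "afghanistan": "Afghanistan",
    "afghan": "Afghanistan",
    "china": "China",
    "chinese": "China",
}

# Build a single alternation, longest aliases first, wrapped in word
# boundaries so that only whole words match. Matching is case-insensitive.
PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(alias) for alias in sorted(ALIASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_countries(title):
    """Return canonical country names in order of first appearance, no duplicates."""
    found = []
    for match in PATTERN.finditer(title):
        country = ALIASES[match.group(1).lower()]
        if country not in found:
            found.append(country)
    return found


if __name__ == "__main__":
    print(find_countries("Cyber attack on Nobel peace prize website launched. Stay classy, China."))
```

This only spots names and adjectives that are in the table. It does not disambiguate places or judge relevance, so a title that mentions "Australian" in passing will still report Australia, which matches the behaviour the question describes as acceptable.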