parsing machine-learning canonical-form

Canonicalize NFL team names


This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)

The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.

I should also add that if anyone knows of a data source containing both moneyline Vegas odds and actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up two disparate data sets, one with odds and one with outcomes.

Ideas for better, more parsable, sources of data are very welcome!

Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the smallest Levenshtein distance to the input?
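On the Levenshtein idea, here is a minimal Python sketch (Python is used only for illustration, since the question is language-agnostic) of picking the team with the smallest edit distance. The `ALIASES` table and the function names are hypothetical, and only a few teams are listed. Each team gets several aliases because comparing a short input such as "49ers" against the full name alone would be dominated by the missing city.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical alias table (an assumption for this sketch); a real one
# would cover all 32 teams.
ALIASES = {
    "San Francisco 49ers": ["San Francisco 49ers", "San Francisco", "49ers"],
    "Seattle Seahawks": ["Seattle Seahawks", "Seattle", "Seahawks"],
    "New York Giants": ["New York Giants", "NY Giants", "Giants"],
    "New York Jets": ["New York Jets", "NY Jets", "Jets"],
}


def nearest_team(s: str, aliases=ALIASES) -> str:
    """Return the canonical name whose closest alias has the smallest distance to s."""
    key = s.strip().lower()
    return min(
        aliases,
        key=lambda team: min(levenshtein(key, a.lower()) for a in aliases[team]),
    )


print(nearest_team("49er"))          # San Francisco 49ers
print(nearest_team("SanFrancisco"))  # San Francisco 49ers
```

Whether this beats plain substring matching depends on how noisy the inputs are; for structured sources the alias list probably does most of the work.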


Solution

  • Here's something that should be plenty robust even for arbitrary user input. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled-out version with city and team name, followed by any nicknames or abbreviations (e.g., "NYG" or "forty-niners").

    Scan[(fullname[First@#] = #[[2]])&, {
      {"ari", "Arizona Cardinals"},                 {"atl", "Atlanta Falcons"}, 
      {"bal", "Baltimore Ravens"},                  {"buf", "Buffalo Bills"}, 
      {"car", "Carolina Panthers"},                 {"chi", "Chicago Bears"}, 
      {"cin", "Cincinnati Bengals"},                {"clv", "Cleveland Browns"}, 
      {"dal", "Dallas Cowboys"},                    {"den", "Denver Broncos"}, 
      {"det", "Detroit Lions"},                     {"gbp", "Green Bay Packers"}, 
      {"hou", "Houston Texans"},                    {"ind", "Indianapolis Colts"}, 
      {"jac", "Jacksonville Jaguars"},              {"kan", "Kansas City Chiefs"}, 
      {"mia", "Miami Dolphins"},                    {"min", "Minnesota Vikings"}, 
      {"nep", "New England Patriots"},              {"nos", "New Orleans Saints"}, 
      {"nyg", "New York Giants NYG"},               {"nyj", "New York Jets NYJ"}, 
      {"oak", "Oakland Raiders"},                   {"phl", "Philadelphia Eagles"}, 
      {"pit", "Pittsburgh Steelers"},               {"sdc", "San Diego Chargers"}, 
      {"sff", "San Francisco 49ers forty-niners"},  {"sea", "Seattle Seahawks"}, 
      {"stl", "St Louis Rams"},                     {"tam", "Tampa Bay Buccaneers"}, 
      {"ten", "Tennessee Titans"},                  {"wsh", "Washington Redskins"}}]
    

    Then, for any given string, find the longest common subsequence it shares with each team's full name. (Mathematica's LongestCommonSubsequence finds the longest contiguous run, i.e., a common substring.) To give preference to matches at the beginning or end of a word (e.g., "phi" should match "philadelphia eagles" rather than "miami dolphins"), sandwich both the input string and the full names between spaces. Whichever team's full name shares the longest such subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:

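    (* The 3-letter codes are the arguments that fullname was defined on above. *)
    teams = DownValues[fullname][[All, 1, 1, 1]];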
    
    (* argMax[f, domain] returns the element of domain for which f of that element is
       maximal -- breaks ties in favor of first occurrence. *)
    SetAttributes[argMax, HoldFirst];
    argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]
    
    canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ", 
                                     " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
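For readers without Mathematica, the same idea can be sketched in Python. This is a rough equivalent under stated assumptions, not a translation of the code above: it assumes contiguous matching (which is what Mathematica's LongestCommonSubsequence computes), the helper names are hypothetical, and `FULLNAME` holds only a subset of the table for brevity.

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous character of each string.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Subset of the code -> full name table, for illustration only.
FULLNAME = {
    "ari": "Arizona Cardinals",
    "car": "Carolina Panthers",
    "mia": "Miami Dolphins",
    "nyg": "New York Giants NYG",
    "nyj": "New York Jets NYJ",
    "phl": "Philadelphia Eagles",
    "sff": "San Francisco 49ers forty-niners",
    "sea": "Seattle Seahawks",
}


def canonicalize(s: str, fullname=FULLNAME) -> str:
    """Return the code whose padded full name shares the longest substring with s."""
    padded = f" {s.lower()} "  # lower() stands in for IgnoreCase -> True
    # max() keeps the first maximal element, so ties go to the earlier entry.
    return max(
        fullname,
        key=lambda code: longest_common_substring_len(
            padded, f" {fullname[code].lower()} "
        ),
    )


for query in ["San Francisco", "49ers", "SF forty-niners", "phi"]:
    print(query, "->", canonicalize(query))
```

Because ties go to the earlier entry, a short input that is a prefix of words in several full names may still need an explicit nickname added to the table.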