Tags: r, fuzzy-comparison, agrep, stringdist

Successively agrep names in a variable, then create a new variable with the shortest name for close matches


Assume a character vector of company names where the names come in various forms. Here is a small sample of a 10,000-row data frame; it also shows the desired second column ("two.name").

df <- structure(list(firm = structure(1:8, .Label = c("Carlson Caspers", 
"Carlson Caspers Lindquist & Schuman P.A", "Carlson Caspers Vandenburgh  Lindquist & Schuman P.A.", 
"Carlson Caspers Vandenburgh & Lindquist", "Carmody Torrance", 
"Carmody Torrance et al", "Carmody Torrance Sandak", "Carmody Torrance Sandak & Hennessey LLP"
), class = "factor"), two.name = structure(c(1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L), .Label = c("Carlson Caspers", "Carmody Torrance"
), class = "factor")), .Names = c("firm", "two.name"), row.names = c(NA, 
-8L), class = "data.frame")


                                               firm         two.name
1                                       Carlson Caspers  Carlson Caspers
2               Carlson Caspers Lindquist & Schuman P.A  Carlson Caspers
3 Carlson Caspers Vandenburgh  Lindquist & Schuman P.A.  Carlson Caspers
4               Carlson Caspers Vandenburgh & Lindquist  Carlson Caspers
5                                      Carmody Torrance Carmody Torrance
6                                Carmody Torrance et al Carmody Torrance
7                               Carmody Torrance Sandak Carmody Torrance
8               Carmody Torrance Sandak & Hennessey LLP Carmody Torrance

Assume the vector has been sorted alphabetically by firm name (which I believe puts the shortest version first). How can I use agrep() to start with the first company name, match it to the second and, assuming a close match, add the first company name to a new column (short.name) for both of them? Then match it to the third element, and so on, so that all the Carlson variations are matched.

If there is not a sufficient match, as when R encounters the first Carmody, start over with it and match to the next element, and so on until the next non-match.

If there is no match between consecutive companies, R should proceed until it finds a match.

The answer to the question "Create a unique ID by fuzzy matching of names (via agrep using R)" uses fuzzy matching on the entire vector and groups by years. It seems, however, to offer part of the code that would solve my problem. Another question, "stringdist", uses stringdist().

EDIT:

Below, the object matches is a list that shows matches, but I don't know the code to tell R to "take the first one and convert the following matches, if any, to that name and put that name in the new variable column."

df$firm <- as.factor(df$firm)
matches <- lapply(levels(df$firm), agrep, x=levels(df$firm), fixed=TRUE, value=FALSE)

Solution

  • I wrote it out as an explicit loop: take the first line as short.name, find its matches, update the data frame, then pick the next name to look for. That's what I meant by "do not try to solve this with a one-liner": you have to make it work first in a much more verbose way, so you can understand what's going on. Then, and ONLY if you NEED to, you can try to compress it into a one-liner.

    firm.txt <- as.character(df$firm)
    df$two.name <- NA_character_  # column that receives the short names
    i <- 1
    # while, not for: reassigning i inside for() does not affect the next iteration
    while (i <= length(firm.txt)) {
      short.name <- firm.txt[i]
      # approximate matches at or after row i (always includes row i itself)
      match <- agrep(short.name, firm.txt)
      match <- match[match >= i]
      df$two.name[match] <- short.name
      # the next candidate short name is the first row after the last match
      i <- max(match) + 1
    }
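
The same idea can be wrapped in a function that stops each group at the first non-match, which is closer to the "until the next non-match" wording in the question. This is only a sketch: the name `short_names` and the way `max.distance` is exposed are assumptions for illustration, not part of the original answer, and it relies on the names being sorted so that the shortest variant comes first.

```r
# Hypothetical helper (not from the original answer): label each run of
# consecutive, approximately matching names with the first name of the run.
# Assumes `firms` is sorted so that the shortest variant comes first.
short_names <- function(firms, max.distance = 0.1) {
  firms <- as.character(firms)
  out <- rep(NA_character_, length(firms))
  i <- 1
  while (i <= length(firms)) {
    # agrepl() returns one TRUE/FALSE per element of firms
    hit <- agrepl(firms[i], firms, max.distance = max.distance, fixed = TRUE)
    # extend the run from row i while the following rows keep matching
    j <- i
    while (j < length(firms) && hit[j + 1]) j <- j + 1
    out[i:j] <- firms[i]
    # start over with the first row that did not match
    i <- j + 1
  }
  out
}

# A few of the sample names from the question
firms <- c("Carlson Caspers",
           "Carlson Caspers Lindquist & Schuman P.A",
           "Carlson Caspers Vandenburgh & Lindquist",
           "Carmody Torrance",
           "Carmody Torrance et al",
           "Carmody Torrance Sandak & Hennessey LLP")
result <- short_names(firms)
print(data.frame(firm = firms, two.name = result))
```

With a real data frame the call would be `df$two.name <- short_names(df$firm)`. Raising `max.distance` makes the grouping more tolerant of spelling differences, at the cost of merging names that are merely similar, so it is worth checking the result on a sample before running it on all 10,000 rows.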