rstringtext-miningtext-analysis

Applying "String matching to estimate similarity" to data frame


String matching to estimate similarity

The above code is exactly what I am looking for, except I cannot seem to figure out how to compare the strings between columns (the "correct" answer and "given" answer) in a data frame and then storing the output from sim.per as a new column ("similarity") in that same data frame. I have tried .e.g,

df$similarity <- sim.per(df$answer, df$given) 

df$similarity <- mapply(sim.per, df$answer, df$given)

The latter also results in an error when the row is empty, which is acceptable in my dataset and should be calculated as 0 instead.

Error in str2[[1]] : subscript out of bounds

Expected output should be:

    answer                   given                              similarity
1   Best way to waste money  Instrument to waste money and time 0.6
2   Roy travels to Africa    He is in Africa                    0.25
3   I go to work                                                0

Any help would be appreciated! Thanks!

Subset of the data:

df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))

Solution

  • Here's an example using tidyverse syntax to avoid manual loops and make things a bit more concise and probably faster. In particular, the format step is vectorised so only the score calculation requires iteration.

    library(tidyverse)
    
    df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))
    
    format_str <- function(string) {
      string %>%
        str_to_lower %>%
        str_remove_all("[:punct:]") %>%
        str_squish %>%
        str_split(" ")
    }
    
    df %>%
      mutate(
        similarity = map2_dbl(
          .x = format_str(answer),
          .y = format_str(given),
          .f = ~ length(intersect(.x, .y)) / length(.x)
        )
      ) %>%
      as_tibble
    #> # A tibble: 10 x 4
    #>    trial answer                        given                    similarity
    #>    <int> <fct>                         <fct>                         <dbl>
    #>  1     1 The stranger officiates the ~ The guests ate the meal       0.4  
    #>  2     2 He ran out of money, so he h~ He lost money because h~      0.333
    #>  3     3 Lets all be unique together ~ People are unique until~      0.231
    #>  4     4 She did her best to help him  She did her very best t~      1    
    #>  5     5 Best way to waste money       Instrument to waste mon~      0.6  
    #>  6     6 Roy travels to Africa         He is in Africa Roy           0.5  
    #>  7     7 I go to work                  ""                            0    
    #>  8     8 She borrowed the book from h~ She borrowed the book f~      0.467
    #>  9     9 Students did not cheat on th~ Students know not to ch~      0.25 
    #> 10    10 We have a lot of rain in June It was raining in June        0.25
    

    Created on 2018-08-17 by the reprex package (v0.2.0).