r  for-loop  cosine-similarity  stringdist

R: For Loop for Average Cosine Similarity Score


I am trying to calculate cosine similarity scores between two groups of texts using stringsim from the stringdist package in R. The texts are stemmed tokens stored in two separate character vectors. Ultimately, I am trying to get a similarity score for each item in data1 compared to each item in data2 and then average these to get one score for each item in data1.

I have been able to calculate the score for a single comparison using stringsim, and for a subset of each vector using outer. But each vector has many elements, so running the comparisons one at a time isn't feasible. I am trying to write a for loop to iterate over them, but can't seem to get the results I am looking for. I am very new to for loops, so I'm sure I'm missing something there, but I can't figure out what it is.

Here are the first five items of each vector to show what my data look like:

data1 <- c("California State Univers stand beacon excel divers peopl pedagogi place singular determin provid student access opportun lead transform self societi", 
"encourag student alumni passion empathet forev curious ask consequenti action look embodi Californian spirit", 
"Scope Mission California State Univers promot student success opportun high-qual educ prepar student becom leader chang workforc make CSU vital econom engin California", 
"Educat ethnic econom academ divers student bodi nation", "renown qualiti teach prepar job-readi graduat")

data2 <- c("Exist law Sherman Food Drug Cosmet Law contain various provis regard packag label advertis food drug cosmet", 
"bill appropri unspecifi amount Gener Fund Western Institut Food Safeti Secur within Univers California Davi fund research increas knowledg scientif understand caus detect foodborn diseas", 
"SECTION 1 Legislatur herebi find declar follow part healthi nutrit lifestyl Californian encourag increas consumpt fresh fruit veget b California farmer produc highest qualiti food world Howev despit regul mandat safe product process practic foodborn ill occur c Californian eat minim process food Research find resist pathogen bacteria well new pathogen bacteria emerg food suppli d Food safeti team effort everyon ' s respons farm processor retail consum", 
"SEC 2 sum ____ $ ____ herebi appropri Gener Fund Western Institut Food Safeti Secur within Univers California Davi fund reserch increas knowledg scientif understand caus detect foodborn diseas", 
"act add repeal Section 8157 Educat Code relat apprenticeship make appropri therefor")

This is the code I have gotten to work:

require("stringdist")
stringsim(data1[1], data2[1], method = "cosine")

outer(data1, data2, stringsim, method = "cosine")

These are the things I've tried that seem to get close, but haven't quite produced what I'm looking for:

for(i in data1) {
  for(j in data2) {
    stringsim(data1[i], data2[j], method = "cosine")
  }
}
# returns the last item in each data frame for i and j


scores = list()
for(i in data2) {
  scores[[i]] <- stringsim(data1[1], data2[i], method = "cosine")
}
scores
# returns NA for each observation instead of a similarity score


for (i in 1:length(data1)) {
  stringsim(data1[i], data2[1], method = "cosine")
}
# returns the number of observations in the sample, not similarity scores

I have also tried to do something similar using the textSimilarity function in the text package, but I ran into issues with the textEmbed function, so I haven't gotten it to calculate any similarity scores (and thus haven't tried to loop it). I'm including the code I tried with the text package in case that route is easier or more effective for this purpose.

require("text")
data1.sub <- textEmbed(data1)
data2.sub <- textEmbed(data2)
test <- textSimilarity(data1.sub, data2.sub, method = "cosine")

I've been stuck on this for a while, so any help to fix this would be greatly appreciated!


Solution

  • A for loop takes a vector and iterates over its elements. Writing for(i in data1){...} therefore makes i each string of data1 in turn, not its position. When you then call data1[i], R treats the string as a name and looks for an element of data1 with that name; data1 has no names, so the result is NA. In addition, the value of an expression inside a for loop is neither printed nor stored unless you do so explicitly, which is why your loops do not show any similarity scores.
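
    A small base-R illustration of the difference between looping over values and looping over positions may help. The vector x below is a hypothetical toy example, not the data from the question:

    ```r
    # Hypothetical toy vector, for illustration only
    x <- c("apple", "banana", "cherry")

    # for (i in x): i takes each VALUE of x in turn
    values_seen <- character(0)
    for (i in x) {
      values_seen <- c(values_seen, i)
    }

    # Indexing by a string is a lookup by name; x has no names, so this is NA
    by_name <- x["apple"]

    # for (i in seq_along(x)): i takes each POSITION 1, 2, 3
    by_position <- character(0)
    for (i in seq_along(x)) {
      by_position <- c(by_position, x[i])
    }
    ```

    Here values_seen and by_position both end up equal to x, while by_name is NA, which mirrors what happens with data1[i] in the original loops.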

    You can run the following code:

    # create storage matrix: one row per element of data1, one column per element of data2
    stringsim.mat <- matrix(NA_real_, nrow = length(data1), ncol = length(data2))
    
    for (i in seq_along(data1)) {   # loop over positions in data1
      for (j in seq_along(data2)) { # loop over positions in data2
        # store the similarity of data1[i] and data2[j] in cell [i, j]
        stringsim.mat[i, j] <- stringsim(data1[i], data2[j], method = "cosine")
      }
    }
    stringsim.mat
    

    The code first creates an empty matrix with one row for each element of data1 and one column for each element of data2. It then loops over all positions in both vectors and stores the similarity of data1[i] and data2[j] in cell [i, j].

    Note that a more concise (and likely faster) solution is:

    stringdist::stringsimmatrix(data1, data2, method = "cosine")
    

    Both return the same output.
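
    To get from the similarity matrix to the single average score per item of data1 that the question asks for, one option is to take row means. The sketch below uses short hypothetical vectors a and b as stand-ins for data1 and data2; it assumes the stringdist package is installed:

    ```r
    library(stringdist)

    # Hypothetical short inputs standing in for data1 and data2
    a <- c("student success educ", "teach prepar graduat")
    b <- c("food safeti research", "educ code apprenticeship", "fund research diseas")

    # One row per element of a, one column per element of b
    sim <- stringsimmatrix(a, b, method = "cosine")

    # Average across columns: one mean score per element of a
    avg_scores <- rowMeans(sim)
    names(avg_scores) <- a
    ```

    With the real data, rowMeans(stringsimmatrix(data1, data2, method = "cosine")) should give one averaged score per element of data1. As far as I know, method = "cosine" in stringdist compares character q-gram profiles (q = 1 by default) rather than whole tokens, so it is worth checking that this matches the kind of similarity you have in mind.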