I am trying to calculate cosine similarity scores between two groups of texts using stringsim from the stringdist package in R. The texts are stemmed tokens stored in two separate character vectors. Ultimately, I am trying to get a similarity score for each item in data1 compared to each item in data2 and then average these to get one score for each item in data1.
I have been able to calculate the score for each comparison individually using stringsim, and for a subset of each vector using outer. But there are many items in each vector, so doing them individually isn't feasible. I am trying to write a for loop to iterate over them, but can't seem to get the results I am looking for. I am very new to for loops, so I'm sure I'm missing something there, but I can't figure out what it is.
Here is a brief subset of the first five items in each vector to show what my data look like:
data1 <- c("California State Univers stand beacon excel divers peopl pedagogi place singular determin provid student access opportun lead transform self societi",
"encourag student alumni passion empathet forev curious ask consequenti action look embodi Californian spirit",
"Scope Mission California State Univers promot student success opportun high-qual educ prepar student becom leader chang workforc make CSU vital econom engin California",
"Educat ethnic econom academ divers student bodi nation", "renown qualiti teach prepar job-readi graduat")
data2 <- c("Exist law Sherman Food Drug Cosmet Law contain various provis regard packag label advertis food drug cosmet",
"bill appropri unspecifi amount Gener Fund Western Institut Food Safeti Secur within Univers California Davi fund research increas knowledg scientif understand caus detect foodborn diseas",
"SECTION 1 Legislatur herebi find declar follow part healthi nutrit lifestyl Californian encourag increas consumpt fresh fruit veget b California farmer produc highest qualiti food world Howev despit regul mandat safe product process practic foodborn ill occur c Californian eat minim process food Research find resist pathogen bacteria well new pathogen bacteria emerg food suppli d Food safeti team effort everyon � s respons farm processor retail consum",
"SEC 2 sum ____ $ ____ herebi appropri Gener Fund Western Institut Food Safeti Secur within Univers California Davi fund reserch increas knowledg scientif understand caus detect foodborn diseas",
"act add repeal Section 8157 Educat Code relat apprenticeship make appropri therefor")
This is the code I have gotten to work:
require("stringdist")
stringsim(data1[1], data2[1], method = "cosine")
outer(data1, data2, stringsim, method = "cosine")
These are the things I've tried that seem to get close, but haven't quite produced what I'm looking for:
for(i in data1) {
for(j in data2) {
stringsim(data1[i], data2[j], method = "cosine")
}
}
# returns the last item in each data frame for i and j
scores = list()
for(i in data2) {
scores[[i]] <- stringsim(data1[1], data2[i], method = "cosine")
}
scores
# returns NA for each observation instead of a similarity score
for (i in 1:length(data1)) {
stringsim(data1[i], data2[1], method = "cosine")
}
# returns the number of observations in the sample, not similarity scores
I have also tried to do something similar using the textSimilarity function in the text package, but have run into issues with the textEmbed function, so I haven't gotten it to calculate any similarity scores (and thus haven't tried to loop it). I'm including the code I've tried with the text package in case that is easier or more effective for this purpose.
require("text")
data1.sub <- textEmbed(data1)
data2.sub <- textEmbed(data2)
test <- textSimilarity(data1.sub, data2.sub, method = "cosine")
I've been stuck on this for a while, so any help to fix this would be greatly appreciated!
Generally, a for loop is given a vector and iterates over its elements. The problem is that by writing for(i in data1){...} you are already looping through the elements of data1 themselves, so i holds a text string rather than a position. When you then call data1[i], R looks for an element of data1 whose name equals that string. Since data1 has no names, this returns NA.
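The difference between looping over values and looping over positions can be seen in a minimal base R sketch. The vector x and the result vector lengths_by_index are hypothetical stand-ins, not part of the original data:

```r
# Hypothetical stand-in vector; any unnamed character vector behaves the same way.
x <- c("alpha", "beta", "gamma")

# Looping over the vector itself: i takes each VALUE in turn.
for (i in x) {
  # i is a string such as "alpha", so x[i] is a lookup by name.
  # x has no names, so the lookup yields NA.
  print(x[i])
}

# Looping over positions: i takes 1, 2, 3, so x[i] is the i-th element.
lengths_by_index <- integer(length(x))
for (i in seq_along(x)) {
  # Results must be stored explicitly: expressions inside a for loop
  # are not printed automatically.
  lengths_by_index[i] <- nchar(x[i])
}
lengths_by_index
```

After the second loop finishes, i still holds the last index, which is likely why one of the attempts in the question appeared to return the number of observations.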
You can run the following code:
# create storage matrix
stringsim.mat <- matrix(NA, nrow=length(data1), ncol=length(data2))
for(i in 1:length(data1)){ #loop from first to last element in data1
for(j in 1:length(data2)){ #loop from first to last element in data2
stringsim.mat[i,j] <- stringsim(data1[i], data2[j], method = "cosine")
}
}
stringsim.mat
This creates a storage matrix with a row for each element in data1 and a column for each element in data2. It then loops through all pairs of elements and stores each similarity score in the corresponding cell.
Note that a more elegant (and likely faster) solution is to run:
stringdist::stringsimmatrix(data1, data2, method = "cosine")
Both approaches return the same similarity values.
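To get the single averaged score per item in data1 that the question asks for, one option is to take row means of the similarity matrix. The following is only a sketch: the vectors a and b are hypothetical short stand-ins for data1 and data2, and it assumes an installed stringdist version that provides stringsimmatrix.

```r
library(stringdist)

# Hypothetical short strings standing in for data1 and data2.
a <- c("student success opportun", "qualiti teach graduat")
b <- c("food drug cosmet law",
       "fund research foodborn diseas",
       "educat code apprenticeship")

# Similarity matrix: one row per element of a, one column per element of b.
sim <- stringsimmatrix(a, b, method = "cosine")

# Average across columns to get one mean score per element of a.
avg_sim <- rowMeans(sim)
avg_sim
```

With the real data, replacing a and b by data1 and data2 should give a vector with one entry per item in data1.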