Tags: r, sequence, n-gram, clickstream

Pairwise sequence list matching in R


I am having trouble with sequence alignment/matching on lists in R. My data are clickstream data, and I have split the sequences into n-grams. The sequences look something like this:

1. ABDCGHEI... NaNa
2. ACSNa.... NaNa

and so on, where Na stands for "Not available" and is used as padding so that all sequences have the same length. I then put all of these sequences into a list in a crude way:

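library(RWeka)  # provides NGramTokenizer() and Weka_control()

dativec = as.vector(dataseq2)
prova2 = list()  # must exist before elements are assigned to it
for(i in seq_along(dativec)) {
  prova2[[i]] = dativec[i]
}
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
prova3 = lapply(prova2, BigramTokenizer)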

and split them into n-grams. For example, the bigrams look like this:

[[1]] "A B" "B D" "D C".... "Na Na"
[[2]] "A C" "C S" .... "Na Na"

Now the challenge is: how can I match every bigram of each element of my list against every bigram of the other elements in the list? I tried the Biostrings package, but the function pairwiseAlignment only returns a score for the first bigram of each element in the list. I just need to know whether the bigrams are identical or not, and I need this for all comparisons, not just the first elements. The desired result is the percentage of equal bigrams, without any information about their positions; I only care about identity. I also tried the setdiff function, but it does not seem to work the way I want.

Edited for more clarity


Solution

  • You can use outer() to compare every bigram of one element with every bigram of another:

    bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                    b = c("A C", "C S", "Na Na"))
    
    with(bigrams, outer(a, b, `==`))
    
    ##>       [,1]  [,2]  [,3]
    ##> [1,] FALSE FALSE FALSE
    ##> [2,] FALSE FALSE FALSE
    ##> [3,] FALSE FALSE FALSE
    ##> [4,] FALSE FALSE  TRUE
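
    The matrix above compares a single pair of elements. As a minimal sketch of how this could be extended to the percentage of shared bigrams for every pair in the list, ignoring positions, the code below uses %in% instead of outer(). The helper name shared_fraction, the third list element c, and the choice to drop the "Na Na" padding and duplicate bigrams before comparing are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical example data: one character vector of bigrams per sequence.
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"),
                c = c("A B", "B D", "D E"))

# Fraction of the distinct bigrams in x that also occur anywhere in y.
# Assumption: the padding bigram is not a real match, so it is removed first;
# pass pad = NULL to keep it.
shared_fraction <- function(x, y, pad = "Na Na") {
  x <- setdiff(x, pad)  # setdiff() also removes duplicates
  y <- setdiff(y, pad)
  if (length(x) == 0) return(NA_real_)  # nothing left to compare
  mean(x %in% y)
}

# Rows: element whose bigrams are looked up; columns: element searched in.
overlap <- sapply(bigrams, function(y) {
  sapply(bigrams, function(x) shared_fraction(x, y))
})

print(round(100 * overlap, 1))  # as percentages
```

    Here overlap["a", "c"] is 2/3, because two of the three distinct non-padding bigrams of a also occur in c. The matrix is not symmetric in general, since each row is normalised by the number of bigrams in its own element.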