r, grouping, similarity, doc2vec, pairwise

pairwise similarity with consecutive points


I have a large document-similarity matrix created with paragraph2vec_similarity from the doc2vec package. I converted it to a data frame and added a Title column at the beginning so that I can sort or group it later.

Current Dummy Output:

Title  Header              DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
Doc1   DocName_1900.txt_1  1.000000            0.7369358           0.6418045           0.6268959           0.6823404
Doc1   DocName_1900.txt_2  0.7369358           1.000000            0.6544884           0.7418507           0.5174367
Doc1   DocName_1900.txt_3  0.6418045           0.6544884           1.000000            0.6180578           0.5274650
Doc2   DocName_1901.txt_1  0.6268959           0.7418507           0.6180578           1.000000            0.5755243
Doc2   DocName_1901.txt_2  0.6823404           0.5174367           0.5274650           0.5755243           1.000000

What I want is a data frame giving the similarity between consecutive parts of each document, that is, the score for Doc1.1 and Doc1.2, and for Doc1.2 and Doc1.3. I am only interested in similarity scores inside each individual document: the entries next to the main diagonal within each document's block (0.7369358, 0.6544884 and 0.5755243 in the table above).

Expected Output

Title  Similarity for 1-2  Similarity for 2-3  Similarity for 3-4
Doc1   0.7369358           0.6544884           NA
Doc2   0.5755243           NA                  NA
Doc3   0.6049844           0.5250659           0.5113757

The closest I could get is a data frame giving the similarity of each doc with all the other docs, using x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m)). Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it runs straight through to the end of the matrix and I lose the document grouping.
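For reference, the long-format attempt and the diag idea described above can be reproduced on a small matrix. This is only a sketch: `m` here is a made-up 3 x 3 stand-in for the real similarity matrix, and `superdiag` is a name introduced for illustration.

```r
# Small symmetric matrix standing in for the paragraph2vec_similarity output
# (assumed values, taken from the dummy table above).
ids <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1901.txt_1")
m <- matrix(c(1.0000000, 0.7369358, 0.6268959,
              0.7369358, 1.0000000, 0.7418507,
              0.6268959, 0.7418507, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Long format: one row per (row, col) cell, so nrow(m) * ncol(m) rows in total.
x <- data.frame(row = rownames(m)[row(m)],
                col = colnames(m)[col(m)],
                similarity = c(m))

# Dropping the last row and the first column shifts the superdiagonal onto the
# main diagonal, so diag() returns the consecutive-pair scores. The result is a
# bare numeric vector: the document labels are gone.
superdiag <- diag(m[-nrow(m), -1, drop = FALSE])
superdiag
```

The second value in `superdiag` compares the last part of one document with the first part of the next, which is exactly the kind of cross-document pair that has to be filtered out.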


Solution

  • If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Header and from the former column names, split both on the dot, and keep only the rows where the document number matches and the part numbers differ by exactly one. Finally, a new column is generated to serve as column names, and the data is made wide again:

    library(tidyverse)
    
    # set up / read in dummy data
    df <- data.table::fread("Title  Header  Doc1.1  Doc1.2  Doc1.3  Doc2.1  Doc2.2
    Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
    Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
    Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
    Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
    Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")
    
    df %>%
        # one row per (Header, column) pair
        tidyr::pivot_longer(-c(Title, Header)) %>% 
        # "Doc1.2" -> "1.2"
        dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
        tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
        tidyr::separate(name, sep = "\\.", into = c("s1","s2")) %>% 
        # same document, and the row part directly follows the column part
        dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
        dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
        # id_cols is named explicitly; recent tidyr versions deprecate passing it by position
        tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)
    
    
    # A tibble: 2 x 3
      Title `Similarity for 1 - 2` `Similarity for 2 - 3`
      <chr>                  <dbl>                  <dbl>
    1 Doc1                   0.737                  0.654
    2 Doc2                   0.576                 NA    
    

    Edit due to new column names (more string manipulation needed):

    library(tidyverse)
    
    # set up / read in dummy data
    df <- data.table::fread("Title  Header  DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
    Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
    Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
    Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
    Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
    Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")
    
    df %>%
        tidyr::pivot_longer(-c(Title, Header)) %>% 
        # "Doc1.2" -> "1.2", "DocName_1900.txt_1" -> "1900.txt_1"
        dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
        tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
        tidyr::separate(name, sep = "\\.txt_", into = c("s1","s2")) %>% 
        # map the year to the document number (1900 -> 1, 1901 -> 2);
        # the offset 1899 is specific to this dummy data
        dplyr::mutate(s1 = as.numeric(s1)-1899) %>%
        dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
        dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
        # id_cols is named explicitly; recent tidyr versions deprecate passing it by position
        tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)
    
    # A tibble: 2 x 3
      Title `Similarity for 1 - 2` `Similarity for 2 - 3`
      <chr>                  <dbl>                  <dbl>
    1 Doc1                   0.737                  0.654
    2 Doc2                   0.576                 NA
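
With 500+ titles it may be cheaper to skip the long format and read the entries next to the diagonal straight from the matrix. The following base R sketch is an alternative to the pipeline above, not part of it. It assumes that `m` is the original similarity matrix with names like `DocName_1900.txt_1`, that rows are ordered by document and then by part, and that everything before the final `_<number>` identifies the document. The names `doc`, `consec` and `result` are introduced here for illustration.

```r
# Dummy similarity matrix with the row/column names from the question.
ids <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

# Assumption: stripping the trailing "_<number>" leaves the document id.
doc <- sub("_[0-9]+$", "", rownames(m))

# Superdiagonal: similarity between row i and row i + 1.
i <- seq_len(nrow(m) - 1)
consec <- data.frame(doc = doc[i], similarity = m[cbind(i, i + 1)])

# Keep only pairs where both parts belong to the same document.
consec <- consec[doc[i] == doc[i + 1], ]

# Label each pair by its position within the document ("1-2", "2-3", ...).
pos <- ave(seq_along(consec$doc), consec$doc, FUN = seq_along)
consec$pair <- paste0(pos, "-", pos + 1)

# One row per document, one column per consecutive pair; missing pairs become NA.
result <- reshape(consec, idvar = "doc", timevar = "pair", direction = "wide")
result
```

This touches only `nrow(m) - 1` cells instead of all `nrow(m)^2`, and it needs no hard-coded offset such as 1899, at the price of relying on the row order.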