pairwise similarity with consecutive points

I have a large document similarity matrix created with paragraph2vec_similarity from the doc2vec package. I converted it to a data frame and added a Title column at the beginning so that I can sort or group it later.

Current Dummy Output:

Title	Header	DocName_1900.txt_1	DocName_1900.txt_2	DocName_1900.txt_3	DocName_1901.txt_1	DocName_1901.txt_2
Doc1	DocName_1900.txt_1	1.000000	0.7369358	0.6418045	0.6268959	0.6823404
Doc1	DocName_1900.txt_2	0.7369358	1.000000	0.6544884	0.7418507	0.5174367
Doc1	DocName_1900.txt_3	0.6418045	0.6544884	1.000000	0.6180578	0.5274650
Doc2	DocName_1901.txt_1	0.6268959	0.7418507	0.6180578	1.000000	0.5755243
Doc2	DocName_1901.txt_2	0.6823404	0.5174367	0.5274650	0.5755243	1.000000

What I want is a data frame giving the similarity between consecutive parts of each document: the score for Doc1.1 and Doc1.2, then Doc1.2 and Doc1.3, and so on. I am only interested in similarity scores within each individual document, i.e. the entries just off the main diagonal (0.7369358, 0.6544884 and 0.5755243 in the table above).

Expected Output

Title	Similarity for 1-2	Similarity for 2-3	Similarity for 3-4
Doc1	0.7369358	0.6544884	NA
Doc2	0.5755243	NA	NA
Doc3	0.6049844	0.5250659	0.5113757

The closest I could get is a long data frame giving the similarity of each part with every other part: x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m)). Is there a better way? I am dealing with more than 500 titles of varying lengths. There is also the option of using diag, but it runs along the whole matrix to the end and I lose the document grouping.

Solution

If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Header and from the former column names, split both on the dot, and keep only the rows where the document number matches and the part numbers differ by one. Finally, a new column is generated to serve as the column names, and the data is made wide again:

library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title  Header  Doc1.1  Doc1.2  Doc1.3  Doc2.1  Doc2.2
Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")

df %>%
    tidyr::pivot_longer(-c(Title, Header)) %>% 
    dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
    tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
    tidyr::separate(name, sep = "\\.", into = c("s1","s2")) %>% 
    dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
    dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
    # name id_cols explicitly; passing it by position is deprecated in recent tidyr versions
    tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)


# A tibble: 2 x 3
  Title `Similarity for 1 - 2` `Similarity for 2 - 3`
  <chr>                  <dbl>                  <dbl>
1 Doc1                   0.737                  0.654
2 Doc2                   0.576                 NA

Edit due to new column names (more string manipulation needed):

library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title  Header  DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")

df %>%
    tidyr::pivot_longer(-c(Title, Header)) %>% 
    dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
    tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
    tidyr::separate(name, sep = "\\.txt_", into = c("s1","s2")) %>% 
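    # map the year to the document number (1900 -> 1, 1901 -> 2); the 1899 offset is specific to this dummy data
    dplyr::mutate(s1 = as.numeric(s1)-1899) %>%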
    dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
    dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
    # name id_cols explicitly; passing it by position is deprecated in recent tidyr versions
    tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)

# A tibble: 2 x 3
  Title `Similarity for 1 - 2` `Similarity for 2 - 3`
  <chr>                  <dbl>                  <dbl>
1 Doc1                   0.737                  0.654
2 Doc2                   0.576                 NA
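
As an alternative that avoids the hard-coded year offset, the same values can be read straight off the matrix, since the wanted scores sit on the first off-diagonal. This is only a minimal base R sketch, not part of the original answer. It assumes that the row names end in "_<part number>" as in the question, that rows are ordered by document and part, and it uses the name prefix (e.g. DocName_1900.txt) as the title. The function name consecutive_similarity is made up for illustration.

```r
# Dummy similarity matrix using the row/column names from the question.
ids <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical helper: similarity of each part with the next part of the same document.
consecutive_similarity <- function(m) {
  # Assumption: names look like "<document>_<part number>".
  doc  <- sub("_[0-9]+$", "", rownames(m))
  part <- as.integer(sub("^.*_", "", rownames(m)))
  i <- seq_len(nrow(m) - 1)
  # Keep neighbouring rows only if they belong to the same document
  # and their part numbers are consecutive.
  i <- i[doc[i] == doc[i + 1] & part[i + 1] - part[i] == 1]
  pair <- paste0(part[i], "-", part[i + 1])
  # Order the pair columns by part number rather than alphabetically.
  pair <- factor(pair, levels = unique(pair[order(part[i])]))
  # m[cbind(i, i + 1)] picks the first off-diagonal entries.
  wide <- tapply(m[cbind(i, i + 1)], list(doc[i], pair), function(x) x[1])
  out <- data.frame(Title = rownames(wide), wide, check.names = FALSE)
  rownames(out) <- NULL
  out
}

res <- consecutive_similarity(m)
print(res)
```

Pairs that do not exist for a shorter document come out as NA, as in the expected output of the question. If the rows are not already sorted by document and part, they would need to be reordered first.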