I have a large matrix of document similarities created with paragraph2vec_similarity in the doc2vec package. I converted it to a data frame and added a Title column at the beginning so that I can sort or group it later.
Current Dummy Output:
Title | Header | DocName_1900.txt_1 | DocName_1900.txt_2 | DocName_1900.txt_3 | DocName_1901.txt_1 | DocName_1901.txt_2 |
---|---|---|---|---|---|---|
Doc1 | DocName_1900.txt_1 | 1.000000 | 0.7369358 | 0.6418045 | 0.6268959 | 0.6823404 |
Doc1 | DocName_1900.txt_2 | 0.7369358 | 1.000000 | 0.6544884 | 0.7418507 | 0.5174367 |
Doc1 | DocName_1900.txt_3 | 0.6418045 | 0.6544884 | 1.000000 | 0.6180578 | 0.5274650 |
Doc2 | DocName_1901.txt_1 | 0.6268959 | 0.7418507 | 0.6180578 | 1.000000 | 0.5755243 |
Doc2 | DocName_1901.txt_2 | 0.6823404 | 0.5174367 | 0.5274650 | 0.5755243 | 1.000000 |
What I want is a data frame giving the similarity between consecutive parts of each document, that is, the score for Doc1.1 and Doc1.2, then for Doc1.2 and Doc1.3, and so on. I am only interested in the similarity scores inside each individual document, which lie just off the diagonal, as shown in bold above.
Expected Output:
Title | Similarity for 1-2 | Similarity for 2-3 | Similarity for 3-4 |
---|---|---|---|
Doc1 | 0.7369358 | 0.6544884 | NA |
Doc2 | 0.5755243 | NA | NA |
Doc3 | 0.6049844 | 0.5250659 | 0.5113757 |
I was able to produce a long table giving the similarity of each doc with all the remaining docs using x <- data.frame(col=colnames(m)[col(m)], row=rownames(m)[row(last)], similarity=c(m)). This is the closest I could get. Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it runs straight through to the end of the matrix and I lose the document grouping.
If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Title and Header, split them on the dot, and filter by comparing the results so that only pairs from the same document with consecutive part numbers remain. Finally, a new column is generated to serve as column names, and the data is made wide again:
library(tidyverse)
# set up / read in dummy data
df <- data.table::fread("Title Header Doc1.1 Doc1.2 Doc1.3 Doc2.1 Doc2.2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")
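df %>%
  # one row per (Header, column) pair
  tidyr::pivot_longer(-c(Title, Header)) %>%
  # strip the leading letters: "Doc1.2" becomes "1.2"
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
  # split into document number (f1/s1) and part number (f2/s2)
  tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
  tidyr::separate(name, sep = "\\.", into = c("s1","s2")) %>%
  # keep pairs from the same document whose part numbers differ by one
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  # id_cols is named explicitly; the helper columns f1, f2, s1, s2 are dropped
  tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)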
# A tibble: 2 x 3
Title `Similarity for 1 - 2` `Similarity for 2 - 3`
<chr> <dbl> <dbl>
1 Doc1 0.737 0.654
2 Doc2 0.576 NA
Edit due to new column names (more string manipulation needed):
library(tidyverse)
# set up / read in dummy data
df <- data.table::fread("Title Header DocName_1900.txt_1 DocName_1900.txt_2 DocName_1900.txt_3 DocName_1901.txt_1 DocName_1901.txt_2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")
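df %>%
  tidyr::pivot_longer(-c(Title, Header)) %>%
  # strip leading letters and underscores: "DocName_1900.txt_1" becomes "1900.txt_1"
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
  tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
  tidyr::separate(name, sep = "\\.txt_", into = c("s1","s2")) %>%
  # map the year to the document number (1900 -> 1, 1901 -> 2);
  # this only works while the years are consecutive and start at 1900
  dplyr::mutate(s1 = as.numeric(s1)-1899) %>%
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  # id_cols is named explicitly; the helper columns f1, f2, s1, s2 are dropped
  tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)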
# A tibble: 2 x 3
Title `Similarity for 1 - 2` `Similarity for 2 - 3`
<chr> <dbl> <dbl>
1 Doc1 0.737 0.654
2 Doc2 0.576 NA
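Since the question mentions diag, here is a rough base R sketch of that idea working directly on the similarity matrix rather than on the data frame. It is not part of the pipeline above and rests on assumptions: the row and column names follow the "<file>_<part>" pattern from the question, and the parts of each file sit next to each other in the matrix in ascending order. The function name consecutive_similarity is made up for illustration. It reads the entries just above the diagonal and keeps only those whose two neighbours belong to the same file, so the grouping is kept and no year arithmetic is needed.

```r
# Toy similarity matrix with the dummy values from the question;
# names follow the assumed "<file>_<part>" pattern
nms <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(
  1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
  0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
  0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
  0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
  0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(nms, nms))

# Hypothetical helper: similarity of each part with the next one,
# kept only when both parts belong to the same file
consecutive_similarity <- function(m) {
  n <- nrow(m)
  file <- sub("_[0-9]+$", "", rownames(m))          # e.g. "DocName_1900.txt"
  part <- as.integer(sub("^.*_", "", rownames(m)))  # e.g. 1, 2, 3
  i <- seq_len(n - 1)
  # same file and part numbers that differ by exactly one
  keep <- file[i] == file[i + 1] & part[i + 1] - part[i] == 1
  i <- i[keep]
  data.frame(
    file = file[i],
    pair = paste(part[i], part[i + 1], sep = "-"),
    similarity = m[cbind(i, i + 1)],  # entries just above the diagonal
    stringsAsFactors = FALSE
  )
}

long <- consecutive_similarity(m)

# One row per file, one column per consecutive pair (NA where a file is shorter)
wide <- reshape(long, idvar = "file", timevar = "pair", direction = "wide")
print(wide)
```

Because only n - 1 entries of the matrix are read, this should stay cheap even with several hundred titles, but it breaks if the rows are not ordered by file and part.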