I have a dataframe structured the following way with much more rows and columns:
Report_ID | Block_ID | Number | Character |
---|---|---|---|
1 | 1 | 5 | A |
2 | 1 | 3 | A |
3 | 1 | 2 | B |
4 | 2 | 10 | A |
5 | 2 | 11 | B |
6 | 2 | 100 | C |
7 | 3 | 2 | D |
8 | 3 | #NA | A |
9 | 3 | 8 | D |
10 | 3 | 2 | A |
df structure for possible matches
and I would like to compare the records per Block-ID and calculate the similarity per each combination of Report-IDs.
A result df could look like this:
Report_ID_combination | Similarity_Number | Similarity_Character |
---|---|---|
1_2 | 0.60 | 1 |
1_3 | 0.40 | 0 |
4_5 | 0.91 | 0 |
4_6 | 0.10 | 0 |
7_8 | NA | 0 |
7_9 | 0.25 | 1 |
7_10 | 1.00 | 0 |
df structure for similarity calculation
Calculations for similarity:
Similarity_Number = lower value of number divided by higher value of number per report_ID_combination
Similarity_Character = if character in first report matches the character in second report, return 1, elso 0.
If one of the 2 reports is reported with NA for an attribute, the similarity for this attribute is also NA.
So, I'm not looking for any edit distance, Jaccard similarity or something like this, but a package in R where the similarity is calculated blockwise and where I can manually adjust the calculation of similarity (maybe change to absolute calculation for numbers or take into account NA values).
As far as I know
RecordLinkage
comes only with 0/1 similarity calculation for comparisons between numbers.
Thanks, Marius
Possible approach:
(Assumes all possible combinations are required within the block, e.g. 2_3
in addition to 1_2
and 1_3
.)
library(tidyverse)
df <- tribble(
~"Report_ID", ~"Block_IDm", ~"Number", ~"Character",
1, 1, 5, "A",
2, 1, 3, "A",
3, 1, 2, "B",
4, 2, 10, "A",
5, 2, 11, "B",
6, 2, 100, "C",
7, 3, 2, "D",
8, 3, NA, "A",
9, 3, 8, "D",
10, 3, 2, "A"
)
combos <- df |>
split(df$Block_IDm) |>
map(\(df) combn(df$Report_ID, 2) |>
t() |>
as_tibble()) |>
bind_rows()
map2(combos$V1, combos$V2, \(x, y){
df |>
filter(Report_ID == x | Report_ID == y) |>
summarise(
Report_ID_combination = str_c(Report_ID, collapse = "_"),
Similarity_Number = min(Number) / max(Number),
Similarity_Character = if_else(first(Character) == last(Character), 1, 0)
)
}) |>
bind_rows()
#> # A tibble: 12 × 3
#> Report_ID_combination Similarity_Number Similarity_Character
#> <chr> <dbl> <dbl>
#> 1 1_2 0.6 1
#> 2 1_3 0.4 0
#> 3 2_3 0.667 0
#> 4 4_5 0.909 0
#> 5 4_6 0.1 0
#> 6 5_6 0.11 0
#> 7 7_8 NA 0
#> 8 7_9 0.25 1
#> 9 7_10 1 0
#> 10 8_9 NA 0
#> 11 8_10 NA 1
#> 12 9_10 0.25 0
Created on 2024-03-06 with reprex v2.1.0