rduplicatesmatchingsimilarityrecord-linkage

record matching/similarity calculation for numbers and characters


I have a dataframe structured the following way with much more rows and columns:

Report_ID Block_ID Number Character
1 1 5 A
2 1 3 A
3 1 2 B
4 2 10 A
5 2 11 B
6 2 100 C
7 3 2 D
8 3 #NA A
9 3 8 D
10 3 2 A

df structure for possible matches

and I would like to compare the records per Block-ID and calculate the similarity per each combination of Report-IDs.

A result df could look like this:

Report_ID_combination Similarity_Number Similarity_Character
1_2 0.60 1
1_3 0.40 0
4_5 0.91 0
4_6 0.10 0
7_8 NA 0
7_9 0.25 1
7_10 1.00 0

df structure for similarity calculation

Calculations for similarity:
Similarity_Number = lower value of number divided by higher value of number per report_ID_combination
Similarity_Character = if character in first report matches the character in second report, return 1, elso 0.
If one of the 2 reports is reported with NA for an attribute, the similarity for this attribute is also NA.

So, I'm not looking for any edit distance, Jaccard similarity or something like this, but a package in R where the similarity is calculated blockwise and where I can manually adjust the calculation of similarity (maybe change to absolute calculation for numbers or take into account NA values).
As far as I know RecordLinkage comes only with 0/1 similarity calculation for comparisons between numbers.

Thanks, Marius


Solution

  • Possible approach:

    (Assumes all possible combinations are required within the block, e.g. 2_3 in addition to 1_2 and 1_3.)

    library(tidyverse)
    
    df <- tribble(
      ~"Report_ID", ~"Block_IDm", ~"Number", ~"Character",
      1, 1, 5, "A",
      2, 1, 3, "A",
      3, 1, 2, "B",
      4, 2, 10, "A",
      5, 2, 11, "B",
      6, 2, 100, "C",
      7, 3, 2, "D",
      8, 3, NA, "A",
      9, 3, 8, "D",
      10, 3, 2, "A"
    )
    
    combos <- df |>
      split(df$Block_IDm) |>
      map(\(df) combn(df$Report_ID, 2) |>
        t() |>
        as_tibble()) |>
      bind_rows()
    
    map2(combos$V1, combos$V2, \(x, y){
      df |>
        filter(Report_ID == x | Report_ID == y) |>
        summarise(
          Report_ID_combination = str_c(Report_ID, collapse = "_"),
          Similarity_Number = min(Number) / max(Number),
          Similarity_Character = if_else(first(Character) == last(Character), 1, 0)
        )
    }) |>
      bind_rows()
    #> # A tibble: 12 × 3
    #>    Report_ID_combination Similarity_Number Similarity_Character
    #>    <chr>                             <dbl>                <dbl>
    #>  1 1_2                               0.6                      1
    #>  2 1_3                               0.4                      0
    #>  3 2_3                               0.667                    0
    #>  4 4_5                               0.909                    0
    #>  5 4_6                               0.1                      0
    #>  6 5_6                               0.11                     0
    #>  7 7_8                              NA                        0
    #>  8 7_9                               0.25                     1
    #>  9 7_10                              1                        0
    #> 10 8_9                              NA                        0
    #> 11 8_10                             NA                        1
    #> 12 9_10                              0.25                     0
    

    Created on 2024-03-06 with reprex v2.1.0