r, data.table, matching, sapply, agrep

Fuzzy matching strings within a single column and documenting possible matches


I have a relatively large dataset of about 5,000 rows containing titles of journal articles and research papers. Here is a small sample of the dataset:

dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts", 
"Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court", 
"Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care", 
"An ecosystem for improving the quality of personal health records", 
"Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders", 
"A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders", 
"A model for the assessment of static and dynamic factors in sexual offenders", 
"The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression", 
"Co-occurring disorders among mentally ill jail detainees. Implications for public policy", 
"Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudinal Study", 
"Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure", 
"Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure", 
"Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0", 
"Diagnosis of active and latent tuberculosis: summary of NICE guidance", 
"Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))

You can see that some titles are duplicated, but with formatting or case differences. I want to identify the duplicated titles and create a new variable that documents which rows possibly match. To do this, I tried the agrep function, as suggested elsewhere:

dt$is.match <- sapply(dt$Title,agrep,dt$Title)

This identifies matches, but stores the results as a list in the new column. Is there a way to do this (preferably using base R or data.table) so that the new column is not a list but simply records which rows match (e.g., 6:7)?

Thanks in advance - hope I have provided enough information.


Solution

  • Do you need something like this?

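    # agrep() returns the indices of approximate matches for each title;
    # toString() collapses them into one comma-separated string per row,
    # so the new column is a character vector rather than a list
    dt$is.match <- sapply(dt$Title, function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)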
    
    dt
    # A tibble: 16 x 2
    #   Title                                                                                                    is.match
    #   <chr>                                                                                                    <chr>   
    # 1 Community reinforcement approach in the treatment of opiate addicts                                      1       
    # 2 Therapeutic justice: Life inside drug court                                                              2, 3    
    # 3 Therapeutic justice: Life inside drug court                                                              2, 3    
    # 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4       
    # 5 An ecosystem for improving the quality of personal health records                                        5       
    # 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders                      6       
    # 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders                             7, 8    
    # 8 A model for the assessment of static and dynamic factors in sexual offenders                             7, 8    
    # 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9       
    #10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy                 10      
    #11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11      
    #12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure          12, 13  
    #13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure          12, 13  
    #14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14      
    #15 Diagnosis of active and latent tuberculosis: summary of NICE guidance                                    15      
    #16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium     16
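
If case differences are the main concern, one possible refinement is to pass `ignore.case = TRUE` to agrep and add a logical flag for rows that match something other than themselves. The following is only a sketch in base R: the vector `titles` is a hypothetical five-title subset of the sample data, and the names `matches`, `res` and `has.dup` are made up for illustration. It relies on agrep's default `max.distance = 0.1`, which tolerates edits up to roughly 10% of the pattern length.

```r
# Hypothetical subset of the sample titles, used only for illustration
titles <- c(
  "Therapeutic justice: Life inside drug court",
  "Therapeutic justice: Life inside drug court",
  "An ecosystem for improving the quality of personal health records",
  "A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
  "A model for the assessment of static and dynamic factors in sexual offenders"
)

# For each title, collect the indices of all titles that approximately match it.
# ignore.case = TRUE means case differences do not count as edits.
matches <- lapply(titles, function(x) agrep(x, titles, ignore.case = TRUE))

res <- data.frame(
  Title    = titles,
  # One comma-separated string per row instead of a list column
  is.match = vapply(matches, toString, character(1)),
  # TRUE when a row matches at least one row besides itself
  has.dup  = lengths(matches) > 1,
  stringsAsFactors = FALSE
)

res[, c("is.match", "has.dup")]
```

Note that agrep looks for approximate matches of the pattern within each string, so a short title could also match inside a longer one. On ~5k rows it is probably worth checking the rows where `has.dup` is TRUE by eye, or tightening `max.distance`, before treating them as duplicates.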