rdplyrdistinctdisambiguation

Is there a way to use dplyr distinct() to consider similar values as equal?


I have to do an analysis of scientific papers published in a list of over 20,000 journals. My list has over 450,000 records but with several duplicates (ex: a paper with more than one author from different institutions appear more than once).

Well, I need to count the distinct number of papers per journal, but the problem is that different authors not always provide the information in the same way, and I can get something like the following table:

JOURNAL          PAPER
0001-1231        A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231        A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231        THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231        THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231        COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231        COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231        AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING.
0001-1231        DECYCLING WITH A MATCHING
0001-1231        ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231        ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232        DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232        AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232        AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH 
0001-1232        COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232        OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232        FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232        FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232        ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232        DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION

My goal is to use something like:

data%>%
distinct(JOURNAL, PAPER)%>%
group_by(JOURNAL)%>%
mutate(papers_in_journal = n())

So, I would have information like:

JOURNAL      papers_in_journal
0001-1231    6
0001-1232    7

The problem is that you can see some errors in the name of the papers published. Some have a "period" at the end; some have spaces or replace symbols; some have other minor variations such as W[1]-HARD versus W-HARD. So, if I run the code as is, what I have is:

JOURNAL      papers_in_journal
0001-1231    10
0001-1232    10

My question: is there any way to consider a similarity margin either in the use of distinct() or a similar command, so I can have something like distinct(JOURNAL, PAPER %whithin% 0.95)?

In this sense, I want the command to consider:

A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS

THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH

DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.

etc.

I imagine there is no such simple solution using distinct(), and I was not able to find any alternative commands to do that. So, if it is not possible and you can suggest any disambiguation algorithm I might use, I appreciate as well.

Thank you.


Solution

  • One option is to use agrep with lapply to find the indices of journal articles that are ≤10% dissimilar (the default for agrep, which you can change with the max.distance argument), then take the first article of each and vectorize it using sapply, get the unique indices, the length of the vector, and wrap a tapply around it all to select the number of "dissimilar" articles within each journal.

      tapply(data$PAPER, data$JOURNAL, FUN=function(x) {
          length(unique(sapply(lapply(x, function(y) agrep(y, x) ), "[", 1))
         } )
    
    # 0001-1231 0001-1232 
    #         6         8 
    

    For a dplyr version, which returns the results in a nicer format, I put the above code in a function, then used group_by() followed by summarise().

    dissimilar <- function(x, distance=0.1) {
      length(unique(sapply(lapply(x, function(y) 
         agrep(y, x, max.distance = distance) ), "[", 1)))
    }
    

    With "dissimilar" being defined according to the documentation of agrep.

    library(dplyr)
    
    data2 %>%
      group_by(JOURNAL) %>%
      summarise(n=dissimilar(PAPER))
    
    # A tibble: 2 x 2
      JOURNAL       n
      <chr>     <int>
    1 0001-1231     6
    2 0001-1232     8
    

    However, for a larger dataset, such as one containing thousands of journals and 450,000+ articles, the above will be rather slow (about 10-15 minutes on my 2.50GHz Intel). I realised that the dissimilar function was unnecessarily comparing every row with every other row, which makes little sense. Ideally, each row should only be compared with itself and all remaining rows. For example, the first journal contains 5 very similar articles in rows 8-12. One use of agrep at row #8 returns all 5 indices, and therefore there is no need to compare rows 9-12 with any others. So I replaced the lapply with a for loop, and the process now takes only 2-3 minutes with a dataset of 450,000 rows.

    dissimilar <- function(x, distance=0.1) {
      lst <- list()               # initialise the list
      k <- 1:length(x)            # k is the index of PAPERS to compare with
      for(i in k){                # i = each PAPER, k = itself and all remaining
        lst[[i]] <- agrep(x[i], x[k], max.distance = distance) + i - 1 
                                  # + i - 1 ensures that the original index in x is maintained
        k <- k[!k %in% lst[[i]]]  # remove elements which are similar
      }
      lst <- sapply(lst, "[", 1)  # take only the first of each item in the list
      length(na.omit(lst))        # count number of elements
    }
    

    Now expand the original example dataset so that there are 450,000 records containing around 18,000 journals, each of which contain around 25 articles.

    n <- 45000
    data2 <- do.call("rbind", replicate(round(n/26), data, simplify=FALSE))[1:n,]
    data2$JOURNAL[27:n] <- rep(paste0("0002-", seq(1, n/25)), each=25)[1:(n-26)]
    
    data2 %>%
      group_by(JOURNAL) %>%
      summarise(n=dissimilar(PAPER))
    
    # A tibble: 18,001 x 2
       JOURNAL        n
       <chr>      <int>
     1 0001-1231      6 # <-- Same
     2 0001-1232      8
     3 0002-1        14
     4 0002-10       14
     5 0002-100      14
     6 0002-1000     13
     7 0002-10000    14
     8 0002-10001    14
     9 0002-10002    14
    10 0002-10003    14
    
    # ... with 17,991 more rows
    

    The challenge is to find a way to speed up the process even more.