I have to analyse scientific papers published in a list of over 20,000 journals. My list has over 450,000 records, but with many duplicates (e.g., a paper with more than one author from different institutions appears more than once).
I need to count the number of distinct papers per journal, but the problem is that different authors do not always provide the information in the same way, so I can get something like the following table:
JOURNAL PAPER
0001-1231 A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING.
0001-1231 DECYCLING WITH A MATCHING
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232 DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232 OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232 ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232 DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION
My goal is to use something like:
data %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  mutate(papers_in_journal = n())
So, I would have information like:
JOURNAL papers_in_journal
0001-1231 6
0001-1232 7
The problem is that, as you can see, there are errors in the paper titles. Some have a period at the end; some have extra spaces or substituted symbols; some have other minor variations, such as W[1]-HARD versus W-HARD. So if I run the code as is, what I get is:
JOURNAL papers_in_journal
0001-1231 10
0001-1232 10
My question: is there any way to allow for a similarity margin, either in distinct() or in a similar command, so that I could write something like distinct(JOURNAL, PAPER %within% 0.95)?
In other words, I want the command to treat the following as equal:
A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.
etc.
I imagine there is no such simple solution using distinct(), and I was not able to find an alternative command that does this. So, if it is not possible, I would also appreciate suggestions for any disambiguation algorithm I might use.
Thank you.
One option is to use agrep with lapply to find the indices of journal articles that are ≤10% dissimilar (the default for agrep, which you can change with the max.distance argument), take the first index of each match with sapply, count the unique indices with unique and length, and wrap a tapply around it all to get the number of "dissimilar" articles within each journal.
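The code below assumes a data frame called data with character columns JOURNAL and PAPER, as in the question's table; one way to reconstruct it from the rows shown above (so the example is reproducible) is:
data <- data.frame(
  JOURNAL = c(rep("0001-1231", 14), rep("0001-1232", 12)),
  PAPER = c(
    "A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
    "A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
    "THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH",
    "THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH",
    "COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP",
    "COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP",
    "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS",
    "DECYCLING WITH A MATCHING",
    "DECYCLING WITH A MATCHING",
    "DECYCLING WITH A MATCHING",
    "DECYCLING WITH A MATCHING.",
    "DECYCLING WITH A MATCHING",
    "ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH",
    "ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.",
    "DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS",
    "AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH",
    "AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH",
    "COOPERATIVE CAPACITATED FACILITY LOCATION GAMES",
    "OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS",
    "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )",
    "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)",
    "ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS",
    "DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION",
    "GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION",
    "GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION",
    "GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION"
  ),
  stringsAsFactors = FALSE
)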
tapply(data$PAPER, data$JOURNAL, FUN = function(x) {
  length(unique(sapply(lapply(x, function(y) agrep(y, x)), "[", 1)))
})
# 0001-1231 0001-1232
# 6 8
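To see what agrep is doing, here is a small illustration (a toy vector built from two of the titles above, not part of the pipeline itself) of how the max.distance threshold changes which titles count as a match:
# two near-identical titles plus one unrelated title
titles <- c(
  "A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
  "A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
  "DECYCLING WITH A MATCHING"
)

# the default max.distance = 0.1 (10% of the pattern length) tolerates
# the missing hyphen, so both variants are returned
agrep(titles[1], titles)
# [1] 1 2

# max.distance = 0 allows no edits, so only the identical title matches
agrep(titles[1], titles, max.distance = 0)
# [1] 1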
For a dplyr version, which returns the results in a nicer format, I put the above code into a function and then used group_by() followed by summarise().
dissimilar <- function(x, distance = 0.1) {
  length(unique(sapply(lapply(x, function(y)
    agrep(y, x, max.distance = distance)), "[", 1)))
}
With "dissimilar" being defined according to the documentation of agrep
.
library(dplyr)

data %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 2 x 2
  JOURNAL       n
  <chr>     <int>
1 0001-1231     6
2 0001-1232     8
However, for a larger dataset, such as one containing thousands of journals and 450,000+ articles, the above is rather slow (about 10-15 minutes on my 2.50 GHz Intel machine). I realised that the dissimilar function was unnecessarily comparing every row with every other row. Ideally, each row only needs to be compared with itself and the rows that have not yet been matched. For example, the first journal contains 5 very similar articles in rows 8-12: one call to agrep at row 8 returns all 5 indices, so there is no need to compare rows 9-12 with anything else. I therefore replaced the lapply with a for loop, and the process now takes only 2-3 minutes on a dataset of 450,000 rows.
dissimilar <- function(x, distance = 0.1) {
  lst <- list()         # initialise the list of matches
  k <- 1:length(x)      # k holds the indices of PAPERs not yet matched
  for (i in k) {        # i = each PAPER; x[k] = itself and all remaining ones
    lst[[i]] <- k[agrep(x[i], x[k], max.distance = distance)]
    # indexing with k maps the matches back to their original positions in x
    k <- k[!k %in% lst[[i]]]   # remove elements that are similar to x[i]
  }
  lst <- sapply(lst, "[", 1)   # keep only the first match of each element
  length(na.omit(lst))         # count the non-empty elements
}
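As a quick sanity check, running the grouped summary again with this rewritten function (assuming the small data frame reconstructed near the top of this answer) should reproduce the earlier counts:
data %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 2 x 2
  JOURNAL       n
  <chr>     <int>
1 0001-1231     6
2 0001-1232     8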
Now expand the original example dataset so that there are 450,000 records covering around 18,000 journals, each of which contains around 25 articles.
n <- 450000
data2 <- do.call("rbind", replicate(round(n / 26), data, simplify = FALSE))[1:n, ]
data2$JOURNAL[27:n] <- rep(paste0("0002-", seq(1, n / 25)), each = 25)[1:(n - 26)]
data2 %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 18,001 x 2
   JOURNAL       n
   <chr>     <int>
 1 0001-1231     6   # <-- same as before
 2 0001-1232     8
 3 0002-1       14
 4 0002-10      14
 5 0002-100     14
 6 0002-1000    13
 7 0002-10000   14
 8 0002-10001   14
 9 0002-10002   14
10 0002-10003   14
# ... with 17,991 more rows
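The timings above are informal; to measure the runtime on your own machine, you could wrap the grouped summary in system.time(), for example:
system.time(
  result <- data2 %>%
    group_by(JOURNAL) %>%
    summarise(n = dissimilar(PAPER))
)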
The challenge is to find a way to speed up the process even more.
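One further idea (a sketch building on the question's observation that most of the variation is punctuation and spacing, not part of the solution above): cheaply normalise the titles first and drop exact duplicates before calling dissimilar, so that agrep only has to compare the handful of genuinely different strings in each journal.
normalise <- function(x) {
  x <- gsub("[[:punct:]]", "", x)    # drop periods, hyphens, brackets, etc.
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # remove leading/trailing spaces
}

data2 %>%
  mutate(PAPER_CLEAN = normalise(PAPER)) %>%
  distinct(JOURNAL, PAPER_CLEAN) %>%       # vectorised removal of most near-duplicates
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER_CLEAN))   # agrep handles the remaining variants
Because the normalisation and distinct() steps are vectorised, they shrink the input to dissimilar dramatically, leaving agrep to resolve only the harder cases, such as the W[1]-HARD versus W-HARD pair.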