r, regex, text, gsub, citations

Extracting in-text citations (character strings) from text in R


I'm trying to write a function that takes pasted text and returns a list of the in-text citations used in it. This is what I currently have:

pull_cites <- function(text) {
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}
    
pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

And in this example, it returns

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would want it to return something like:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!


Solution

  • You can use a regex with two capturing groups together with stringr::str_match_all:

    x <- "This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is something I would want to be returned. I would also want multiple citations returned separately such as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015."
    rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
    library(stringr)
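    res <- str_match_all(x, rx)
    # Column 2 holds Group 1 (the name before the parentheses), column 3 holds Group 2 (the parenthesized text).
    # Prepend the name only when the parentheses contain nothing but digits (a bare year).
    result <- lapply(res, function(z) {ifelse(!is.na(z[,2]) & str_detect(z[,3],"^\\d+$"), paste(trimws(z[,2]),  trimws(z[,3])), z[,3])})
    # Join everything with ";" and split again so multi-citation groups become separate items.
    unlist(sapply(result, function(z) strsplit(paste(z, collapse=";"), "\\s*;\\s*")))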
    ## -> [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"
    


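    The regex matches (shown here without the extra R string escaping)

      • (?:\b(\p{Lu}\w*(?:\s+\p{Lu}\w*)*))? - an optional part: a word boundary, then Group 1, which captures an uppercase letter followed by zero or more word chars, and then zero or more further capitalized words, each preceded by one or more whitespaces
      • \s* - zero or more whitespaces
      • \( - a ( char
      • ([^()]*\d{4}) - Group 2: zero or more chars other than ( and ), followed by four digits
      • \) - a ) char.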
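To see what the two capturing groups hold, it can help to run the pattern on a short string and look at the match matrix. This is only an illustrative sketch: the sample sentence and the variable name `m` are made up here and are not part of the original answer.

```r
library(stringr)

# Same pattern as in the answer above
rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"

# Made-up sample text with one narrative and one parenthetical citation
sample_text <- "Results from Cooper (2015) agree with earlier work (Smith 2010; Jones 2001)."

# One row per match; column 1 is the whole match, columns 2 and 3 are Groups 1 and 2
m <- str_match_all(sample_text, rx)[[1]]

m[, 2]  # Group 1: "Cooper" for the first match, NA for the second
m[, 3]  # Group 2: "2015" and "Smith 2010; Jones 2001"
```

Group 1 is NA whenever no capitalized word directly precedes the opening parenthesis, which is what the later `!is.na(...)` check relies on.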

    The str_match_all(x, rx) function finds all matches and keeps the captured substrings. Then, the Group 1 and Group 2 values (columns 2 and 3 of the match matrix) are concatenated if Group 1 is not NA and Group 2 consists only of digits; otherwise, the Group 2 value is used as is. Finally, the items in the result variable are joined with a ; char and split again on ; (together with any whitespace around it), so that citations listed together in one pair of parentheses come out as separate items.
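Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is a minimal sketch, assuming a single input string; the name `pull_cites` is borrowed from the question, and the helper variables (`author`, `inner`, `narrative`) are introduced here only for illustration.

```r
library(stringr)

# Sketch of a wrapper around the approach above; expects one character string.
pull_cites <- function(text) {
  rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
  m <- str_match_all(text, rx)[[1]]        # columns: full match, Group 1, Group 2
  if (nrow(m) == 0) return(character(0))   # no citation-like parentheses found

  author <- m[, 2]                         # capitalized word(s) before "(", or NA
  inner  <- m[, 3]                         # text inside the parentheses

  # Narrative citation such as "Cooper (2015)": the parentheses hold only a year
  narrative <- !is.na(author) & str_detect(inner, "^\\d+$")
  cites <- ifelse(narrative, paste(trimws(author), inner), inner)

  # Split grouped citations such as "Smith 2010; Jones 2001" into separate items
  trimws(unlist(strsplit(cites, "\\s*;\\s*")))
}

pull_cites("Work by Cooper (2015) and others (abbr) supports this (Smith 2010; Jones 2001).")
```

One caveat that follows from the pattern itself: Group 1 accepts any run of capitalized words, so a sentence-initial word directly before the name (for example "See Cooper (2015)") would be captured along with it.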