r regex gsub

Getting the title of a citation with regex


I am not very familiar with regex, but I would like to extract the title of a paper from a citation. The title sits between the year in parentheses (for example, 1991 in the first citation) and the next dot in the sentence.

"1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63."

"2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88."

"3WHO. Geneva: WHO; 2018. World Malaria Report 2018".

The citations are stored in a data frame (df) in the column "citation".

Desired output:

The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera

The structure and role of CR1 complement receptor in pathology

I wrote a regex which looks like this:

df$citation = sub('[^"]*?)', "", df$citation)
df$citation = sub("\\..*", "", df$citation)

Any advice on how to do this in a single line? In addition, it would be good if the regex deleted the citation when it does not find a year in parentheses, as in the third citation. Is this possible?


Solution

  • Given your set of requirements, you can use

    sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", df$citation, perl=TRUE)
    

    See the regex demo
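
Note that sub() returns its input unchanged when the pattern does not match, so a citation without a year in parentheses (the third one in the question) comes back as is. Below is a minimal sketch of one way to handle that, assuming NA is an acceptable placeholder for a missing title. The `title` column and the `df_titles` name are illustrative and not part of the original question.

```r
# Sample data mirroring the three citations from the question.
# "\u2013" is the en dash in the page range of the second citation.
df <- data.frame(
  citation = c(
    "1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63.",
    "2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84\u201388.",
    "3WHO. Geneva: WHO; 2018. World Malaria Report 2018"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above.
pattern <- "^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*"

# Check for a match first, because sub() leaves non-matching strings untouched.
has_year <- grepl(pattern, df$citation, perl = TRUE)

# Hypothetical "title" column: extracted title, or NA when no "(year)" is found.
df$title <- ifelse(has_year,
                   sub(pattern, "\\1", df$citation, perl = TRUE),
                   NA_character_)

# Optionally keep only the rows where a title was found.
df_titles <- df[has_year, , drop = FALSE]
```

Whether to keep the non-matching rows with NA or drop them, as `df_titles` does, depends on what "delete the citation" should mean for the rest of the analysis.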

    Details