I am not very familiar with regex, but I would like to extract the title of a paper from a citation. The title sits between the year in parentheses (for example, 1991 in the first citation) and the next period in the sentence.
"1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63."
"2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88."
"3WHO. Geneva: WHO; 2018. World Malaria Report 2018".
The citations are stored in a data frame (df) in the column "citation".
Desired output:
The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera
The structure and role of CR1 complement receptor in pathology
I wrote a regex which looks like this:
df$citation = sub('[^"]*?)', "", df$citation)
df$citation = sub("\\..*", "", df$citation)
Any advice on how to do this in a single line? In addition, it would be good to have a regex that deletes the citation when it does not find a year in parentheses, as in the third citation. Is this possible?
Given your set of requirements, you can use
sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", df$citation, perl=TRUE)
See the regex demo
- ^ - start of string
- .*? - any 0+ chars, other than line break chars, as few as possible
- \b(?:19|20)\d{2} - word boundary, 19 or 20 and any two digits
- \) - a )
- \s* - 0+ whitespaces
- ([^.]+) - Group 1: one or more chars other than .
- .* - any 0+ chars, other than line break chars, as many as possible.
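One thing to keep in mind: sub() returns a string unchanged when the pattern does not match, so the third citation would come back in full rather than being deleted. Below is a minimal sketch of one way to handle that, assuming it is acceptable to test for a match with grepl() first. The names pattern, has_year, title and df_titles are illustrative and not part of the original question.

```r
# Sample data mirroring the three citations from the question
df <- data.frame(
  citation = c(
    "1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63.",
    "2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88.",
    "3WHO. Geneva: WHO; 2018. World Malaria Report 2018"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer, stored once so it can be reused
pattern <- "^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*"

# sub() leaves non-matching strings untouched, so check for a match first
has_year <- grepl(pattern, df$citation, perl = TRUE)

# Keep the captured title where the pattern matched, NA otherwise
df$title <- ifelse(has_year,
                   sub(pattern, "\\1", df$citation, perl = TRUE),
                   NA_character_)

# Optionally drop the rows that have no "(year)" at all
df_titles <- df[has_year, , drop = FALSE]
```

With this approach the third citation ends up as NA in df$title and is absent from df_titles, while the first two rows hold only the extracted titles.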