python regex text citations

Python regex to get citations in a paper


I was adapting this code for extracting citations from a text:

#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935

import re
from sys import stdin

text = stdin.read()

author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:, p.? [0-9]+)?"  # Always optional
year = "(?:, *"+year_num+page_num+"| *\("+year_num+page_num+"\))"
regex = "(" + author + additional+"*" + year + ")"

matches = re.findall(regex, text)
matches = list( dict.fromkeys(matches) )
matches.sort()

#print(matches)
print ("\n".join(matches))

However, it recognizes some uppercased words as author names. For example, in the text:

Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi. 
Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990). 
Also James (2020) ...

The output would be

Also James (2020)
Although James (2020)
Green, 2010
Grimm, 1990
Smith et al. (2020)

Is there a way to "blacklist" some words in the above code without removing the entire match? I wish it recognized James' work but removed "Also" and "Although" from the citation.

Thanks in advance.


Solution

  • You may use

    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
    year_num = "(?:19|20)[0-9][0-9]"
    page_num = r"(?:, p\.? [0-9]+)?"  # Always optional; raw string so \. is not treated as an invalid string escape
    year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
    regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
    matches = re.findall(regex, text)
    

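    The main difference is the line regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'. The \b(?!(?:Although|Also)\b) part is a word boundary followed by a negative lookahead: it makes the match fail at any position where the word immediately to the right is Although or Also. The regex engine then moves on and starts the match at the next word, so James (2020) is still matched on its own.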

    Also, note that I escaped the dots that are meant to match literal dots, and used f-strings to make the code a bit more compact.
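
For reference, here is a minimal self-contained sketch of the whole script with the lookahead in place. The names BLOCKLIST, SAMPLE and find_citations are hypothetical and only for illustration (they are not part of the original code), and building the lookahead from a list with re.escape is just one possible way to keep the excluded words easy to extend.

```python
import re

# Hypothetical blocklist: capitalized words that must not start a citation.
BLOCKLIST = ["Although", "Also"]

# Sample text from the question.
SAMPLE = (
    "Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi. "
    "Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990). "
    "Also James (2020) ..."
)

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"


def find_citations(text, blocklist=BLOCKLIST):
    """Return the sorted, de-duplicated citations found in text."""
    # Escape each word so it is matched literally inside the lookahead.
    blocked = "|".join(re.escape(word) for word in blocklist)
    regex = fr"\b(?!(?:{blocked})\b){author}{additional}*{year}"
    # The pattern has no capturing groups, so findall returns whole matches.
    matches = re.findall(regex, text)
    return sorted(dict.fromkeys(matches))


print("\n".join(find_citations(SAMPLE)))
```

With the sample text from the question, this should print Green, 2010, Grimm, 1990, James (2020) and Smith et al. (2020), one per line, without the leading "Also" or "Although".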