python regex nlp text-extraction citations

How to extract sentences containing a citation mark from a text file


For example, I have the three sentences below, where the middle sentence contains a citation mark (Warren and Pereira, 1982). The citation is always in parentheses with this format: (~string~comma(,)~space~number~)

He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.

I'm using a regex to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this:

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).


Solution

  • The setup: two sample texts representing the cases of interest (citation at the end of a sentence, and citation in the middle of one):

    import re

    text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."
    
    t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."
    

    First, to match in the case where the citation is at the end of a sentence:

    p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
    

    To match when the citation is not at the end of a sentence:

    p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
    

    Combining both cases with the `|` regex operator:

    p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")
    

    Running:

    >>> print(re.findall(p_main, text))
    [('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]
    
    >>> print(re.findall(p_main, t2))
    [('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]
    

    In both cases you get the sentence with the citation. The empty string in each tuple is the capture group of the alternative that did not match.

    A good resource is the Python `re` module documentation and the accompanying Regular Expression HOWTO.

    Cheers
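
As a possible alternative, here is a minimal sketch that splits the text into sentences first and then keeps only the ones containing a citation. The patterns above need a preceding `. ` and at least two spaces inside the parentheses, so they would miss a citation in the very first sentence or a single-author one such as (Foo, 1999), which is a made-up example. The names `CITATION`, `SENTENCE_END` and `sentences_with_citation` are my own, and the sentence splitter is deliberately naive: it assumes a sentence ends with `.`, `!` or `?` followed by whitespace, so abbreviations like "e.g. " would be split incorrectly.

```python
import re

# Assumed citation shape, taken from the question: "(<string>, <number>)",
# e.g. "(Warren and Pereira, 1982)".
CITATION = re.compile(r"\([^()]*, \d+\)")

# Naive sentence boundary (an assumption): whitespace right after ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_citation(text):
    """Return the sentences of `text` that contain a citation."""
    sentences = SENTENCE_END.split(text.strip())
    return [s for s in sentences if CITATION.search(s)]


text = (
    "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
    "The system, called BusTUC is built upon the classical system CHAT-80 "
    "(Warren and Pereira, 1982). "
    "CHAT-80 was a state of the art natural language system that was "
    "impressive on its own merits."
)

for sentence in sentences_with_citation(text):
    print(sentence)
```

For real-world text, a proper sentence tokenizer from an NLP library would likely be more reliable than the regex split used here.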