For example, I have three sentences like the ones below, where the middle sentence contains a citation (Warren and Pereira, 1982). The citation is always in parentheses with this format: (~string~comma(,)~space~number~)
He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.
I'm using a regex to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this:
The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).
The setup: two test strings representing the cases of interest:
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."
t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."
First, to match in the case where the citation is at the end of a sentence:
p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
To match when the citation is not at the end of a sentence:
p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
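To see that each pattern covers exactly one of the two cases, here is a small sketch that runs them separately against both test strings. The patterns are written as raw strings with the character class spelled [A-Za-z], since a range like A-z would also admit a few punctuation characters that sit between the upper- and lower-case letters in ASCII. The `results` dictionary is only an illustrative name.

```python
import re

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")
t2 = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
      "The system, called BusTUC is built upon the classical system CHAT-80 "
      "(Warren and Pereira, 1982) fgbhdr was a state of the art natural. "
      "CHAT-80 was a state of the art natural language system that was "
      "impressive on its own merits.")

# Citation immediately followed by the sentence-ending period.
p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
# Citation followed by more text before the sentence-ending period.
p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

# Run every pattern against every test string and collect the matches.
results = {
    (pname, sname): re.findall(pattern, sample)
    for pname, pattern in (("p1", p1), ("p2", p2))
    for sname, sample in (("text", text), ("t2", t2))
}

for key, found in results.items():
    print(key, found)
```

Each pattern finds the citation sentence in only one of the two strings, which is why an alternation of the two is needed.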
Combining both cases with the '|' regex alternation operator:
import re

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")
Running:
>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]
>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]
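In both cases you get the sentence with the citation. Because the pattern has two capture groups, re.findall returns one tuple per match, with an empty string for the alternative that did not match.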
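If the tuples with empty strings are inconvenient, one possible variant is a single pattern without capture groups. This is only a sketch under two assumptions: sentences end with a period and contain no other periods outside the citation (so abbreviations or decimals would break it), and the citation follows the "(name, year)" shape from the question. The names `CITATION_SENTENCE` and `citation_sentences` are made up for illustration.

```python
import re

# A sentence is treated as a run of non-period characters, so a match
# cannot spill over into the neighbouring sentences.
CITATION_SENTENCE = re.compile(
    r"[^.\s][^.]*"                 # sentence text before the citation
    r"\([A-Za-z][^()]*, [0-9]+\)"  # citation such as (Warren and Pereira, 1982)
    r"[^.]*\."                     # optional trailing text, then the period
)

def citation_sentences(s):
    """Return every sentence in s that contains a parenthesised citation."""
    # No capture groups, so findall returns the matched sentences as strings.
    return CITATION_SENTENCE.findall(s)

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")
t2 = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
      "The system, called BusTUC is built upon the classical system CHAT-80 "
      "(Warren and Pereira, 1982) fgbhdr was a state of the art natural. "
      "CHAT-80 was a state of the art natural language system that was "
      "impressive on its own merits.")

print(citation_sentences(text))
print(citation_sentences(t2))
```

Unlike the alternation above, this variant does not need a preceding ". ", so it should also pick up a citation in the first sentence of the input.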
A good resource is the Python re module documentation and the accompanying Regular Expression HOWTO.
Cheers