python regex pdf nlp pdfplumber

Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted


I wrote the following Python script as a reproducible example of my current PDF-parsing hangup.

Running the Python code below generates the following output:

$ python stack_overflow_q_example.py
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10

Your mission, should you choose to accept it: get this passage10 test to pass without breaking one of the previous tests. I'm hoping there's a clever regex or other modification in extract_q_a_locations below that will do the trick, but I'm open to any solution that passes all these tests, as I chose these test passages deliberately.

A little background on this transcript text, in case it's not as fun reading to you as it is to me: sometimes a passage starts with a "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a question asked by a staff member whose name is redacted. The only way I've managed to get that test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question. (Note: in the PDF library I'm using, pdfplumber, redacted text usually shows up as just a bunch of extra spaces.)
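To make the failure mode concrete, here is a minimal sketch that runs the same combined regex over a few made-up lines. The sample text and its spacing are assumptions that only imitate pdfplumber's layout output; they are not copied from the transcript.

```python
import re

# Made-up lines imitating pdfplumber's layout=True output (spacing is an assumption).
sample = (
    "\n          1      Q    Did you attend the meeting?"
    "\n          2      A    Yes, I did."
    "\n          3      Ms. Cheney. Thank you."
    "\n          4                      So at this point, I'm going to move on."
)

# Same building blocks as in the script below.
prefix_regex = r"\n\s+\d+\s+"
qa_regex = r"[QA]\s+"
speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"

# Collect the last token of each delimiter match (the "Q", "A", or speaker name).
starts = [m.group().split()[-1] for m in re.finditer(pattern, sample)]
print(starts)
```

Line 4 of the sample, where the speaker name is only a run of spaces, produces no match, so its text would be folded into the preceding passage.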

Code below:

import nltk
import re
import requests
import pdfplumber


def extract_q_a_locations(examination_text: str) -> list:

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces,
    # then a line number and more spaces
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    # search the function argument rather than the module-level `text` variable
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.)
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # HERE'S MY PROBLEM. 
    # This test fails because my regex fails to capture the question which starts with the 
    # redacted name of the staff/questioner. The only way I've managed to get this test to 
    # pass has also broken at least one of the tests above. 
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg

Solution

  • I have assumed that redactions in the middle of a passage do not matter. My approach is to replace the run of spaces left by a redacted name with the placeholder "Ms. Fakename.". As you mention in your question, every passage starts with "Q", "A", or a name, and a name ends with a period and is followed by text starting with a capital letter. When the name at the start of a passage is redacted, the line number is instead followed by a long run of spaces and then a capital letter. Combining these observations, I got all the tests passing by adding the following snippet:

        lines = text.splitlines()
    
        for i in range(len(lines)):
            if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
                lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)
        
        text = "\n".join(lines)
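To see the substitution in isolation, here is a minimal sketch that applies the same two regexes to a few made-up lines. The sample lines and their spacing are assumptions that only imitate pdfplumber's layout output.

```python
import re

# Made-up lines imitating pdfplumber's layout=True output (spacing is an assumption).
lines = [
    " " * 10 + "9" + " " * 6 + "A" + " " * 4 + "Yes, that is right.",
    " " * 10 + "10" + " " * 20 + "So at this point, I'm going to move on.",
    " " * 10 + "11" + " " * 6 + "to the next exhibit.",
]

# Indent, a 1-2 digit line number, a long gap, then text starting with a capital letter.
REDACTED_SPEAKER = re.compile(r" {10,}\d{1,2} {15,}[A-Z].+")

patched = [
    re.sub(r" {15,}", "       Ms. Fakename. ", line, count=1)
    if REDACTED_SPEAKER.fullmatch(line) else line
    for line in lines
]

for line in patched:
    print(repr(line))
```

Only the second line is rewritten. Because `count=1` replaces the first run of 15 or more spaces, this relies on the leading indent being shorter than 15 spaces; otherwise the indent itself would be replaced.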
    

    The final code, with the snippet placed right after the PDF text is read, is:

    import nltk
    import re
    import requests
    import pdfplumber
    
    
    def extract_q_a_locations(examination_text: str) -> list:
    
        # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces,
        # then a line number and more spaces
        prefix_regex = r'\n\s+\d+\s+'
    
        # sometimes what comes next is a 'Q' or 'A' and more spaces
        qa_regex = r'[QA]\s+'
    
        # other times what comes next is the name of a congressperson or lawyer for the witness
        speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
    
        # the combined regex I've been using is looking for the prefix then QA or Speaker regex
        pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
        # search the function argument rather than the module-level `text` variable
        delims = list(re.finditer(pattern, examination_text))
        return delims
    
    def get_q_a_passages(qa_delimiters, text):
        q_a_list = []
        for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
            # prefix is either 'Q', 'A', or the name of the speaker
            prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]
    
            # the text chunk is the actual dialogue text. everything from current delim to next one
            text_chunk = text[delim.span()[1]:next_delim.span()[0]]
            
            # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
            text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
            text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
            
            q_a_list.append(f"{prefix} {text_chunk}")
    
        return q_a_list
    
    if __name__ == "__main__":
    
        # download pdf
        PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
        FILENAME = "interview_transcript_stackoverflow.pdf"
    
        response = requests.get(PDF_URL)
        with open(FILENAME, "wb") as f:
            f.write(response.content)
    
        # read pdf as text
        with pdfplumber.open(FILENAME) as pdf:
            text = "".join([p.extract_text(layout=True) for p in pdf.pages])
        
        lines = text.splitlines()
    
        for i in range(len(lines)):
            if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
                lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)
        
        text = "\n".join(lines)
    
        # I care about the Q&A transcript, which starts after the "EXAMINATION" header
        startidx = text.find("EXAMINATION")
        text = text[startidx:]
    
        # extract Q&A passages
        passage_locations = extract_q_a_locations(text)
        passages = get_q_a_passages(passage_locations, text)
    
        # TESTS
        ACCEPTABLE_TEXT_DISCREPANCY = 2
    
        # The tests below all pass already.
        actual_passage0_start = "Q So I do first want to bring up exhibit"
        assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
        print("Test for passage0 passed.")
    
        actual_passage1 = "A This is correct."
        assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
        print("Test for passage1 passed.")
    
        # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" &
        # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.)
        actual_passage7_start = "Cheney. And we also, just as" 
        assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
        print("Test for passage7 passed.")
    
        actual_passage8_start = "Jordan. They are pro bono"
        assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
        print("Test for passage8 passed.")
    
        # This test used to fail because the regex did not capture the question that starts
        # with the redacted name of the staff/questioner. With the placeholder name inserted
        # above, the passage is now detected and its prefix is "Fakename.".
        actual_passage10_start = "Fakename. So at this point, as we discussed earlier, I'm going to"
        e_msg = "Failed on passage 10"
        assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
    

    Note that in the last test, I changed the expected text to begin with the placeholder prefix "Fakename". If this is not desired, the manually added prefix can be stripped from the passages list afterwards.
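One way to strip the placeholder afterwards is sketched below. The helper name `strip_placeholder` and the `PLACEHOLDER` constant are hypothetical, and the sketch assumes the prefix comes out as "Fakename." (with the trailing period), which is what `get_q_a_passages` yields for a delimiter ending in "Ms. Fakename.".

```python
# Hypothetical constant: the prefix get_q_a_passages produces for "Ms. Fakename."
PLACEHOLDER = "Fakename."


def strip_placeholder(passages, placeholder=PLACEHOLDER):
    """Return a copy of passages with the leading placeholder prefix removed."""
    cleaned = []
    for passage in passages:
        if passage.startswith(placeholder + " "):
            # Keep the separating space so the passage starts with " ", matching
            # the expected string in the original passage 10 test.
            passage = passage[len(placeholder):]
        cleaned.append(passage)
    return cleaned


# Made-up passages for illustration only.
example = [
    "Q So I do first want to bring up exhibit",
    "Fakename. So at this point, as we discussed earlier",
]
cleaned_example = strip_placeholder(example)
print(cleaned_example)
```

Passages with a real prefix ("Q", "A", or a visible name) are left untouched, so the earlier tests are unaffected by this step.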