[SOLVED] Detect "DOC-" followed by numeric ID with regex in Python

Detect "DOC-" followed by numeric ID with regex in Python

I want to find all the occurrences of a specific term (and its variations) in a word document. These are the steps:

Extract the text from the word document
Try to find pattern via regex

Extract Text from Word Document

The document variable contains the extracted text with the following function getText(filename):

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Find Pattern with Regex

The pattern consists of words that start with DOC- and after the hyphen- there are 9 digits.

I have tried the following without success:

with start and end line markers

pattern = re.compile('^DOC\.\d{9}$')
pattern.findall(document)

without

pattern = re.compile('DOC\.\d{9}')
pattern.findall(document)

Can someone help me?

Solution

You can use a combinbation of word and numeric right-hand boundaries.

Also, you say there must be a dash after DOC, but you use a . in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest to use a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )

Details:

\b - a word boundary
DOC - DOC substring
[-–—] - a dash symbol (hyphen, en- or em-dash)
\d{9} - nine digits
(?!\d) - immediately to the right of the current location, there must be no digit.