pythonregex

Detect "DOC-" followed by numeric ID with regex in Python


I want to find all the occurrences of a specific term (and its variations) in a word document. These are the steps:

  1. Extract the text from the word document
  2. Try to find pattern via regex

Extract Text from Word Document

The document variable contains the extracted text with the following function getText(filename):

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Find Pattern with Regex

The pattern consists of words that start with DOC- and after the hyphen- there are 9 digits.

I have tried the following without success:

  1. with start and end line markers

    pattern = re.compile('^DOC\.\d{9}$')
    pattern.findall(document)
    
  2. without

    pattern = re.compile('DOC\.\d{9}')
    pattern.findall(document)
    

Can someone help me?


Solution

  • You can use a combinbation of word and numeric right-hand boundaries.

    Also, you say there must be a dash after DOC, but you use a . in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest to use a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.

    import docx
    
    def getText(filename):
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)
    
    print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )
    

    Details: