I want to find all the occurrences of a specific term (and its variations) in a Word document. These are the steps:
The document variable contains the text extracted with the following function, getText(filename):
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
The pattern consists of words that start with DOC- and, after the hyphen, have 9 digits.
I have tried the following without success:
With start-of-line and end-of-line markers:
pattern = re.compile('^DOC\.\d{9}$')
pattern.findall(document)
Without them:
pattern = re.compile('DOC\.\d{9}')
pattern.findall(document)
Can someone help me?
You can use a combination of a word boundary on the left and a numeric boundary on the right.
Also, you say there must be a dash after DOC, but you use a . in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest using a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.
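As one possible way to cover every Unicode dash rather than just the three listed, the character class can be built from the Unicode "Pd" (dash punctuation) category with the standard unicodedata module. This is only a sketch: the names DASHES, ANY_DASH and doc_id are made up for illustration, and it assumes the "Pd" category is what you mean by "dash" (the minus sign U+2212, for instance, is not in it).

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode category is "Pd" (dash punctuation).
DASHES = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
)

# re.escape keeps the ASCII hyphen literal inside the character class.
ANY_DASH = '[' + re.escape(DASHES) + ']'
doc_id = re.compile(r'\bDOC' + ANY_DASH + r'\d{9}(?!\d)')

# Hyphen, en-dash (U+2013) and em-dash (U+2014) variants of the same ID.
print(doc_id.findall('DOC-123456789 DOC\u2013123456789 DOC\u2014123456789'))
```

Scanning all code points takes a moment, so build the class once at import time and reuse the compiled pattern.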
import re
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

# filename is the path to your .docx file
print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )
Details:
\b - a word boundary
DOC - a DOC substring
[-–—] - a dash symbol (hyphen, en- or em-dash)
\d{9} - nine digits
(?!\d) - immediately to the right of the current location, there must be no digit.
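To see how each part of the pattern behaves without needing a .docx file, here is a small check on a made-up string (the sample text and the names DOC_ID, ORIGINAL and sample are hypothetical, not taken from the question). It also runs the question's second attempt on the same text for comparison.

```python
import re

# Suggested pattern: word boundary, DOC, a dash, nine digits, no tenth digit.
DOC_ID = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')

# The question's second attempt: \. matches a literal dot, not a hyphen.
ORIGINAL = re.compile(r'DOC\.\d{9}')

sample = (
    "See DOC-123456789, DOC\u2013987654321 and DOC-1234567890; "
    "XDOC-111111111 is skipped."
)

# The ten-digit number fails the (?!\d) lookahead, and XDOC fails \b.
print(DOC_ID.findall(sample))    # ['DOC-123456789', 'DOC–987654321']

# No ID in the sample uses a dot, so the original pattern finds nothing.
print(ORIGINAL.findall(sample))  # []
```

The ^...$ version in the question fails for an additional reason: without re.MULTILINE, ^ and $ anchor to the start and end of the whole string, so it can only match when the entire document is a single ID.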