python python-2.7 python-3.x xml

Searching for a list of words in an XML file in Python


I have an XML file that contains more than 2000 phrases; a small sample is shown below.

<TEXT>

<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'>ASL school</en>
</PHRASE>

<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>

<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>

<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>

And I have a list of patterns:

 finalPatterns=['went \n to \n','created\n  the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

What I want is to take each entry of finalPatterns (for example, went to) and search for it in every phrase of the text. If a phrase contains both went AND to, the script should print the phrase's two <en> tags. [Note: if the en tags are not PERS and ORG, nothing is printed.]

For example, when it searches for:

- "went" & "to" --> the output is: Frank - San Fransisco
- "founded" & "in" --> the output is: Nick - XYZ company

This is what I tried, but it didn't work: nothing was printed.

for phrase in root.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if all(word in phrase for word in finalPatterns):
            x = "".join(phrase.itertext())   # the text in between [since I would also like to print the whole sentence]
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"], ens["PERS"]))

Solution

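  • This should do the trick. In your code, "word in phrase" compares each whole pattern string (newlines included) against the child elements of the phrase, which never matches, and all() over finalPatterns would require every pattern to be present at once. Instead, collect the words of each phrase and test the patterns one at a time:

    for phrase in root.findall('./PHRASE'):
        # Text of every V, N and PREP child of this phrase
        phrasewords = [w.text for w in phrase.findall('V') + phrase.findall('N') + phrase.findall('PREP')]
        for words in finalPatterns:
            # words.split() drops the spaces and newlines inside each pattern
            if all(word in phrasewords for word in words.split()):
                print("found")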
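For reference, here is a minimal end-to-end sketch that combines the word check above with the PERS/ORG filter from the question. It assumes a repaired copy of the sample XML (closing tags and attribute quotes fixed) held in a string. The names SAMPLE, final_patterns and find_matches are made up for this illustration. It prints the PERS and ORG entities, as the question's code tried to, so the "went to" result for Frank shows Google's rather than San Fransisco.

```python
import xml.etree.ElementTree as ET

# Repaired copy of the sample from the question (assumption: this is the intended XML).
SAMPLE = """<TEXT>
<PHRASE><V>played</V><N>John</N><PREP>with</PREP>
<en x='PERS'>Adam</en><PREP>in</PREP><en x='LOC'>ASL school</en></PHRASE>
<PHRASE><V y='0'>went</V><en x='PERS'>Mark</en><PREP>to</PREP>
<en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en>
<V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
<PHRASE><en x='ORG'>Google's</en><en x='PERS'>Frank</en><V>went</V>
<N>yesterday</N><PREP>to</PREP><en x='LOC'>San Fransisco</en></PHRASE>
</TEXT>"""

final_patterns = ['went \n to \n', 'created\n  the\n', 'founded\n a\n',
                  'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def find_matches(root, patterns):
    """Return (pattern, PERS, ORG, sentence) for every phrase/pattern hit."""
    matches = []
    for phrase in root.findall('./PHRASE'):
        # Map entity type -> entity text for the <en> children of this phrase.
        ens = {en.get('x'): (en.text or '').strip() for en in phrase.findall('en')}
        if 'PERS' not in ens or 'ORG' not in ens:
            continue  # the question only wants phrases with both PERS and ORG
        # Words carried by the V, N and PREP children.
        words = {(child.text or '').strip()
                 for child in phrase if child.tag in ('V', 'N', 'PREP')}
        # Whole sentence, rebuilt from the text of every child in document order.
        sentence = ' '.join((child.text or '').strip() for child in phrase)
        for pattern in patterns:
            wanted = pattern.split()  # removes the spaces and newlines
            if all(word in words for word in wanted):
                matches.append((' '.join(wanted), ens['PERS'], ens['ORG'], sentence))
    return matches


root = ET.fromstring(SAMPLE)
for pattern, pers, org, sentence in find_matches(root, final_patterns):
    print("{}: PERS is {}, ORG is {} / {}".format(pattern, pers, org, sentence))
```

With a real file, replacing ET.fromstring(SAMPLE) with ET.parse(path).getroot() should work the same way, provided the file is well-formed.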