python xml xml-parsing elementtree dtd

Fixing xml.etree.ElementTree.ParseError: undefined entity &egrave


I am working with .pri files, which are XML documents like the one below.

<?xml version="1.0"?>
<!DOCTYPE text SYSTEM "text.dtd">
<text id="fn000001">
  <au id="fn000001.1" s="N00023">
    <w id="fn000001.1.1">                          hi              </w>
    <w id="fn000001.1.2">                          there           </w>
    <l id="fn000001.1.3">                          ?               </l>
  </au>
</text>

Parsing a single file with the code below works as expected.

import xml.etree.ElementTree as ET
tree = ET.parse('/path/fn000001.pri')
root = tree.getroot()
print(root.get('id'))

Now I want to do the same for every .pri file in the folder, so I am using the code below:

import glob
import xml.etree.ElementTree as ET
a = glob.glob('/path/*.pri')
  
for files in a:
    tree = ET.parse(files)
    print(tree)

That raises the following error:

tree = ET.parse(files)
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &egrave;: line 147, column 52

Please suggest possible solutions. The related .dtd is in the same folder.


Solution

  • I came across many posts and answers related to this question. One comment explains the reason behind the error:

    As I said, ElementTree does not support entities declared in a separate DTD file. Either declare entities in the XML file or use lxml. Or don't use entities at all.

    So the main question is how to make ElementTree recognize entities that are declared in a separate DTD file.

    One solution is given here: ParseError: undefined entity while parsing XML file in Python

    But what if you have thousands of XML files and want to parse them all in one go? Then that solution will not work out. Instead, you can use lxml:

        from lxml import etree
        parser = etree.XMLParser(dtd_validation=True)
        tree = etree.parse("file.xml", parser)
    

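    All you need to do is pass dtd_validation=True. lxml will then load the .dtd file named in the DOCTYPE line and use its entity declarations while parsing your XML file. Make sure the .dtd file is in the same directory as your XML files. Note that this option also validates each document against the DTD, so a file that does not conform will raise an error.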
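
If installing lxml is not an option, one fallback is to stay with the standard library and give the ElementTree parser a table of entity names. The sketch below is not the lxml approach above but a swapped-in alternative. It assumes the entities declared in text.dtd are the usual HTML-style names (egrave, eacute, and so on), which the question does not confirm, and the helper names parse_with_named_entities and parse_folder are made up for illustration. To my understanding it also relies on each file carrying a DOCTYPE line like the one in the sample, since that is what makes the parser consult this table instead of failing immediately.

```python
import glob
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parse_with_named_entities(path):
    """Parse one file, resolving HTML-style named entities such as &egrave;."""
    # A fresh parser is created for every file; an XMLParser is single-use.
    parser = ET.XMLParser()
    # Assumption: the DTD's entities match the standard HTML entity names.
    for name, codepoint in name2codepoint.items():
        parser.entity[name] = chr(codepoint)
    return ET.parse(path, parser=parser)


def parse_folder(pattern):
    """Return a {path: ElementTree} dict for every file matching the glob pattern."""
    return {
        path: parse_with_named_entities(path)
        for path in sorted(glob.glob(pattern))
    }
```

Calling parse_folder('/path/*.pri') would then give one parsed tree per file. If the DTD defines entities outside the HTML set, they would have to be added to parser.entity by hand.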