python xml xml-parsing nltk

Inserting XML tags at a specific part of a file without disrupting the format


I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:

<text xml:lang="">
    <body>
      <div>
        <p>
          <p>
            <lb xml:id="p1z1" />19.
                    <lb xml:id="p1z2" />esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                    <lb xml:id="p1z3" />esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
                    <lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
                    <lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
                    <lb xml:id="p1z6" />fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
                    <lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                    <lb xml:id="p1z8" />vel generum mihi per literas responsurum. Frater igitur dixit quidem
                    <lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                    <lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                    <lb xml:id="p1z11" />promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
                    <lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
                    <lb xml:id="p1z13" />ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
                    <lb xml:id="p1z14" />nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
                    <lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
                    <lb xml:id="p1z16" />per Mosen. Mea est ultro et ego retribuam eis in tempore.
                    <lb xml:id="p1z17" />De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
                    <lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
        </p>
      </div>
    </body>
  </text>
</TEI>

The sentences I need to tag span several lines. Each line starts with a line break tag, "<lb xml:id="n" />". I need to tag the sentences and then write them back to the file in their original format. The issue I run into is that, although the text contains newline characters, as soon as I create a sentence element and try to append it after the line break tag, the newline character is no longer valid.

The output should look like:

<text xml:lang="">
    <body>
      <div>
        <p>
          <p>
            <lb xml:id="p1z1" /><s n="1" xml:lang="la">19.</s>
                    <lb xml:id="p1z2" /><s n="1" xml:lang="la">esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                    <lb xml:id="p1z3" />esse epistolam meam interpretatum.</s><s n="2" xml:lang="la"> Caeterum, quod scribis te ex consilio consanguine
                    <lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
                    <lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit.</s><s n="3" xml:lang="la"> Res enim ista ad me suum ad
                    <lb xml:id="p1z6" />fratrem pertinebat.</s><s n="4" xml:lang="la"> Nec ita fueram abs te dimissus, quod vel tu tale
                    <lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                    <lb xml:id="p1z8" />vel generum mihi per literas responsurum.</s><s n="5" xml:lang="la"> Frater igitur dixit quidem
                    <lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                    <lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                    <lb xml:id="p1z11" />promisisses, ita faceres.</s><s n="6" xml:lang="la"> Ego simulatque tergiversationem istam cognoscere
                    <lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
                    <lb xml:id="p1z13" />ut dicitur.</s><s n="7" xml:lang="la"> Nam quae plana sunt et integra sive dicantur sive scripsisse
                    <lb xml:id="p1z14" />nihil refert.</s><s n="8" xml:lang="la"> Utut sit, ego iniuriam illam, ex qua omnes istae
                    <lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
                    <lb xml:id="p1z16" />per Mosen.</s><s n="9" xml:lang="la"> Mea est ultro et ego retribuam eis in tempore.</s>
                    <lb xml:id="p1z17" /><s n="10" xml:lang="la">De altero etiam capite accipio tuam excusationem.</s><s n="11" xml:lang="la"> Quum enim tam sancte
                    <lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
       </p>
      </div>
    </body>
  </text>
</TEI>

My code looks like:

import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')

def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree

def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]
            
            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = " ".join(sentences).split("\n")
            print(sentences2)
            
            # Clear the paragraph's existing content
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')

I'm losing my mind. Hopefully I have an XML pro who is willing to come to my rescue.

I've also tried first tagging, and then reinserting the line break tags afterwards, but due to the nature of XML it's tough. The next thing I would maybe attempt is to create temporary .txt files and go line by line and insert the tags on the lines that don't match...

Any and all help appreciated at this point.


Solution

  • The job can be done by taking advantage of the tail attribute of the lb elements. Each lb is turned into a list: item 0 is the serialized element, and the items with index > 0 are element.tail split by the regexp r'(\.|\n)'. The s label is then placed by detecting where a sentence starts and where it ends (dots).

    ['<lb xml:id="p1z1"/>', '19', '.', '', '\n', '            ']
    

    That list represents this element (quoted to show whitespace):

    '<lb xml:id="p1z1"/>19.
                    '
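
To see where that list comes from: because the pattern is wrapped in a capturing group, re.split keeps the separators (the dots and newlines) as items of their own, so joining the pieces gives back the original tail unchanged. A minimal check, using the tail of the first lb from the sample (the variable names are only illustrative):

```python
import re

# tail of <lb xml:id="p1z1"/>: the text, a newline, then the next line's indentation
tail = '19.\n' + ' ' * 12

# the capturing group keeps '.' and '\n' as separate list items
parts = re.split(r'(\.|\n)', tail)
print(parts)  # ['19', '.', '', '\n', '            ']

# nothing is lost: the pieces concatenate back to the original tail
print(''.join(parts) == tail)  # True
```

That lossless round trip is what lets the original line breaks and indentation survive the tagging.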
    

    The script does not take namespaces into account and is provided as a proof of concept of the parsing technique. It could be cleaner to label sentences with a self-closing element:

    <lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
    <lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
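
On the namespace caveat: a real TEI file normally declares a default namespace, and parsers store xml:id under the predefined XML namespace rather than under the literal key 'xml:id', which is why lb.attrib.get('xml:id', '') in the question comes back empty with ElementTree. A minimal sketch using the standard library, assuming the file uses the usual TEI namespace http://www.tei-c.org/ns/1.0 (the question does not show the root element, so that namespace and the shortened sample are assumptions):

```python
import xml.etree.ElementTree as ET

# key under which parsers store the xml:id attribute
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
# assumed: the document declares the standard TEI namespace
TEI = {'tei': 'http://www.tei-c.org/ns/1.0'}

sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p>
  <lb xml:id="p1z1"/>19.
  <lb xml:id="p1z2"/>esse epistolam
</p></div></body></text></TEI>'''

root = ET.fromstring(sample)

# namespace-aware lookup instead of stripping the prefixes from the tags
lbs = root.findall('.//tei:p/tei:lb', TEI)
ids = [lb.get(XML_ID) for lb in lbs]
print(ids)                    # ['p1z1', 'p1z2']
print(lbs[0].get('xml:id'))   # None: the literal key does not exist
```

With lxml the same prefix mapping can be passed to xpath through its namespaces argument.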
    

    Given this sample

    <text xml:lang="">
      <body>
        <div>
          <p>
            <p>
                <lb xml:id="p1z1"/>19.
                <lb xml:id="p1z2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                <lb xml:id="p1z3"/>esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
                <lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
                <lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
                <lb xml:id="p1z6"/>fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
                <lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                <lb xml:id="p1z8"/>vel generum mihi per literas responsurum. Frater igitur dixit quidem
                <lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                <lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                <lb xml:id="p1z11"/>promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
                <lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
                <lb xml:id="p1z13"/>ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
                <lb xml:id="p1z14"/>nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
                <lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
                <lb xml:id="p1z16"/>per Mosen. Mea est ultro et ego retribuam eis in tempore.
                <lb xml:id="p1z17"/>De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
                <lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
            </p>
          </p>
        </div>
      </body>
    </text>
    

    Result

    <text xml:lang="">
      <body>
        <div>
          <p>
            <p>
                <lb xml:id="p1z1"/><s n="1"/>19.
                <lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                <lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
                <lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
                <lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit.<s n="4"/> Res enim ista ad me suum ad
                <lb xml:id="p1z6"/>fratrem pertinebat.<s n="5"/> Nec ita fueram abs te dimissus, quod vel tu tale
                <lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                <lb xml:id="p1z8"/>vel generum mihi per literas responsurum.<s n="6"/> Frater igitur dixit quidem
                <lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                <lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                <lb xml:id="p1z11"/>promisisses, ita faceres.<s n="7"/> Ego simulatque tergiversationem istam cognoscere
                <lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
                <lb xml:id="p1z13"/>ut dicitur.<s n="8"/> Nam quae plana sunt et integra sive dicantur sive scripsisse
                <lb xml:id="p1z14"/>nihil refert.<s n="9"/> Utut sit, ego iniuriam illam, ex qua omnes istae
                <lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
                <lb xml:id="p1z16"/>per Mosen.<s n="10"/> Mea est ultro et ego retribuam eis in tempore.
                <lb xml:id="p1z17"/><s n="11"/>De altero etiam capite accipio tuam excusationem.<s n="12"/> Quum enim tam sancte
                <lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
            </p>
          </p>
        </div>
      </body>
    </text>
    

    Set self_close = False to get the OP's paired <s n="..">...</s> labels instead. The script also restores the parsed elements back into the document:

    import re
    from xml.sax.saxutils import escape
    
    from lxml import etree
    
    doc = etree.parse('/home/luis/tmp/tmp.xml')
    # find parent element
    parent = doc.xpath('//div/p/p')[0]
    
    # keep indentation of first lb; text is escaped because it is re-parsed below
    markup = '<p>' + escape(parent.text or '')
    i = 1               # number of the next sentence
    is_open = False     # True while inside a sentence
    self_close = True   # False: wrap sentences in <s n="..">...</s>
    for t in parent.xpath('lb'):
        # parts[0] is the serialized lb, the rest is its tail split on dots and newlines
        tail_parts = [escape(s) for s in re.split(r'(\.|\n)', t.tail or '')]
        t.tail = None
        parts = [etree.tostring(t).decode('utf-8')] + tail_parts
    
        #print(parts)
        for p, e in enumerate(parts):
            skip = e.strip() == ''   # empty or whitespace-only piece
    
            if p > 0 and not is_open and not skip:
                # first non-blank piece after a closed sentence: sentence start
                if self_close:
                    parts[p] = f'<s n="{i}"/>{e}'
                else:
                    parts[p] = f'<s n="{i}">{e}'
                is_open = True
            elif is_open and e == '.':
                # a dot ends the current sentence
                if not self_close:
                    parts[p] = '.</s>'
                is_open = False
                i += 1
    
        # append this line once all of its pieces are processed
        markup += ''.join(parts)
    
    # close a last sentence that does not end with a dot
    if not self_close and is_open:
        markup += '</s>'
    
    markup += '</p>'
    # parse back to an element
    xfrag = etree.fromstring(markup)
    xfrag.tail = parent.tail
    
    # replace parent element on document
    parent.getparent().replace(parent, xfrag)
    print(etree.tostring(doc).decode('utf-8'))
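
As a variation on the script above (not the same method), the self-closing milestones can also be inserted directly into the tree, which avoids building a string and parsing it again. The following is only a sketch under stated assumptions: it uses the standard library's ElementTree, the helper name add_sentence_milestones is made up for illustration, the sample is a shortened excerpt, the p element is assumed to carry no namespace, and a dot is still the only sentence delimiter.

```python
import re
import xml.etree.ElementTree as ET


def add_sentence_milestones(p):
    """Insert <s n="k"/> before each sentence found in the tails of p's <lb/> children."""
    n = 0            # number of sentences started so far
    is_open = False  # True while inside a sentence
    for lb in p.findall('lb'):
        pieces = re.split(r'(\.|\n)', lb.tail or '')
        lb.tail = ''
        last = lb                  # element whose tail receives the next piece
        pos = list(p).index(lb)    # position of that element among p's children
        for piece in pieces:
            if not is_open and piece.strip():
                # first non-blank piece after a closed sentence: new milestone
                n += 1
                last = ET.Element('s', {'n': str(n)})
                last.tail = ''
                pos += 1
                p.insert(pos, last)
                is_open = True
            elif is_open and piece == '.':
                is_open = False    # a dot ends the current sentence
            last.tail += piece
    return n


# shortened excerpt, for illustration only
sample = '''<p>
    <lb xml:id="p1z2"/>esse Christolam meam,
    <lb xml:id="p1z3"/>esse epistolam meam interpretatum. Caeterum, quod scribis
    <lb xml:id="p1z4"/>possum satis mirari.
</p>'''

p = ET.fromstring(sample)
before = ''.join(p.itertext())
count = add_sentence_milestones(p)
after = ''.join(p.itertext())

print(count, before == after)  # 2 True
print(ET.tostring(p, encoding='unicode'))
```

Comparing the text before and after is a quick way to confirm that only tags were added and that the line breaks and indentation are untouched.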