
Problems parsing XML/XLIFF with inline elements


I am trying to parse an XLIFF (XML) variant produced by the SDL Trados translation software. The "sdlxliff" file I'm parsing contains translations and looks like this (somewhat simplified and "prettified").

XML/XLIFF file being processed ("sample.sdlxliff"):

<?xml version="1.0" encoding="utf-8"?><xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" sdl:version="1.0"><file original="\\TRADOS_SERVER\Trados\2017\Doc_Helps\en-US\import\Test.xml" datatype="x-sdlfilterframework2" source-language="en-US" target-language="hr-HR"><header><sniff-info><detected-encoding detection-level="Certain" encoding="utf-8"/><detected-source-lang detection-level="Guess" lang="en-US"/><props><value key="xmlDeclaration">true</value><value key="standalone">yes</value><value key="HasUtf8Bom">false</value><value key="IsFragment">false</value></props></sniff-info></header>
<body>
  <trans-unit id="a1f4768e-a026-46c2-b65d-599d2108d176">
    <source>
      <g id="461">Add or edit text: </g>Just begin typing. The blinking insertion point indicates where your text starts. To edit text,   <g id="462">select the text</g>, then type. Use the controls in the Format <g id="463">  <g id="464"/></g> sidebar on the right.
    </source>
    <seg-source>
      <g id="461">
      <mrk mtype="seg" mid="182">Add or edit text:</mrk> </g>
      <mrk mtype="seg" mid="183">Just begin typing.</mrk> 
      <mrk mtype="seg" mid="184">The blinking insertion point indicates where your text starts.</mrk> 
      <mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk> 
      <mrk mtype="seg" mid="186">Use the controls in the Format <g id="463"><g id="464"/></g> sidebar on the right.</mrk>
    </seg-source>
    <target>
      <g id="461">
      <mrk mtype="seg" mid="182">Dodajte ili uredite tekst:</mrk> </g>
      <mrk mtype="seg" mid="183">Samo počnite tipkati.</mrk> 
      <mrk mtype="seg" mid="184">Trepereća točka umetanja pokazuje gdje počinje vaš tekst.</mrk> 
      <mrk mtype="seg" mid="185">Za uređivanje teksta <g id="462">odaberite tekst</g>, zatim unesite tekst.</mrk> 
      <mrk mtype="seg" mid="186">Upotrijebite kontrole u rubnom stupcu Formatiraj <g id="463"><g id="464"/></g> s desne strane.</mrk>
    </target> 
    <blahblahblah></blahblahblah>
  </trans-unit>
  <trans-unit id="7f7ede5e-75b9-403a-b1c6-43f654ea8245">
    <source>
      <g id="492"><g id="493">The toolbar with buttons.</g></g>
    </source>
    <seg-source>
      <g id="492">
      <g id="493"> 
      <mrk mtype="seg" mid="199">The toolbar with buttons.</mrk></g></g>
    </seg-source>
    <target>
      <g id="492">
      <g id="493"> 
      <mrk mtype="seg" mid="199">Alatna traka sa tipkama.</mrk></g></g>
    </target>
    <blahblahblah></blahblahblah>
  </trans-unit>
</body>
</file></xliff>

So, the XML/XLIFF file has "seg-source" and "target" parts. Those are the parts I am interested in: I want to extract them and later print them to a plain tab-delimited TXT file (or a similar format).

However, I am having problems with inline tags - like in this line:

<mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk> 

Here I am getting only the part of the string before the first inline '<g id="xxx">' tag :(

Instead of "To edit text, select the text, then type.", I am getting only "To edit text,".

Python code I have tried:

# parsesdlxliff-test.py:

from lxml import etree

tree = etree.parse("sample.sdlxliff")
root = tree.getroot()

for element in root:
  pass # not important
  # now the children
  for all_tags in element.findall('.//'):
    if 'mrk' in all_tags.tag:
      attrs = all_tags.attrib
      numb = attrs.get("mid")
      # remove all internal tags within 'mrk', leave only clean string/text? - how?
      print(numb, all_tags.text)

The result I'm getting with this code:

182 Add or edit text:
183 Just begin typing.
184 The blinking insertion point indicates where your text starts.
185 To edit text, 
186 Use the controls in the Format 
182 Dodajte ili uredite tekst:
183 Samo počnite tipkati.
184 Trepereća točka umetanja pokazuje gdje počinje vaš tekst.
185 Za uređivanje teksta 
186 Upotrijebite kontrole u rubnom stupcu Formatiraj 
199 The toolbar with buttons.
199 Alatna traka sa tipkama.

As can be seen in the resulting lines with 'mid' numbers 185 and 186, the text after the first inline tag is missing (both in 'seg-source' and in 'target').

Ultimately, what I want to get is something like this (illustration only):

Add or edit text: <TAB> Dodajte ili uredite tekst:
To edit text, select the text, then type. <TAB> Za uređivanje teksta odaberite tekst, zatim unesite tekst.
Use the controls in the Format sidebar on the right. <TAB> Upotrijebite kontrole u rubnom stupcu Formatiraj s desne strane.

I.e. tab-delimited source-target sentence pairs.

I can pair them later using 'mid' numbers, but only after I manage to get the whole strings (get rid of internal tags somehow?)...

In short, how do I extract the whole strings, including the parts after any '<g id="xxx">' or '</g>' internal tags?


Solution

  • If I understand you correctly, something like this should work:

    import lxml.html as lh  # an XML parser would be more appropriate, but in this case the HTML parser is cleaner to use
    
    xliff_text = """[your xml above]"""
    doc = lh.fromstring(xliff_text.encode('utf-8'))
    # text_content() returns the text of an element together with the text of all its descendants
    engs = [e.text_content() for e in doc.xpath('//seg-source//mrk')]
    cros = [c.text_content() for c in doc.xpath('//target//mrk')]
    # zip() pairs by position, so this relies on source and target segments appearing in the same order
    for eng, cro in zip(engs, cros):
        print(eng, '<tab>', cro)
    

    Output:

    Add or edit text: <tab> Dodajte ili uredite tekst:
    Just begin typing. <tab> Samo počnite tipkati.
    The blinking insertion point indicates where your text starts. <tab> Trepereća točka umetanja pokazuje gdje počinje vaš tekst.
    To edit text, select the text, then type. <tab> Za uređivanje teksta odaberite tekst, zatim unesite tekst.
    Use the controls in the Format  sidebar on the right. <tab> Upotrijebite kontrole u rubnom stupcu Formatiraj  s desne strane.
    The toolbar with buttons. <tab> Alatna traka sa tipkama.
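
The truncation in the question comes from the element tree text model: an element's .text holds only the text before its first child, and whatever follows a child is stored in that child's .tail. As a possible alternative to the HTML parser, here is a minimal sketch that stays with an XML parser, joins all text pieces of each 'mrk' with itertext(), and pairs source and target by 'mid' rather than by position. It assumes the standard-library xml.etree.ElementTree (lxml.etree offers the same iter() and itertext() calls). The shortened SAMPLE string, the helper names segment_text, marks and segment_pairs, and the whitespace normalization are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <xliff> root element of the file in the question.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}
MRK = f"{{{XLIFF_NS}}}mrk"
UNIT = f"{{{XLIFF_NS}}}trans-unit"

# Shortened stand-in for "sample.sdlxliff" (assumption, for a self-contained demo).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
<file original="Test.xml" datatype="x-sdlfilterframework2" source-language="en-US" target-language="hr-HR">
<body>
  <trans-unit id="a1f4768e-a026-46c2-b65d-599d2108d176">
    <seg-source>
      <mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk>
      <mrk mtype="seg" mid="186">Use the controls in the Format <g id="463"><g id="464"/></g> sidebar on the right.</mrk>
    </seg-source>
    <target>
      <mrk mtype="seg" mid="185">Za uređivanje teksta <g id="462">odaberite tekst</g>, zatim unesite tekst.</mrk>
      <mrk mtype="seg" mid="186">Upotrijebite kontrole u rubnom stupcu Formatiraj <g id="463"><g id="464"/></g> s desne strane.</mrk>
    </target>
  </trans-unit>
</body>
</file>
</xliff>"""


def segment_text(mrk):
    """Return all text inside a <mrk>, inline tags dropped, whitespace collapsed."""
    # itertext() yields mrk.text, then the text and tail of every descendant, in document order.
    return " ".join("".join(mrk.itertext()).split())


def marks(parent):
    """Map 'mid' -> clean text for every <mrk> below parent (empty dict if parent is missing)."""
    if parent is None:
        return {}
    return {m.get("mid"): segment_text(m) for m in parent.iter(MRK)}


def segment_pairs(root):
    """Collect (mid, source text, target text) tuples, one per source segment."""
    pairs = []
    for unit in root.iter(UNIT):
        sources = marks(unit.find("x:seg-source", NS))
        targets = marks(unit.find("x:target", NS))
        for mid, source in sources.items():
            # A segment without a translation gets an empty target column.
            pairs.append((mid, source, targets.get(mid, "")))
    return pairs


# For a real file, ET.parse("sample.sdlxliff").getroot() would replace ET.fromstring(SAMPLE).
pairs = segment_pairs(ET.fromstring(SAMPLE))
for mid, source, target in pairs:
    print("\t".join((source, target)))
```

Collapsing whitespace in segment_text removes the double space left behind by empty inline tags such as '<g id="463"><g id="464"/></g>'. If the original spacing matters, returning "".join(mrk.itertext()) unchanged would preserve it.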