
Translating XLIFF files using BeautifulSoup


I am translating an XLIFF file using the BeautifulSoup and googletrans packages. I managed to extract all the strings, translate them, and replace them by creating a new tag containing the translation, e.g.

<trans-unit id="100890::53706_004">
<source>Continue in store</source>
<target>Kontynuuj w sklepie</target>
</trans-unit>

The problem appears when the source tag contains other tags, e.g.

<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>

The number of these tags varies, and so does the position of the text between them, e.g. <source> text1 <x /> <x/> text2 <x/> text3 </source>. Each x tag is unique, with its own id and attributes.

Is there a way to modify the text inside the tag without having to create a new tag? I was thinking I could extract the x tags and their attributes, but the order of the strings and x tags differs so much from one line to the next that I'm not sure how to do that. Maybe there is another package better suited for translating XLIFF files?


Solution

  • You can use a for loop to go through all the children of source,
    duplicate each one with copy.copy(child), and append the copy to target.
    Along the way, check whether the child is a NavigableString and, if so, replace it with its translation.


    text = '''<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
    <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>'''
    
    conversions = {
        'Choose your product': 'Wybierz swój produkt',
        'From a list: ': 'Z listy: ',
    }
    
    from bs4 import BeautifulSoup as BS
    from bs4.element import NavigableString
    import copy
    
    # HTML treats <source> as a void (empty) element, so these two parsers
    # do not keep the text and <x> tags inside it
    #soup = BS(text, 'html.parser')
    #soup = BS(text, 'html5lib')
    soup = BS(text, 'lxml')
    
    # create `<target>`
    target = soup.new_tag('target')
    
    # add `<target>` right after `<source>`
    source = soup.find('source')
    source.insert_after(target)
    
    # work with children in `<source>`
    for child in source:
        print('type:', type(child))
    
        # duplicate child and add to `<target>`
        child = copy.copy(child)
        target.append(child)
    
        # translate the text node copied into `<target>`;
        # keep the original text if no translation is known
        if isinstance(child, NavigableString):
            new_text = conversions.get(str(child), str(child))
            child.replace_with(new_text)
    
    print('--- target ---')
    print(target)
    print('--- source ---')
    print(source)
    print('--- soup ---')
    print(soup)
    

    Result (slightly reformatted to make it more readable):

    type: <class 'bs4.element.Tag'>
    type: <class 'bs4.element.NavigableString'>
    type: <class 'bs4.element.Tag'>
    type: <class 'bs4.element.NavigableString'>
    
    --- target ---
    
    <target>
      <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
      Wybierz swój produkt
      <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
      Z listy: 
    </target>
    
    --- source ---
    
    <source>
      <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
      Choose your product
      <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
      From a list: 
    </source>
    
    --- soup ---
    
    <html><body>
    <source>
      <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
      Choose your product
      <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
      From a list: 
    </source>
    <target>
      <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
      Wybierz swój produkt
      <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
      Z listy: 
    </target>
    </body></html>
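
On the question of whether another package might suit this better: here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree instead of BeautifulSoup. ElementTree stores mixed content in .text (the text before the first child) and in each child's .tail (the text after that child), so the copied <x/> tags can stay untouched while only the strings change. The names translate, add_target and TRANSLATIONS, the trans-unit id, and the lookup table standing in for a real googletrans call are all made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical lookup table standing in for a real translation call.
TRANSLATIONS = {
    "Choose your product": "Wybierz swój produkt",
    "From a list:": "Z listy:",
}


def translate(text):
    """Translate `text`, keeping its leading and trailing whitespace."""
    if text is None or not text.strip():
        return text
    core = text.strip()
    head = text[: len(text) - len(text.lstrip())]
    tail = text[len(text.rstrip()):]
    # Fall back to the original string when no translation is known.
    return head + TRANSLATIONS.get(core, core) + tail


def add_target(unit):
    """Insert a translated <target> right after <source> in one <trans-unit>."""
    source = unit.find("source")
    target = copy.deepcopy(source)          # keeps every <x/> and its attributes
    target.tag = "target"
    target.text = translate(target.text)    # text before the first <x/>
    for child in target:
        child.tail = translate(child.tail)  # text that follows each <x/>
    unit.insert(list(unit).index(source) + 1, target)
    return target


# Assumed input: a single trans-unit without an XML namespace.
xml = (
    '<trans-unit id="example">'
    '<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>'
    'Choose your product'
    '<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>'
    'From a list: </source>'
    '</trans-unit>'
)

unit = ET.fromstring(xml)
target = add_target(unit)
print(ET.tostring(unit, encoding="unicode"))
```

The sketch assumes the elements carry no namespace. Real XLIFF files usually declare one on the root element, in which case find("source") would need the namespace-qualified tag name. It also assumes the <x/> tags are direct children of <source>; nested inline elements would need a recursive walk.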