Tags: ruby, xml, xml-parsing, nokogiri, tmx

How to search for prop elements in TMX with Nokogiri


I have a TMX translation memory file that I need to parse to be able to import it into a new DB. I'm using Ruby + Nokogiri. This is the TMX (xml) structure:

<body>
<tu creationdate="20181001T113609Z" creationid="some_user">
<prop type="Att::Attribute1">Value1</prop>
<prop type="Txt::Attribute2">Value2</prop>
<prop type="Txt::Attribute3">Value3</prop>
<prop type="Txt::Attribute4">Value4</prop>
<tuv xml:lang="EN-US">
<seg>Testing</seg>
</tuv>
<tuv xml:lang="SL">
<seg>Testiranje</seg>
</tuv>
</tu>
</body>

I've only included 1 TU node here for simplicity.

This is my current script:

require 'nokogiri'

doc = File.open("test_for_import.xml") { |f| Nokogiri::XML(f) }

doc.xpath('//tu').each do |x|
  puts "Creation date: " + x.attributes["creationdate"]
  puts "User: " + x.attributes["creationid"]

  x.children.each do |y|
    puts y.children
  end

end

This yields the following:

Creation date: 20181001T113609Z
User: some_user
Value1
Value2
Value3
Value4

<seg>Testing</seg>


<seg>Testiranje</seg>

What I need to do is search for Attribute1 and its corresponding value and assign it to a variable. These values will then be used as attributes when creating translation records in the new DB. I need the same for seg, to get the source and the translation. I don't want to rely on the sequence, even though it should be (and so far is) always the same.

What is the best way to continue? All the elements are of class Nokogiri::XML::NodeSet. Even after looking at the docs for this I'm still stuck.

Can someone help?

Best, Sebastjan


Solution

  • The easiest way to traverse a node tree like this is using XPath. You've already used XPath for getting your top-level tu element, but you can extend XPath queries much further to get specific elements like you're looking for.

    The XPath cheat sheet on DevHints is a handy reference for what you can do with XPath.

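    Relative to your x variable, which points to the tu element, here are the XPaths you'll want to use:

      • Attribute 1: prop[@type="Att::Attribute1"]
      • Each seg: .//seg (the leading dot keeps the search inside the current tu; a bare //seg searches the whole document)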

    Here's a complete code example using those XPaths. The at_xpath method returns one result, whereas the xpath method returns all results.

    require 'nokogiri'
    
    doc = File.open("test_for_import.xml") { |f| Nokogiri::XML(f) }
    
    doc.xpath('//tu').each do |x|
      puts "Creation date: " + x.attributes["creationdate"]
      puts "User: " + x.attributes["creationid"]
    
      # Get Attribute 1
      # There should only be one result for this, so using `at_xpath`
      attr1 = x.at_xpath('prop[@type="Att::Attribute1"]')
      puts "Attribute 1: " + attr1.text
    
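      # Get each seg
      # There will be many results, so using `xpath`.
      # The leading dot keeps the search inside this tu; a bare `//seg`
      # would match every seg in the whole document, not just this tu's.
      segs = x.xpath('.//seg')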
      segs.each do |seg|
        puts "Seg: " + seg.text
      end
    end
    

    This outputs:

    Creation date: 20181001T113609Z
    User: some_user
    Attribute 1: Value1
    Seg: Testing
    Seg: Testiranje