pythonxml-parsingjythongoogle-refine

How can you parse xml in Google Refine using jython/python ElementTree


I trying to parse some xml in Google Refine using Jython and ElementTree but I'm struggling to find any documentation to help me getting this working (probably not helped by not being a python coder)

Here's an extract of the XML I'm trying to parse. I'm trying to return a joined string of all the dc:indentifier:

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>J. Koenig</dc:creator>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>

Here's the code I've got so far. This is a test to return anything as right now all I'm getting is 'Error: null'

from elementtree import ElementTree as ET
element = ET.parse(value)

namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
   count += 1
return count

Solution

  • You can use a GREL expression like this, try it:

    forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")
    

    For each identifier found, give me the htmlText and join them all with commas. parseHtml() uses Jsoup.org library and really just parses tags and structure. It also knows about parsing namespaces with the format of ns|identifier and is a nice way to get what your after in this case.