openrefine

Extract XML attribute value based on element value


I have a few cells in OpenRefine with some XML in it (coming from nominatim) and, for each node, I would like to extract the value of an attribute only if the value of element in the same node is equal to a specific string ('Paris'). I am using jython to loop across the element and, if the element value is equal to Paris, return the desired attribute. Here's the code for it:

from xml.etree import ElementTree as ET
element = ET.fromstring(value).encode('utf8')
root = element.getroot()
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]

However it doesn't seems to work, even if the code seems fine to me. I get the following error:

Error: Traceback (most recent call last):
  File "<string>", line 3, in __temp_242115945__
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1313, in XML
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
  File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/parsers/expat.py", line 193, in Parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 115: ordinal not in range(128)

which appears to be more about the encoding of the characters. I added to the script .encode('utf8') but nothing change.

Here a sample of the XML:

<?xml version="1.0" encoding="UTF-8" ?>
<searchresults attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" exclude_place_ids="18482590,103398643,118557459,109798886" more_url="https://nominatim.openstreetmap.org/search/?street=11+rue+Girardon&amp;city=Paris&amp;country=France&amp;addressdetails=1&amp;extratags=1&amp;polygon_geojson=1&amp;exclude_place_ids=18482590%2C103398643%2C118557459%2C109798886&amp;format=xml" querystring="11 rue Girardon, Paris, France" timestamp="Tue, 25 Oct 22 09:32:26 +0000">
    <place address_rank="30" boundingbox="43.6242386,43.6243386,1.4264894,1.4265894" class="place" display_name="11, Rue François Girardon, Minimes - Barrière de Paris, Toulouse Nord, Toulouse, Haute-Garonne, Occitanie, France métropolitaine, 31200, France" geojson="{&quot;type&quot;:&quot;Point&quot;,&quot;coordinates&quot;:[1.4265394,43.6242886]}" importance="0.5201" lat="43.6242886" lon="1.4265394" osm_id="2084506137" osm_type="node" place_id="18482590" place_rank="30" type="house">
        <extratags/>
        <house_number>11</house_number>
        <road>Rue François Girardon</road>
        <neighbourhood>Minimes - Barrière de Paris</neighbourhood>
        <suburb>Toulouse Nord</suburb>
        <city>Toulouse</city>
        <municipality>Toulouse</municipality>
        <county>Haute-Garonne</county>
        <ISO3166-2-lvl6>FR-31</ISO3166-2-lvl6>
        <state>Occitanie</state>
        <ISO3166-2-lvl4>FR-OCC</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>31200</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
    <place address_rank="26" boundingbox="48.8872626,48.8876471,2.3372233,2.3374922" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3372233,48.8872626],[2.3372534,48.8873072],[2.337453,48.8875915],[2.3374922,48.8876471]]}" importance="0.52" lat="48.8875915" lon="2.337453" osm_id="10662867" osm_type="way" place_id="103398643" place_rank="26" type="residential">
        <extratags>
            <tag key="lit" value="yes"/>
            <tag key="surface" value="sett"/>
            <tag key="maxspeed" value="30"/>
            <tag key="sidewalk" value="both"/>
            <tag key="smoothness" value="intermediate"/>
            <tag key="cycleway:both" value="no"/>
            <tag key="zone:maxspeed" value="FR:30"/>
            <tag key="motor_vehicle:conditional" value="no @ (Su,PH 11:00-18:00)"/>
        </extratags>
        <road>Rue Girardon</road>
        <city_block>Quartier des Grandes-Carrières</city_block>
        <suburb>Paris 18e Arrondissement</suburb>
        <city_district>Paris</city_district>
        <city>Paris</city>
        <ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
        <state>Île-de-France</state>
        <ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>75018</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
    <place address_rank="26" boundingbox="48.8885135,48.8886689,2.3380551,2.3381062" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3381062,48.8885135],[2.3380648,48.8886091],[2.3380551,48.8886689]]}" importance="0.52" lat="48.8886091" lon="2.3380648" osm_id="23371363" osm_type="way" place_id="109798886" place_rank="26" type="pedestrian">
        <extratags>
            <tag key="lit" value="yes"/>
            <tag key="surface" value="paving_stones"/>
            <tag key="smoothness" value="good"/>
        </extratags>
        <road>Rue Girardon</road>
        <city_block>Quartier des Grandes-Carrières</city_block>
        <suburb>Paris 18e Arrondissement</suburb>
        <city_district>Paris</city_district>
        <city>Paris</city>
        <ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
        <state>Île-de-France</state>
        <ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
        <region>France métropolitaine</region>
        <postcode>75018</postcode>
        <country>France</country>
        <country_code>fr</country_code>
    </place>
</searchresults>

Given the sample and the code, what I am expecting as a result is:

48.8875915
48.8886091

anyone who can help or suggest some GREL alternative for it?


Solution

  • Personally, I find non-trivial Python to be a royal pain to debug in OpenRefine's Jython preview, because GREL's fluent style is much easier to build up incrementally, so here's a GREL equivalent for your Python:

    forEach(value.parseXml().select('place'),p,if(p.select('city')[0].htmlText()=='Paris',p.htmlAttr('lat'),None)).join('|')
    

    It returns 48.8875915|48.8886091 (you can't store an array in a cell)

    Having said that, there are two problems with your Python:

    The patched up code is below, but note that it only returns the first value. It would need additional modifications to return all matches concatenated together in a string.

    from xml.etree import ElementTree as ET
    root = ET.fromstring(value.encode('utf8'))
    resultsList = root.findall(".//place")
    for result in resultsList:
        typerecord = result.find("city")
        if typerecord.text == "Paris":
            return result.attrib["lat"]