php ruby xml nokogiri hpricot

How to pull data from KML/XML?


I have some data that I converted to XML from a KML file, and I'm curious how to use PHP or Ruby to get back things like the neighborhood names and coordinates. I know how to do it when the values have their own tag around them, like so:

<cities>
  <neighborhood>Gotham</neighborhood>
</cities>

but the data is unfortunately formatted as:

<SimpleData name="neighborhd">Colgate Center</SimpleData>

instead of

<neighborhd>Colgate Center</neighborhd>

This is the KML source:

How can I use PHP or Ruby to pull data from something like this? I installed some Ruby gems for parsing XML data but XML is just something I haven't worked with much.


Solution

  • Your XML is invalid, but Nokogiri will attempt to fix it up.

    Here's how to check for invalid XML/XHTML/HTML and how to rewrite the section you want.

    Here's the setup:

    require 'nokogiri'
    
    doc = Nokogiri.XML(<<EOT)
    <?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://earth.google.com/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
      <Document>
        <Schema name="Sample_Neighborhoods_Samples" id="Sample_Neighborhoods_Samples">
          <SimpleField type="int" name="nid"/>
          <SimpleField type="string" name="neighborhd"/>
          <SimpleField type="string" name="place"/>
          <SimpleField type="string" name="placecode"/>
          <SimpleField type="string" name="nbr_type"/>
          <SimpleField type="string" name="po_name"/>
          <SimpleField type="string" name="metro"/>
          <SimpleField type="string" name="country"/>
          <SimpleField type="string" name="state"/>
          <SimpleField type="string" name="statefips"/>
          <SimpleField type="string" name="county"/>
          <SimpleField type="string" name="countyfips"/>
          <SimpleField type="string" name="mcd"/>
          <SimpleField type="string" name="mcdfips"/>
          <SimpleField type="string" name="cbsa"/>
          <SimpleField type="string" name="cbsacode"/>
          <SimpleField type="string" name="cbsatype"/>
          <SimpleField type="double" name="cenlat"/>
          <SimpleField type="double" name="cenlon"/>
          <SimpleField type="int" name="color"/>
          <SimpleField type="string" name="ncs_code"/>
          <SimpleField type="string" name="release"/>
        </Schema>
        <Style id="KMLSTYLER_6">
          <LabelStyle>
            <scale>1.0</scale>
          </LabelStyle>
          <LineStyle>
            <colorMode>normal</colorMode>
          </LineStyle>
          <PolyStyle>
            <color>7f4080ff</color>
            <colorMode>random</colorMode>
          </PolyStyle>
        </Style>
        <name>Sample_Neighborhoods_NYC</name>
        <visibility>1</visibility>
        <Folder id="kml_ft_Sample_Neighborhoods_Samples">
          <name>Sample_Neighborhoods_Samples</name>
          <Folder id="kml_ft_Sample_Neighborhoods_Samples_Sample_Neighborhoods_NYC">
            <name>Sample_Neighborhoods_NYC</name>
            <Placemark id="kml_1">
              <name>Colgate Center</name>
              <Snippet> </Snippet>
              <styleUrl>#KMLSTYLER_6</styleUrl>
              <ExtendedData>
                <SchemaData schemaUrl="#Sample_Neighborhoods_Samples">
                  <SimpleData name="nid">7086</SimpleData>
                  <SimpleData name="neighborhd">Colgate Center</SimpleData>
                  <SimpleData name="place">Jersey City</SimpleData>
                  <SimpleData name="placecode">36000</SimpleData>
                  <SimpleData name="nbr_type">S</SimpleData>
                  <SimpleData name="po_name">JERSEY CITY</SimpleData>
                  <SimpleData name="metro">New York City, NY</SimpleData>
                  <SimpleData name="country">USA</SimpleData>
                  <SimpleData name="state">NJ</SimpleData>
                  <SimpleData name="statefips">34</SimpleData>
                  <SimpleData name="county">Hudson</SimpleData>
                  <SimpleData name="countyfips">34017</SimpleData>
                  <SimpleData name="mcd">Jersey City</SimpleData>
                  <SimpleData name="mcdfips">36000</SimpleData>
                  <SimpleData name="cbsa">New York-Northern New Jersey-Long Island, NY-NJ-PA</SimpleData>
                  <SimpleData name="cbsacode">35620</SimpleData>
                  <SimpleData name="cbsatype">Metro</SimpleData>
                  <SimpleData name="cenlat">40.7145135000001</SimpleData>
                  <SimpleData name="cenlon">-74.0343385</SimpleData>
                  <SimpleData name="color">1</SimpleData>
                  <SimpleData name="ncs_code">40910000</SimpleData>
                  <SimpleData name="release">1.12.2</SimpleData>
                </SchemaData>
              </ExtendedData>
              <Polygon>
                <outerBoundaryIs>
                  <LinearRing>
                    <coordinates>-74.036628,40.712211,0 -74.0357779999999,40.7120810000001,0                     -74.035535,40.7122010000001,0 -74.0348299999999,40.71209,0 -74.034903,40.711804,0 -74.033761,40.7116560000001,0 -74.0334089999999,40.7121090000001,0 -74.032996,40.7141330000001,0 -74.0331899999999,40.7141790000001,0 -74.032656,40.7162500000001,0 -74.032231,40.716194,0 -74.032049,40.716908,0 -74.033871,40.7170370000001,0 -74.035629,40.7173710000001,0 -74.035669,40.7171650000001,0 -74.036009,40.715335,0 -74.036325,40.713625,0 -74.036482,40.7123580000001,0 -74.036628,40.712211,0 </coordinates>
                  </LinearRing>
                </outerBoundaryIs>
              </Polygon>
            </Placemark>
            <Placemark id="kml_2">
              <name>Colgate Center</name>
              <Snippet> </Snippet>
              <ExtendedData>
    EOT
    

    Here's how to see if there are errors. Any time doc.errors is not empty, Nokogiri had to repair something during parsing and you have a problem.

    puts doc.errors
    

    Here's one way to find the SimpleData nodes throughout a document. I prefer CSS selectors over XPath because they're easier to read. Sometimes XPath is the better choice because it allows finer-grained searches. You need to learn them both.

    doc.search('ExtendedData SimpleData').each do |simple_data|
      node_name = simple_data['name']
      puts "<%s>%s</%s>" % [node_name, simple_data.text.strip, node_name]
    end
    

    Here's the output after running:

    Premature end of data in tag ExtendedData line 87
    Premature end of data in tag Placemark line 84
    Premature end of data in tag Folder line 44
    Premature end of data in tag Folder line 42
    Premature end of data in tag Document line 3
    Premature end of data in tag kml line 2
    <nid>7086</nid>
    <neighborhd>Colgate Center</neighborhd>
    <place>Jersey City</place>
    <placecode>36000</placecode>
    <nbr_type>S</nbr_type>
    <po_name>JERSEY CITY</po_name>
    <metro>New York City, NY</metro>
    <country>USA</country>
    <state>NJ</state>
    <statefips>34</statefips>
    <county>Hudson</county>
    <countyfips>34017</countyfips>
    <mcd>Jersey City</mcd>
    <mcdfips>36000</mcdfips>
    <cbsa>New York-Northern New Jersey-Long Island, NY-NJ-PA</cbsa>
    <cbsacode>35620</cbsacode>
    <cbsatype>Metro</cbsatype>
    <cenlat>40.7145135000001</cenlat>
    <cenlon>-74.0343385</cenlon>
    <color>1</color>
    <ncs_code>40910000</ncs_code>
    <release>1.12.2</release>
    

    The code above only reads the document and doesn't modify the DOM, but modifying it is easy to do:

    doc.search('ExtendedData SimpleData').each do |simple_data|
      node_name = simple_data['name']
      simple_data.replace("<%s>%s</%s>" % [node_name, simple_data.text.strip, node_name])
    end
    
    puts doc.to_xml
    

    After running, this is the affected section:

    <ExtendedData>
      <SchemaData schemaUrl="#Sample_Neighborhoods_Samples">
        <nid>7086</nid>
        <neighborhd>Colgate Center</neighborhd>
        <place>Jersey City</place>
        <placecode>36000</placecode>
        <nbr_type>S</nbr_type>
        <po_name>JERSEY CITY</po_name>
        <metro>New York City, NY</metro>
        <country>USA</country>
        <state>NJ</state>
        <statefips>34</statefips>
        <county>Hudson</county>
        <countyfips>34017</countyfips>
        <mcd>Jersey City</mcd>
        <mcdfips>36000</mcdfips>
        <cbsa>New York-Northern New Jersey-Long Island, NY-NJ-PA</cbsa>
        <cbsacode>35620</cbsacode>
        <cbsatype>Metro</cbsatype>
        <cenlat>40.7145135000001</cenlat>
        <cenlon>-74.0343385</cenlon>
        <color>1</color>
        <ncs_code>40910000</ncs_code>
        <release>1.12.2</release>
      </SchemaData>
    </ExtendedData>