
XPath expressions for extracting information from AWIS (Alexa.com) XML data


I can't manage to extract information from AWIS results (which contain Alexa data).

I have a bunch of XML files containing AWIS data, from which I want to extract values such as Rank and PageViews for the 3-month period.

The two namespaces collide (both are bound to the same aws prefix), which I find confusing, and my XPath expressions do not work as intended. (Even a simple //aws:Rank/text() does not work.)

It would be great if somebody could help me get going.

Currently, I am using the JDOM library, but I wouldn't mind using something else. This is a tiny example that does not work as expected:

Document doc = new SAXBuilder().build(file);
XPath xpath = XPath.newInstance("//aws:Rank");
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
Element rank = (Element) xpath.selectSingleNode(doc);

I'd prefer to use javax.xml though...

Here's an example of the XML:

<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>XXXX-XXXX-XXXX-XXXX-XXXX</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>

  <aws:ContactInfo>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:PhoneNumbers>
      <aws:PhoneNumber>+33 140289796</aws:PhoneNumber>
    </aws:PhoneNumbers>
    <aws:OwnerName>John Fay</aws:OwnerName>
    <aws:Email>hostmaster@superbregistrar.net</aws:Email>
    <aws:PhysicalAddress>
      <aws:Streets>
        <aws:Street>22 rue Saint Sauveur</aws:Street>
      </aws:Streets>
      <aws:City>Paris 75002,</aws:City>
      <aws:Country>FRANCE</aws:Country>
    </aws:PhysicalAddress>
    <aws:CompanyStockTicker/>
  </aws:ContactInfo>
  <aws:ContentData>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>Ah Paris</aws:Title>
      <aws:Description>Short term apartment rentals. Search engine, descriptions, photos, rates.</aws:Description>
      <aws:OnlineSince>26-Feb-2003</aws:OnlineSince>
    </aws:SiteData>
    <aws:Keywords>
      <aws:Keyword>Français</aws:Keyword>
      <aws:Keyword>Ile-de-France</aws:Keyword>
    </aws:Keywords>
    <aws:OwnedDomains>
      <aws:OwnedDomain>
        <aws:Domain>paris-tournament.org</aws:Domain>
        <aws:Title>paris-tournament.org</aws:Title>
      </aws:OwnedDomain>
    </aws:OwnedDomains>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:Rank>2547606</aws:Rank>
    <aws:RankByCountry/>
    <aws:RankByCity/>
    <aws:UsageStatistics>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>3</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>2547606</aws:Value>
          <aws:Delta>-658661</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>2964984</aws:Value>
            <aws:Delta>-152875</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.28</aws:Value>
            <aws:Delta>-10.64%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.01</aws:Value>
            <aws:Delta>+100%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>2143379</aws:Value>
            <aws:Delta>-1628449</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>4.0</aws:Value>
            <aws:Delta>+120%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>1430628</aws:Value>
          <aws:Delta>-3224794</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>1656655</aws:Value>
            <aws:Delta>-5103474</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.5</aws:Value>
            <aws:Delta>+500%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.02</aws:Value>
            <aws:Delta>+100%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>1279227</aws:Value>
            <aws:Delta>-859817</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>4</aws:Value>
            <aws:Delta>-63.11%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Days>7</aws:Days>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>1927968</aws:Value>
          <aws:Delta>+757770</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>2942088</aws:Value>
            <aws:Delta>+1612570</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.3</aws:Value>
            <aws:Delta>-64.64%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.05</aws:Value>
            <aws:Delta>+80%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>708394</aws:Value>
            <aws:Delta>-413955</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>10</aws:Value>
            <aws:Delta>+400%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
    </aws:UsageStatistics>
    <aws:ContributingSubdomains>
      <aws:ContributingSubdomain>
        <aws:DataUrl>ahparis.com</aws:DataUrl>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Reach>
          <aws:Percentage>100.00%</aws:Percentage>
        </aws:Reach>
        <aws:PageViews>
          <aws:Percentage>100.00%</aws:Percentage>
          <aws:PerUser>4</aws:PerUser>
        </aws:PageViews>
      </aws:ContributingSubdomain>
    </aws:ContributingSubdomains>
  </aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>

Solution

  • It looks like a typo in the namespace URI - your code has

    xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
    

    (with a trailing slash) but the document has

    xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"
    

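    (without the slash). Namespace URIs are compared as literal strings, so the two do not match and //aws:Rank selects nothing.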

    "I'd prefer to use javax.xml though..."

    Namespace handling is a real pain in javax.xml.xpath, because the Java class library provides no default implementation of the NamespaceContext interface. You have to either implement your own or use a third-party implementation (I usually go for the SimpleNamespaceContext from Spring). If you're going to be doing a lot of XPath manipulation, I'd suggest looking at Saxon 9 (the HE version is free of charge) and using its s9api, as this supports the far more powerful version 2.0 of the XPath language.
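For the javax.xml route, a hand-written NamespaceContext could look roughly like the sketch below. It is only an illustration under some assumptions: the class names AwisXPathSketch and SinglePrefixContext, the evaluate helper, and the trimmed SAMPLE string are made up for this example and are not part of AWIS, JDOM, or the original post. The context handles a single prefix and ignores the reserved xml and xmlns prefixes, which a complete implementation would also cover.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class AwisXPathSketch {

    // URI bound to aws: on the inner aws:Response element (no trailing slash).
    public static final String AWIS_NS = "http://awis.amazonaws.com/doc/2005-07-11";

    // Hypothetical, trimmed-down stand-in for the AWIS response in the question.
    public static final String SAMPLE =
        "<aws:UrlInfoResponse xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
      + "<aws:Response xmlns:aws=\"" + AWIS_NS + "\">"
      + "<aws:UrlInfoResult><aws:Alexa><aws:TrafficData>"
      + "<aws:Rank>2547606</aws:Rank>"
      + "<aws:UsageStatistics>"
      + "<aws:UsageStatistic>"
      + "<aws:TimeRange><aws:Months>3</aws:Months></aws:TimeRange>"
      + "<aws:PageViews><aws:PerUser><aws:Value>4.0</aws:Value></aws:PerUser></aws:PageViews>"
      + "</aws:UsageStatistic>"
      + "<aws:UsageStatistic>"
      + "<aws:TimeRange><aws:Months>1</aws:Months></aws:TimeRange>"
      + "<aws:PageViews><aws:PerUser><aws:Value>4</aws:Value></aws:PerUser></aws:PageViews>"
      + "</aws:UsageStatistic>"
      + "</aws:UsageStatistics>"
      + "</aws:TrafficData></aws:Alexa></aws:UrlInfoResult>"
      + "</aws:Response></aws:UrlInfoResponse>";

    /** Minimal NamespaceContext that knows exactly one prefix/URI pair. */
    static final class SinglePrefixContext implements NamespaceContext {
        private final String prefix;
        private final String uri;

        SinglePrefixContext(String prefix, String uri) {
            this.prefix = prefix;
            this.uri = uri;
        }

        @Override
        public String getNamespaceURI(String p) {
            if (p == null) throw new IllegalArgumentException("prefix is null");
            return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceUri) {
            return uri.equals(namespaceUri) ? prefix : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceUri) {
            return uri.equals(namespaceUri)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    /** Evaluates an XPath expression with the aws prefix bound to awsUri. */
    public static String evaluate(String xml, String awsUri, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; needed for prefixed steps
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new SinglePrefixContext("aws", awsUri));
            return xpath.evaluate(expression, doc);
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String rank = evaluate(SAMPLE, AWIS_NS, "//aws:TrafficData/aws:Rank");
        String perUser = evaluate(SAMPLE, AWIS_NS,
            "//aws:UsageStatistic[aws:TimeRange/aws:Months = 3]"
          + "/aws:PageViews/aws:PerUser/aws:Value");
        System.out.println("Rank: " + rank);
        System.out.println("Page views per user (3 months): " + perUser);
        // With the trailing slash, aws maps to a namespace no element uses.
        String miss = evaluate(SAMPLE, AWIS_NS + "/", "//aws:Rank");
        System.out.println("With trailing slash: '" + miss + "'");
    }
}
```

The predicate [aws:TimeRange/aws:Months = 3] picks the 3-month UsageStatistic block, so the same pattern should work for the other values under it (for example aws:Rank/aws:Value). Because the prefix is bound only to the inner namespace, elements that belong to the outer http://alexa.amazonaws.com/doc/2005-10-05/ namespace are not matched by aws: steps here.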