Tags: java, android, http, xpath, htmlcleaner

HTML Cleaner + XPath Not Working in Android App


I'm building a simple news reader app and I'm using HtmlCleaner to retrieve and parse the data. I've successfully gotten the data I need using the command-line version of HtmlCleaner, and also with xmllint, for example:

java -jar htmlcleaner-2.6.jar src=http://www.reuters.com/home nodebyxpath=//div[@id=\"topStory\"]

and

curl www.reuters.com | xmllint --html --xpath '//div[@id="topStory"]' -

both return the data I want. But when I make this same request using HtmlCleaner in my code, I get no results. Even more troubling, even a basic query like //div returns only 8 nodes in my app, while the command line reports 70+, which is correct.
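One way to rule out the XPath expression itself is to run the same query through the JDK's built-in DOM and XPath support against a small well-formed snippet. This is a minimal sketch; the markup and class name below are illustrative stand-ins, not the actual Reuters page:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathCheck {
    // Parses well-formed markup and returns how many nodes the
    // XPath expression matches.
    public static int countNodes(String html, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div id=\"topStory\"><p>Headline</p></div>"
                + "<div>other</div>"
                + "</body></html>";
        System.out.println(countNodes(html, "//div[@id=\"topStory\"]")); // 1
        System.out.println(countNodes(html, "//div"));                   // 2
    }
}
```

If the expression matches here, the query syntax is fine and the problem is in what HtmlCleaner is actually receiving over the network.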

Here is the code I have now. It lives in an Android class extending AsyncTask, so it runs in the background. The final code will actually extract the text I need, but I'm having trouble just getting it to return a result: when I log the title node count, it is zero.

I've tried every manner of escaping the XPath query strings, but it makes no difference. The HtmlCleaner code is in a separate source folder in my project and is (at least I think) compiled to Dalvik with the rest of my app, so an incompatible JAR file shouldn't be the problem.

I've tried to dump the HtmlCleaner output, but it doesn't play well with LogCat, and a lot of the page markup is missing from the dump. That made me think HtmlCleaner was parsing incorrectly and discarding most of the page, but how can that be the case when the command-line version works fine?

Also the app does not crash and I'm not logging any exceptions.

protected Void doInBackground(URL... argv) {
    final HtmlCleaner cleaner = new HtmlCleaner();
    TagNode lNode = null;
    try {
        // Fetch the page and let HtmlCleaner build its DOM.
        lNode = cleaner.clean( argv[0].openConnection().getInputStream() );
        Log.d("LoadMain", argv[0].toString());
    } catch (IOException e) {
        Log.d("LoadMain", e.getMessage());
    }
    if (lNode == null) {
        return null; // the fetch failed; nothing to query
    }

    final String lTitle = "//div[@id=\"topStory\"]";
//  final String lBlurb = "//div[@id=\"topStory\"]//p";

    try {
        Object[] x = lNode.evaluateXPath(lTitle);
//      Object[] y = lNode.evaluateXPath(lBlurb);
        Log.d("LoadMain", "Title Nodes: " + x.length);
//      Log.d("LoadMain", "Blurb Nodes: " + y.length);
//      this.mBlurbs.add(new BlurbView (this.mContext, x.getText().toString(), y.getText().toString() ));

    } catch (XPatherException e) {
        Log.d("LoadMain", e.getMessage());
    }

    return null;
}

Any help is greatly appreciated. Thank you.

UPDATE: I've narrowed the problem down to the HTTP request. If I load the HTML source as an asset, I get what I want, so clearly the problem is in what the HTTP request returns. In other words, lNode = cleaner.clean( getAssets().open("reuters.html") ); works fine.
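When a fetch works locally but not over the network, one quick diagnostic is to look at the raw HTTP status without following redirects. A hedged sketch using java.net.HttpURLConnection (the class and method names below are mine, not from the question):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectCheck {
    // Any 3xx status code means the server is redirecting the client.
    public static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    // Fetch the headers without following redirects, so a 3xx answer
    // (and its Location target, e.g. a mobile site) becomes visible.
    public static void printRedirect(URL url) throws java.io.IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // do NOT follow automatically
        int status = conn.getResponseCode();
        if (isRedirect(status)) {
            System.out.println(status + " -> " + conn.getHeaderField("Location"));
        }
        conn.disconnect();
    }

    public static void main(String[] args) {
        System.out.println(isRedirect(302)); // true: temporary redirect
        System.out.println(isRedirect(200)); // false: normal OK response
    }
}
```

If the Location header points at a mobile variant of the site, the app is parsing a different, much smaller page than the command-line tools see.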


Solution

  • The problem was that the HTTP request was being redirected to the mobile website. It was solved by setting the User-Agent property, like so:

    private static final String USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0";
    
    HttpURLConnection lConn = (HttpURLConnection) argv[0].openConnection();
    lConn.setRequestProperty("User-Agent", USER_AGENT);
    lConn.connect();
    lNode = cleaner.clean( lConn.getInputStream() );
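The fix generalizes into a small helper that always identifies the app as a desktop browser. A sketch under the same assumptions as the answer (the helper name is mine; the User-Agent string is simply the one that worked here, and note that no network traffic happens until connect() or getInputStream() is called):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class DesktopConnection {
    private static final String USER_AGENT =
            "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0";

    // Returns a connection whose User-Agent header claims to be a
    // desktop browser, so the server does not redirect to a mobile site.
    public static HttpURLConnection open(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open(new URL("http://www.reuters.com/home"));
        // The header is set locally; nothing has been sent yet.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The cleaned node is then obtained exactly as in the answer: cleaner.clean(DesktopConnection.open(url).getInputStream()).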