Tags: html, objective-c, parsing, hpple

Objective-C HTML parsing. Get all text between tags


I am using hpple to try and grab a torrent description from ThePirateBay. Currently, I'm using this code:

NSString *path = @"//div[@id='content']/div[@id='main-content']/div/div[@id='detailsouterframe']/div[@id='detailsframe']/div[@id='details']/div[@class='nfo']/pre/node()";
NSArray *nodes = [parser searchWithXPathQuery:path];
for (TFHppleElement * element in nodes) {
    NSString *postid = [element content];
    if (postid) {
        [texts appendString:postid];
    }
}

This returns just the plain text, and not any of the URLs for screenshots. Is there any way to get all links and other tags, not just plain text? ThePirateBay's markup is formatted like so:

<pre>
    <a href="http://img689.imageshack.us/img689/8292/itskindofafunnystory201.jpg" rel="nofollow">
    http://img689.imageshack.us/img689/8292/itskindofafunnystory201.jpg</a>
More texts about the file
</pre>

Solution

  • That's an easy job and you did it almost correctly!

    What you want is the content (or an attribute such as href) of each a tag, so the XPath has to select those a elements explicitly.

    Just change your XPath to

    @"//div[@id='content']/div[@id='main-content']/div/div[@id='detailsouterframe']/div[@id='detailsframe']/div[@id='details']/div[@class='nfo']/pre/a"
    

    (Your query was missing the a at the very end, and node() is not needed.)

    Output:

    http://www.imdb.com/title/tt1904996/
    http://leetleech.org/images/65823608764828593230.png
    http://leetleech.org/images/44748070481477652927.png
    http://leetleech.org/images/42024611449329122742.png
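
    Getting those URLs means reading something from each matched a element. The following is a minimal sketch of one way to do that, assuming hpple's TFHpple and TFHppleElement classes are available in the project. The helper name LinkURLsInHTMLData and the shortened XPath are illustrative assumptions, not part of the original code; the full path from the question should work the same way.

    ```objc
    #import <Foundation/Foundation.h>
    #import "TFHpple.h"
    #import "TFHppleElement.h"

    // Hypothetical helper: returns the href of every <a> directly inside
    // <div class="nfo"><pre>. The short XPath is an assumption made for
    // brevity; substitute the full path if the page needs it.
    static NSArray *LinkURLsInHTMLData(NSData *htmlData) {
        TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
        NSArray *anchors = [parser searchWithXPathQuery:@"//div[@class='nfo']/pre/a"];

        NSMutableArray *urls = [NSMutableArray array];
        for (TFHppleElement *anchor in anchors) {
            // objectForKey: looks up an attribute of the element by name.
            NSString *href = [anchor objectForKey:@"href"];
            if (href) {
                [urls addObject:href];
            }
        }
        return urls;
    }
    ```

    Reading the href attribute instead of the element's content sidesteps any leading whitespace or line breaks in the link text, and it does not depend on how a particular hpple version exposes the text inside an element.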

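    If you only want the screenshot URLs, skip the first match (the IMDb link in the output above) and collect the rest, for example:

    NSMutableArray *screenshotURLs = [NSMutableArray array];
    // Start at index 1 to skip the first match (the IMDb link).
    for (NSUInteger i = 1; i < nodes.count; i++) {
        TFHppleElement *link = nodes[i];
        // Store the href string rather than the TFHppleElement itself.
        NSString *href = [link objectForKey:@"href"];
        if (href) [screenshotURLs addObject:href];
    }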
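
    The loop above relies on the IMDb link always being the first match, which may not hold for every torrent page. A possible alternative, sketched here with Foundation only, is to keep the URLs whose path ends in an image extension. The helper name ImageURLsFromStrings and the list of extensions are assumptions chosen for illustration.

    ```objc
    #import <Foundation/Foundation.h>

    // Hypothetical helper: keeps only the URL strings whose path ends in a
    // common image extension. The extension list is an assumption; extend it
    // as needed.
    static NSArray *ImageURLsFromStrings(NSArray *urlStrings) {
        NSSet *imageExtensions = [NSSet setWithObjects:@"png", @"jpg", @"jpeg", @"gif", nil];
        NSMutableArray *result = [NSMutableArray array];
        for (NSString *urlString in urlStrings) {
            // URLWithString: returns nil for malformed input; messaging nil
            // yields nil, so such entries are skipped by the length check.
            NSURL *url = [NSURL URLWithString:urlString];
            NSString *extension = [[[url path] pathExtension] lowercaseString];
            if (extension.length > 0 && [imageExtensions containsObject:extension]) {
                [result addObject:urlString];
            }
        }
        return result;
    }
    ```

    Passing in the href strings collected from the matched a elements would then drop links such as the IMDb page regardless of where they appear in the description.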