
Help needed parsing HTML content with hpple on iOS


I'm currently trying to teach myself how to screen scrape in iOS, having learnt how to do so on Android.

I am using the hpple library.

I am struggling to replicate what I have on Android, so I am looking for some guidance on how to use hpple correctly to parse my HTML content.

I'm currently trying to parse the following content from my HTML website:

<table class="tableForAppContent">     

<tr>
<td nowrap="nowrap">
<a href='testLink'>CODE</a> MyTestCode</td>
<td nowrap>
<a href='testLink'>Number 123</a></td>
<td></td>
<td>Company Name</td>
<td nowrap>
11:10 AM
</td>
<td class="tableList" nowrap>
</td>
<td>
</td>
<td nowrap>
Status of company
<br />
</td>
<td>
</td>
</tr>

</table>

I need to be able to get all the text values you see in the HTML, so I need to be able to get the values: "CODE MyTestCode", "Number 123", "Company Name", "11:10 AM" and "Status of company".

Here is the code I have so far:

NSURL *url = [NSURL URLWithString:@"MyTestSite.com"];
NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
[request setTimeoutInterval: 30.0]; // Will timeout after 30 seconds
[NSURLConnection sendAsynchronousRequest:request
                                   queue:[NSOperationQueue currentQueue]
                       completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) {

    if (data != nil && error == nil)
    {
        NSString *result = [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
        TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:data encoding:@"NSASCIIStringEncoding"];
        // The class name must match the one in the HTML above
        NSString *tutorialsXpathQueryString = @"//table[@class='tableForAppContent']//td";
        NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];

        NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
        for (TFHppleElement *element in tutorialsNodes) {
            // Logs only the content of each cell's first child node
            NSLog(@"%@", [[element firstChild] content]);
        }
    }
    else
    {
        // There was an error, alert the user
    }
}];

I can't figure out the correct XPath query string for the following line of code:

NSString *tutorialsXpathQueryString = @"//table[@class='tableForAppContent']//td";

No matter what I try, I can only find one of the elements at a time, so I can get the "Company Name" value but nothing else.

Can anyone help with the query string?


Solution

  • Try the XPath expression

    //table[@class='tableForAppContent']//*[normalize-space(text()) != '']
    

    which should return every element whose text is not just whitespace (strictly speaking, normalize-space(text()) only looks at the element's first text node).

    EDIT

    The solution above splits formatted <td> entries into several nodes which is not what you want. So, in fact your original XPath seems to be the right approach as far as the level of granularity is concerned.

    The following XPath

    //table[@class='tableForAppContent']//td[* or normalize-space(text()) != '']
    

    gives you the "right" <td> entries, that is, only those that contain text themselves or have at least one child element, which should cover all non-empty cells.

    However, the nodes in the result set have substructure: they contain both text nodes and child elements that in turn contain text nodes. Since the result node set is the interface between XPath and the calling routine (in Objective-C?), you will probably have to extract the text nodes from each subtree yourself and concatenate them. Maybe there are library routines that you could use for that. If not, you can always do it by recursively traversing the result node trees.
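
    The recursive traversal described above could be sketched roughly as follows. This is only a sketch under assumptions: CollectText and CollapseWhitespace are hypothetical helper names (not part of hpple), and it assumes a version of hpple in which text nodes show up among an element's children and TFHppleElement offers isTextNode. Older versions may expose text differently, so check the headers you actually have.

    ```objective-c
    #import <Foundation/Foundation.h>
    #import "TFHpple.h"
    #import "TFHppleElement.h"

    // Hypothetical helper: concatenates the text of an element and all of
    // its descendants in document order.
    static NSString *CollectText(TFHppleElement *element) {
        if ([element isTextNode]) {
            // A text node carries its string in `content`.
            return [element content] ?: @"";
        }
        NSMutableString *text = [NSMutableString string];
        for (TFHppleElement *child in [element children]) {
            [text appendString:CollectText(child)];
        }
        return text;
    }

    // Hypothetical helper: trims the string and collapses every run of
    // whitespace (including newlines) into a single space.
    static NSString *CollapseWhitespace(NSString *string) {
        NSArray *parts = [string componentsSeparatedByCharactersInSet:
                          [NSCharacterSet whitespaceAndNewlineCharacterSet]];
        NSArray *words = [parts filteredArrayUsingPredicate:
                          [NSPredicate predicateWithFormat:@"SELF != ''"]];
        return [words componentsJoinedByString:@" "];
    }
    ```

    In the loop from the question, the NSLog line would then become something like this, which skips cells that turn out to be empty:

    ```objective-c
    for (TFHppleElement *element in tutorialsNodes) {
        NSString *cellText = CollapseWhitespace(CollectText(element));
        if ([cellText length] > 0) {
            NSLog(@"%@", cellText);
        }
    }
    ```

    Under those assumptions, the first cell of the sample table should come out as one string combining the link text and the text that follows it, rather than as two separate nodes.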