macos html-parsing initwithcontentsofurl

How can I make multiple calls to initWithContentsOfURL without it eventually returning the wrong stuff?


I'm doing multiple levels of parsing of web pages, where I use information from one page to drill down and grab a "lower" page to parse. When I get to the lowest level of my hierarchy, I no longer hit a new page; I hit the same one (with different parameters) and make SQL database entries.

If I don't slow things down (by putting a sleep(1)) before that inner loop, initWithContentsOfURL eventually returns a kind of stub piece of HTML. Here's the code I use to get my HTML nodes:

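    NSError *err = nil;
    // Percent-escape the URL string so spaces and similar characters are legal.
    NSString* webStringURL = [sURL stringByAddingPercentEscapesUsingEncoding: NSUTF8StringEncoding];
    // Synchronous fetch; contentData is nil on failure and err is filled in.
    NSData *contentData = [[[NSData alloc] initWithContentsOfURL: [NSURL URLWithString: webStringURL]
                                                         options: 0
                                                           error: &err] autorelease];
    // Decode the bytes as ISO Latin-1, then re-encode as UTF-8 for the parser.
    NSString *dataString = [[[NSString alloc] initWithData: contentData
                                                  encoding: NSISOLatin1StringEncoding] autorelease];
    NSData *data = [dataString dataUsingEncoding: NSUTF8StringEncoding];
    // Build the TFHpple document used for the XPath queries.
    TFHpple *xPathDoc = [[[TFHpple alloc] initWithHTMLData: data] autorelease];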

It works fine with 4 levels of looping. In fact, it can run 24/7 with no real memory leak problem, and it only dies when I have a connection issue. That holds as long as I put in the sleep(1) before the innermost loop.

It's like it's too fast and initWithContentsOfURL can't keep up. I suppose I could try to do something asynchronous, but this is not for user consumption and the direct synchronous looping works just fine... almost. I've tried different ways of slowing things down. Pausing for one second on a regular basis works, but if I take that out, it starts getting bogus data after about 10 times through the inner loop. Is there a way to handle this properly?


Solution

  • I don't think it's a problem with initWithContentsOfURL; rather, I suspect it's the server or the network that is unable to respond that quickly.

    The following assumes that's the case.

    If you want to detect network errors and/or server error responses, you need to use NSURLConnection. initWithContentsOfURL hands back only the data (or nil plus an NSError), so there is no way to see the status code the server actually sent. Alternatively, if you know what the stub page looks like, or you know a magic string that appears only in a successful response, you can check the returned NSData against those.
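
    Both checks can be combined in a small fetch helper. What follows is only a minimal sketch, not a drop-in fix: the names ResponseLooksValid and FetchPage, the marker string, and the linear back-off between attempts are all assumptions made for illustration. It uses NSURLConnection's synchronous call to match the looping style in the question; on newer SDKs that call is deprecated in favour of NSURLSession.

    ```objc
    #import <Foundation/Foundation.h>

    // Hypothetical helper: YES when a response looks like a real page.
    // `marker` is an assumed "magic string" (plain ASCII) that only appears
    // in a successful response; pass nil to skip that check.
    static BOOL ResponseLooksValid(NSInteger statusCode, NSData *body, NSString *marker) {
        if (statusCode != 200 || [body length] == 0) {
            return NO;
        }
        NSData *needle = [marker dataUsingEncoding:NSUTF8StringEncoding];
        if ([needle length] == 0) {
            return YES;  // no marker supplied, status and length are all we know
        }
        NSRange found = [body rangeOfData:needle
                                  options:0
                                    range:NSMakeRange(0, [body length])];
        return found.location != NSNotFound;
    }

    // Hypothetical helper: fetch synchronously and retry when the server
    // answers with an error or a stub page. Returns nil if every attempt fails.
    static NSData *FetchPage(NSURL *url, NSString *marker, NSUInteger maxAttempts) {
        for (NSUInteger attempt = 0; attempt < maxAttempts; attempt++) {
            NSURLResponse *response = nil;
            NSError *error = nil;
            NSData *body = [NSURLConnection sendSynchronousRequest:[NSURLRequest requestWithURL:url]
                                                 returningResponse:&response
                                                             error:&error];
            // Only HTTP responses carry a status code.
            NSInteger status = 0;
            if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
                status = [(NSHTTPURLResponse *)response statusCode];
            }
            if (body != nil && ResponseLooksValid(status, body, marker)) {
                return body;
            }
            NSLog(@"Attempt %lu failed (status %ld, error %@)",
                  (unsigned long)(attempt + 1), (long)status, error);
            // Assumed policy: wait 1s, 2s, 3s, ... before trying again.
            [NSThread sleepForTimeInterval:(NSTimeInterval)(attempt + 1)];
        }
        return nil;
    }
    ```

    In the inner loop, the NSData returned by something like FetchPage(url, @"some string from a good page", 5) would take the place of contentData, so the pause only happens when the server actually returns something bogus rather than on every pass.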