ios, nsxmlparser, nsinputstream

Streaming NSXMLParser with NSInputStream


Update:

NSXMLParser's initWithContentsOfURL: initializer does not appear to parse the XML feed as it is downloaded. Instead, it appears to load the entire XML file into memory and only then start parsing. This is problematic if the XML feed is large: it uses an excessive amount of RAM, and it is inherently inefficient because parsing only starts once the download is done rather than running in parallel with it.

Has anyone discovered how to parse as the feed is being streamed to the device using NSXMLParser? Yes, you can use LibXML2 (as discussed below), but it seems like it should be possible to do it with NSXMLParser. But it's eluding me.

Original question:

I was wrestling with using NSXMLParser to read XML from a web stream. If you use initWithContentsOfURL, the interface may lead one to infer that it streams the XML from the web, but it doesn't seem to do so. Rather, it appears to load the entire XML file before any parsing takes place. For modest-sized XML files that's fine, but for really large ones it's problematic.

I have seen discussions of using NSXMLParser in conjunction with initWithStream and some customized NSInputStream that streams from the web. For example, there have been answers that suggest using something like the CFStreamCreateBoundPair referred to in the following Cocoa Builder post and the discussion of Setting Up Socket Streams in the Apple Stream Programming Guide, but I have not gotten it to work. I even tried writing my own NSInputStream subclass that used an NSURLConnection (which is, itself, pretty good at streaming), but I wasn't able to get it to work in conjunction with NSXMLParser.

In the end, I decided to use LibXML2 rather than NSXMLParser, as demonstrated in the Apple XMLPerformance sample, but I was wondering if anyone had any luck getting streaming from a web source working with NSXMLParser. I've seen plenty of "theoretically you could do x" sort of answers, suggesting everything from CFStreamCreateBoundPair to grabbing the HTTPBodyStream from NSURLRequest, but I've yet to come across a working demonstration of streaming with NSXMLParser.

The Ray Wenderlich article How To Choose The Best XML Parser for Your iPhone Project seems to confirm that NSXMLParser is not well suited for large XML files, but with all of the posts about possible NSXMLParser-based workarounds for streaming really large XML files, I'm surprised I have yet to find a working demonstration of this. Does anyone know of a functioning NSXMLParser implementation that streams from the web? Clearly, I can just stick with LibXML2 or some other equivalent XML parser, but the notion of streaming with NSXMLParser seems tantalizingly close.


Solution

  • -[NSXMLParser initWithStream:] is the only interface to NSXMLParser that currently performs a streaming parse of the data. Hooking it up to an asynchronous NSURLConnection that's providing data incrementally is unwieldy because NSXMLParser takes a blocking, "pull"-based approach to reading from the NSInputStream. That is, -[NSXMLParser parse] does something like the following when dealing with an NSInputStream:

    while (1) {
        NSInteger length = [stream read:buffer maxLength:maxLength];
        if (!length)
            break;
    
        // Parse data …
    }
    

    To feed data to this parser incrementally, a custom NSInputStream subclass is needed. It funnels the data received by the NSURLConnectionDelegate callbacks, which arrive on a background queue or run loop, over to the -read:maxLength: call that NSXMLParser is blocked on.

    A proof-of-concept implementation follows:

    #import <Foundation/Foundation.h>
    
    @interface ReceivedDataStream : NSInputStream <NSURLConnectionDelegate>
    @property (retain) NSURLConnection *connection;
    @property (retain) NSMutableArray *bufferedData;
    @property (assign, getter=isFinished) BOOL finished;
    @property (retain) dispatch_semaphore_t semaphore;
    @end
    
    @implementation ReceivedDataStream
    
    - (id)initWithContentsOfURL:(NSURL *)url
    {
        if (!(self = [super init]))
            return nil;
    
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        self.connection = [[[NSURLConnection alloc] initWithRequest:request delegate:self startImmediately:NO] autorelease];
        self.connection.delegateQueue = [[[NSOperationQueue alloc] init] autorelease];
        self.bufferedData = [NSMutableArray array];
        self.semaphore = dispatch_semaphore_create(0);
    
        return self;
    }
    
    - (void)dealloc
    {
        self.connection = nil;
        self.bufferedData = nil;
        self.semaphore = nil;
    
        [super dealloc];
    }
    
    - (BOOL)hasBufferedData
    {
        @synchronized (self) { return self.bufferedData.count > 0; }
    }
    
    #pragma mark - NSInputStream overrides
    
    - (void)open
    {
        NSLog(@"open");
        [self.connection start];
    }
    
    - (void)close
    {
        NSLog(@"close");
        [self.connection cancel];
    }
    
    - (NSInteger)read:(uint8_t *)buffer maxLength:(NSUInteger)maxLength
    {
        NSLog(@"read:%p maxLength:%lu", buffer, (unsigned long)maxLength);
        if (self.isFinished && !self.hasBufferedData)
            return 0;
    
        // Signals accumulate whenever data arrives while the buffer is non-empty,
        // so re-check the state after every wake-up instead of waiting only once.
        while (!self.isFinished && !self.hasBufferedData)
            dispatch_semaphore_wait(self.semaphore, DISPATCH_TIME_FOREVER);
    
        if (self.isFinished && !self.hasBufferedData)
            return 0;
    
        NSData *data = nil;
        @synchronized (self) {
            data = [[self.bufferedData[0] retain] autorelease];
            [self.bufferedData removeObjectAtIndex:0];
            if (data.length > maxLength) {
                NSData *remainingData = [NSData dataWithBytes:(const uint8_t *)data.bytes + maxLength length:data.length - maxLength];
                [self.bufferedData insertObject:remainingData atIndex:0];
            }
        }
    
        NSUInteger copiedLength = MIN([data length], maxLength);
        memcpy(buffer, [data bytes], copiedLength);
        return copiedLength;
    }
    
    
    #pragma mark - NSURLConnectionDelegate methods
    
    - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
    {
        NSLog(@"connection:%@ didReceiveData:…", connection);
        @synchronized (self) {
            [self.bufferedData addObject:data];
        }
        dispatch_semaphore_signal(self.semaphore);
    }
    
    - (void)connectionDidFinishLoading:(NSURLConnection *)connection
    {
        NSLog(@"connectionDidFinishLoading:%@", connection);
        self.finished = YES;
        dispatch_semaphore_signal(self.semaphore);
    }
    
    @end
    
    @interface ParserDelegate : NSObject <NSXMLParserDelegate>
    @end
    
    @implementation ParserDelegate
    
    - (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qualifiedName attributes:(NSDictionary *)attributeDict
    {
        NSLog(@"parser:%@ didStartElement:%@ namespaceURI:%@ qualifiedName:%@ attributes:%@", parser, elementName, namespaceURI, qualifiedName, attributeDict);
    }
    
    - (void)parserDidEndDocument:(NSXMLParser *)parser
    {
        NSLog(@"parserDidEndDocument:%@", parser);
        CFRunLoopStop(CFRunLoopGetCurrent());
    }
    
    @end
    
    
    int main(int argc, char **argv)
    {
        @autoreleasepool {
    
            NSURL *url = [NSURL URLWithString:@"http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xml"];
            ReceivedDataStream *stream = [[ReceivedDataStream alloc] initWithContentsOfURL:url];
            NSXMLParser *parser = [[NSXMLParser alloc] initWithStream:stream];
            parser.delegate = [[[ParserDelegate alloc] init] autorelease];
    
            [parser performSelector:@selector(parse) withObject:nil afterDelay:0.0];
    
            CFRunLoopRun();
    
        }
        return 0;
    }
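
In an app, as opposed to a command-line test, the blocking -[NSXMLParser parse] call probably should not run on the main thread, and a failed download should not leave -read:maxLength: waiting forever. The following is an untested sketch of both ideas, not part of the original proof of concept. It assumes ReceivedDataStream and ParserDelegate from the listing above are declared earlier in the same file and compiled without ARC. The category name FailureHandling and the helper ParseFeedInBackground are hypothetical names introduced here for illustration.

```objc
#import <Foundation/Foundation.h>

// Assumption: ReceivedDataStream and ParserDelegate (from the listing above)
// are visible here, and the file is built with manual retain/release.

@interface ReceivedDataStream (FailureHandling)
@end

@implementation ReceivedDataStream (FailureHandling)

// Treat a failed connection as end-of-stream so that a reader blocked in
// -read:maxLength: wakes up. The parser will then most likely report an
// error for the truncated document.
- (void)connection:(NSURLConnection *)connection didFailWithError:(NSError *)error
{
    NSLog(@"connection:%@ didFailWithError:%@", connection, error);
    self.finished = YES;
    dispatch_semaphore_signal(self.semaphore);
}

@end

// Hypothetical helper: runs the blocking parse on a global queue and reports
// the outcome on the main queue.
static void ParseFeedInBackground(NSURL *url, void (^completion)(BOOL success, NSError *error))
{
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        @autoreleasepool {
            ReceivedDataStream *stream = [[[ReceivedDataStream alloc] initWithContentsOfURL:url] autorelease];
            ParserDelegate *delegate = [[[ParserDelegate alloc] init] autorelease];
            NSXMLParser *parser = [[[NSXMLParser alloc] initWithStream:stream] autorelease];
            parser.delegate = delegate;

            // Blocks this worker thread, not the main thread, until the
            // stream reports end-of-data or the parser gives up.
            BOOL success = [parser parse];
            NSError *error = success ? nil : parser.parserError;

            if (completion) {
                dispatch_async(dispatch_get_main_queue(), ^{
                    completion(success, error);
                });
            }
        }
    });
}
```

A caller would pass the feed URL and a block taking (BOOL success, NSError *error), and update its UI from that block. Note that ParserDelegate still calls CFRunLoopStop in -parserDidEndDocument:. On a worker thread with no running run loop that call should be harmless, but a real delegate would presumably drop it.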