c string strstr

Using strstr to find all instances of substring results in weird string formatting


I'm making a web scraper and I'm at the point where I need to parse the incoming data. Everything was going fine until I had to find all instances of a substring in a string. I got something working, but it doesn't give me the full string I want (a complete <p></p> tag).

done = 0;

while (done == 0) {
    if ((findSpan = strstr(serverResp, "<p")) != NULL) {
        printf("%s\n", findSpan);
        if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
            strcpy(serverResp, findSpanEnd);
            strcpy(findSpanEnd+4, "");
            printf("after end tag formatting %s\n", findSpan);
        }
    } else {
        done = 1;
    }
}

The "after end tag formatting" line should print something along the lines of <p>insert text here</p>, but instead I get output like this:

        <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>

after end tag formatting <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>

after end tag formatting dy>
</html>

The site's code looks like this:

<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <h1>ignore this</h1>
        <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>

Solution

  •         if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
                strcpy(serverResp, findSpanEnd);
    

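    This is where it goes wrong. strstr does find "</p>" as requested, but it doesn't allocate a new string at all; it only returns a pointer to the match inside the old one. findSpanEnd therefore points into serverResp, so strcpy(serverResp, findSpanEnd) copies one part of the buffer over another (undefined behavior for strcpy when the regions overlap) and overwrites the data that findSpan still refers to.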
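
As a minimal sketch of that point, every match can be visited by advancing a cursor through the buffer, with no copying at all. The helper name count_occurrences is hypothetical and not part of the original code; it only assumes a null-terminated input string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts non-overlapping occurrences of needle in text
 * by advancing a cursor, never by copying or modifying the buffer. */
size_t count_occurrences(const char *text, const char *needle)
{
    size_t needle_len = strlen(needle);
    size_t count = 0;

    if (needle_len == 0)
        return 0; /* an empty needle would match at every position */

    /* strstr returns a pointer INTO text (or NULL), so the next search
     * simply starts just past the previous match. */
    for (const char *cursor = text;
         (cursor = strstr(cursor, needle)) != NULL;
         cursor += needle_len)
        ++count;

    return count;
}
```

Called on the page from the question with "</p>" as the needle, this should visit both closing tags while leaving serverResp untouched.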

    A routine to print out all <p> tags would look like this (note that this assumes no nested <p> tags):

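        for (char *ptr = serverResp; (ptr = strstr(ptr, "<p")) != NULL;)
        {
            /* Skip past the '>' that ends the opening tag. */
            char *finger = strchr(ptr, '>');
            if (!finger) break;
            ++finger;
            /* Find the closing tag; the text lies between finger and ptr. */
            ptr = strstr(finger, "</p>");
            if (!ptr) {
                /* Unterminated tag: print the rest and stop, because ptr
                   is NULL and must not be passed back to strstr. */
                fwrite(finger, 1, strlen(finger), stdout);
                fputs("\r\n", stdout);
                break;
            }
            fwrite(finger, 1, (size_t)(ptr - finger), stdout);
            fputs("\r\n", stdout);
        }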
    

    The technique: the call to strstr in the for loop locates the next <p> tag, strchr finds the end of it, then another strstr finds the closing </p>. Because the returned pointers point into the original string, the text between them is not null-terminated on its own, so we use fwrite with an explicit length instead of printf("%s") to produce output.
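
Since the question asks for the complete <p></p> element rather than only its contents, the same pointer arithmetic can be sketched as a small helper that reports where each element starts and how long it is. The name next_paragraph and its signature are assumptions made for illustration, not part of the original answer. Like the loop above, it assumes no nested <p> tags, and the "<p" search would also match tags such as <pre>.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the original answer.
 * Finds the next complete <p ...>...</p> element at or after cursor.
 * On success it stores the element's start and length (tags included)
 * and returns a pointer just past "</p>" to resume from.
 * Returns NULL when no complete element is left. */
const char *next_paragraph(const char *cursor, const char **start, size_t *len)
{
    const char *open = strstr(cursor, "<p");
    if (open == NULL)
        return NULL;

    const char *close = strstr(open, "</p>");
    if (close == NULL)
        return NULL; /* unterminated element: treated as no match */

    close += strlen("</p>"); /* include the closing tag itself */
    *start = open;
    *len = (size_t)(close - open);
    return close;
}
```

A caller could loop with while ((cur = next_paragraph(cur, &s, &n)) != NULL) and print each element with printf("%.*s\n", (int)n, s); the %.*s precision limits printf to n bytes, which is an alternative to fwrite for text that is not null-terminated.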