cstringsplit

Why no split function in C?


There is no Standard function in C to take a string, break it up at whitespace or other delimiters, and create an array of pointers to char, in one step. If you want to do that sort of thing, you have to do it yourself, either completely by hand, or by calling e.g. strspn and strpbrk in a loop, or by calling strtok in a loop, or by calling strsep in a loop.

I am not asking how to do this. I know how to do this, and there are plenty of other questions on Stack Overflow about how to do it. What I'm asking is whether there are any good reasons why there's no such function.

I know the two main reasons, of course: "Because no mainstream compiler/library ever had one" and "Because the C Standard didn't specify one, either (because it likes to standardize existing practice)." But are there any other reasons? (Are there arguments that such a function is an actively bad idea?)

This is usually a lame and pointless sort of question, I know. In this case I'm fixated on it because convenient splitting is such a massively useful operation. I wrote my own string splitter within my first year as a C programmer, I think, and it's been a huge productivity enhancer for me ever since. There are dozens of questions here on SO every day that could be answered easily (or that wouldn't even have to be asked) if there were a standard split function that everyone could use and refer to.

To be clear, the function I'm imagining would have a signature like

int split(char *string, char **argv, int maxargs, const char *delim)

It would break up string into at most maxargs substrings, splitting on one or more characters from delim, placing pointers to the substrings into argv, and modifying string in the process.

And to head off an argument I'm sure someone will make: although it's standard, I've never considered strtok to be a good solution. Saying "you don't need a split function, because strtok exists" is kind of like saying "You don't need printf, because puts exists." This is not a question about what's theoretically possible with a given toolset; it's about what's useful and convenient. The more fundamental issue here, I guess, concerns the ineffable tradeoffs involved in picking tools that are leverageable and productivity-enhancing and that "pay their way". (I think it's clear that a nicely encapsulated string-splitting function would pay its way handsomely, but perhaps that's just me.)


Solution

  • I'll attempt an answer. I agree that such a function would be useful; it is often quite handy in the languages that have one.

    Basically you are suggesting a very simple built-in wrapper around strtok() or strtok_r(). It would be less powerful than either (the delimiter set can't change between tokens) but still useful in some cases.
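
    To illustrate what changing the delimiter buys you, here is a minimal sketch. The function parse_assignment and the "key=v1,v2,..." format are made up for the example, not taken from any library. The first strtok() call splits on '=', the following ones on ','; a fixed-delimiter split() could not tell the two separators apart.

    ```c
    #include <string.h>

    /* Hypothetical helper: parse "key=v1,v2,..." in place.
       Stores the key in *key and up to maxvals value pointers in vals.
       Returns the number of values stored, or -1 if there is no key. */
    int parse_assignment(char *line, char **key, char **vals, int maxvals)
    {
        int n = 0;
        char *tok;

        *key = strtok(line, "=");   /* first call: split on '=' */
        if (*key == NULL)
            return -1;

        /* later calls: same string, different delimiter set */
        while (n < maxvals && (tok = strtok(NULL, ",")) != NULL)
            vals[n++] = tok;

        return n;
    }
    ```

    Called on a writable buffer holding "path=/bin,/usr/bin,/opt/bin", this yields the key "path" and three values.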

    Those cases also overlap with the use cases of the scanf() family and of getopt() or getsubopt().
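
    For instance, a line with a fixed number of typed fields needs no separate splitting step. A sketch, assuming a made-up record layout of "name count ratio" separated by whitespace (struct record and parse_record are hypothetical names):

    ```c
    #include <stdio.h>

    /* Hypothetical record: "name count ratio", whitespace-separated. */
    struct record {
        char   name[16];
        int    count;
        double ratio;
    };

    /* Returns 1 if all three fields were converted, 0 otherwise.
       The input line is not modified. */
    int parse_record(const char *line, struct record *r)
    {
        /* %15s leaves room for the terminating '\0' in name[16] */
        return sscanf(line, "%15s %d %lf", r->name, &r->count, &r->ratio) == 3;
    }
    ```

    Here sscanf() splits on whitespace and converts the fields in one call, which is why I count these cases as already covered.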

    Actually I'm not sure that the remaining real use cases are that common.

    In non-trivial real-life cases you would need a true parser or a regex library; in the common specialized cases you already have scanf(), getopt(), or even strtok().

    Also, functions that modify their input string, like strtok() or yours, are generally discouraged these days (experience says they easily lead to trouble, for example when handed a string literal or a buffer the caller still needs intact).

    Most languages that provide a split feature have a real string type, often an immutable one, and implement split by creating individual substrings while leaving the original string intact.

    Following that path in C would lead either to an API not based on zero-terminated strings (maybe a start pointer and an end pointer per token) or to one that returns allocated copies (as with strdup()). Neither is really satisfying.
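
    The first option could look roughly like this. It is only a sketch: struct slice and split_slices are hypothetical names, and I store a start pointer plus a length rather than an end pointer. It relies only on strspn() and strcspn(), so the input is never written to.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical slice type: a substring given by start and length.
       It is NOT zero-terminated. */
    struct slice {
        const char *start;
        size_t      len;
    };

    /* Split s on runs of characters from delim without modifying s.
       Stores at most maxout slices in out and returns the number stored. */
    size_t split_slices(const char *s, const char *delim,
                        struct slice *out, size_t maxout)
    {
        size_t n = 0;

        while (n < maxout) {
            s += strspn(s, delim);             /* skip leading delimiters */
            if (*s == '\0')
                break;
            out[n].start = s;
            out[n].len   = strcspn(s, delim);  /* token runs to next delimiter */
            s += out[n].len;
            n++;
        }
        return n;
    }
    ```

    The catch is that a slice is not a C string: printing one needs something like printf("%.*s", (int)sl.len, sl.start), and it has to be copied before it can be passed to anything that expects a terminating '\0'.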

    In the end, if you add up the not-so-common real-life use, the ease of writing it yourself, and an API that is neither simple nor intuitive, it is no wonder that such a function wasn't included in the standard libc.

    Basically I would write something like this:

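    /* strtok_r() and strdup() are POSIX, not ISO C. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    /* Split string in place on any character from delim. Stores at most
       maxargs token pointers in argv and returns how many were stored. */
    int split(char *string, char **argv, int maxargs, const char *delim){
        char * saveptr = 0;
        int x = 0;
        char * tok = strtok_r(string, delim, &saveptr);
        /* x < maxargs: never write past argv[maxargs-1] */
        while(tok && (x < maxargs)){
            argv[x++] = tok;
            tok = strtok_r(0, delim, &saveptr);
        }
        return x;
    }
    
    int main(void){
        char * args[10];
        {
            /* 11 words: only the first 10 fit in args */
            char * str = strdup("un deux trois quatre cinq six sept huit neuf dix onze");
            if(!str) return 1;
            int res = split(str, args, sizeof(args)/sizeof(args[0]), " ");
            printf("res = %d\n", res);
            for(int x = 0; x < res ; x++){
                printf("%d:%s\n", x, args[x]);
            }
            free(str); /* the pointers in args pointed into str */
        }
    
        {
            char * str = strdup("un deux trois quatre cinq");
            if(!str) return 1;
            int res = split(str, args, sizeof(args)/sizeof(args[0]), " ");
            printf("res = %d\n", res);
            for(int x = 0; x < res ; x++){
                printf("%d:%s\n", x, args[x]);
            }
            free(str);
        }
        return 0;
    }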
    

    Looking at the code, the function you want is really very simple to write using strtok_r()... and the call site that uses the result is nearly as complicated as the function itself. In such a case I'd rather inline the loop at the call site than call into libc.
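
    Inlined, the whole thing might shrink to a single loop. This is a sketch using plain strtok() (so not reentrant); print_words is a hypothetical wrapper that exists only so the fragment compiles on its own, and the output format is borrowed from the example above.

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical call site: print each space-separated word of line and
       return how many were printed. line is modified in place by strtok(). */
    int print_words(char *line)
    {
        int n = 0;

        for (char *tok = strtok(line, " "); tok != NULL; tok = strtok(NULL, " "))
            printf("%d:%s\n", n++, tok);

        return n;
    }
    ```

    No argv array and no maxargs limit are needed when each token is consumed as soon as it is found.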

    But of course you are welcome to write and use your own if you find that simpler.