c, string, code-reuse, wchar-t, memory-optimization

Function logic reuse between char string and wchar_t string without explicit string copying?


I'm writing a data structure in C to store commands. Here is the source, pared down to the part I'm unsatisfied with:

#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <errno.h>

#include "dbg.h"
#include "commandtree.h"

struct BranchList
{
    CommandTree *tree;
    BranchList *next;
};

struct CommandTree
{
    wchar_t id;       // wchar support actually has no memory cost due to the 
    bool term;        // padding that would otherwise exist, and may in fact be
    BranchList *list; // marginally faster to access due to its alignable size.
};

static inline BranchList *BranchList_create(void)
{
    return calloc(1, sizeof(BranchList));
}

inline CommandTree *CommandTree_create(void)
{
    return calloc(1, sizeof(CommandTree));
}

int CommandTree_putnw(CommandTree *t, const wchar_t *s, size_t n)
{
    for(BranchList **p = &t->list;;)
    {
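        if(!*p)
        {
            // Check the returned pointers: errno may hold a stale value
            // and is not guaranteed to be set when calloc fails.
            *p = BranchList_create();
            if(!*p) return 1;
            (*p)->tree = CommandTree_create();
            if(!(*p)->tree)
            {
                free(*p);   // don't leave a branch with a NULL tree behind
                *p = NULL;
                return 1;
            }
            (*p)->tree->id = *s;
        }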
        else if(*s != (*p)->tree->id)
        {   
            p = &(*p)->next;
            continue;
        }
        if(n == 1)
        {
            (*p)->tree->term = 1;
            return 0;
        }
        p = &(*p)->tree->list;
        s++;
        n--;

    }
}
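int CommandTree_putn(CommandTree *t, const char *s, size_t n)
{
    wchar_t *passto = malloc(n * sizeof(wchar_t));
    if(!passto) return 1;
    size_t len = mbstowcs(passto, s, n);  // wide chars actually written
    if(len == (size_t)-1 || len == 0)     // invalid sequence or empty string
    {
        free(passto);
        return 1;
    }
    int ret = CommandTree_putnw(t, passto, len);
    free(passto);
    return ret;
}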

This works perfectly well, but I'm rather unsatisfied with how I'm handling the fact that my tree supports wchar_t. I decided to add this when I realized that the padding of CommandTree would make any datatype smaller than 7 bytes cost just as much memory anyway. So as not to duplicate code, I have CommandTree_putn reuse the logic in the wchar_t-supporting CommandTree_putnw.
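Whether the wider id really costs nothing depends on the platform's alignment rules, so it is worth checking rather than assuming. A minimal sketch, where NarrowNode, WideNode and the two helper functions are hypothetical names introduced only for this comparison (they are not part of the code above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node layouts that differ only in the type of id. */
struct NarrowNode { char id;    bool term; void *list; };
struct WideNode   { wchar_t id; bool term; void *list; };

/* The wide layout is expected never to be smaller than the narrow one
   (an assumption about sane ABIs, checked at compile time, C11). */
static_assert(sizeof(struct WideNode) >= sizeof(struct NarrowNode),
              "wide node unexpectedly smaller than narrow node");

/* Report the sizes so they can be compared on the target platform. */
size_t narrow_node_size(void) { return sizeof(struct NarrowNode); }
size_t wide_node_size(void)   { return sizeof(struct WideNode); }
```

On a typical 64-bit ABI with 8-byte pointers and a 4-byte wchar_t, both functions are likely to return the same value, because the pointer's alignment pads either layout out to the same size. On a 32-bit target the wide node may well be larger.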

However, because char and wchar_t differ in size, I can't just pass the array; I have to convert with mbstowcs and pass a temporary wchar_t * to CommandTree_putnw. This is suboptimal: CommandTree_putn is going to see the most usage, and the temporary copy quintuples the memory used for the string being inserted (from sizeof (char) to sizeof (char) + sizeof (wchar_t) per character, assuming a 4-byte wchar_t), which could add up if lots of these are instantiated with longish commands.

I was wondering if I could do something like create a third function that would contain the logic and take a size_t, depending on the value of which it would cast the string (passed to it as a void *) to either const char * or const wchar_t *. But given that C is statically typed, I'd have to pretty much duplicate the logic with s cast to its respective type, which would ruin the idea I'm going for of a "single instance of logic".

So ultimately, the question is: can I write the program logic only once and have thin wrappers taking const char * and const wchar_t * respectively call it, without creating a temporary wchar_t * in the function that handles const char *?


Solution

  • I don't know your hard requirements, but wchar_t tends to be difficult to work with precisely because of this problem; it's too hard to mesh with existing code that uses char.

    All of the codebases I've worked with eventually migrated to UTF-8, which removes the need to store strings in a different type. UTF-8 works with the standard strcpy/strlen family of string functions (which then operate on bytes rather than characters) and can represent all of Unicode. The only challenge is that you will need to convert it to UTF-16 to invoke the Windows Unicode APIs. (OS X can use UTF-8 directly.) You didn't mention a platform, so I don't know if this will be an issue for you. In our case we just wrote Win32 wrappers that took UTF-8 strings.

    Can you use C++? If so, and the actual type wchar_t is important (rather than Unicode support), you can templatize the functions and then instantiate them with std::wstring or std::string depending on string width. You can also write them to be based on char and wchar_t if you are brave, but you'll need to write special wrapper functions to handle basic operations like strcpy versus wcscpy and so it ends up being more work overall by far.

    In plain C, I don't think there's a silver bullet at all. There are yucky answers, but none I could recommend with a straight face.
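For illustration, one of the less pleasant plain-C options is to keep a single copy of the insertion logic that takes the string as a const void * plus an element width, and to fetch each character through a small helper. This is a minimal sketch under assumptions, not the asker's code: Node, Branch, char_at, put_generic, put_narrow, put_wide and contains are hypothetical names, and the narrow path widens each byte as-is instead of decoding multibyte sequences the way mbstowcs would, so it only matches the original behaviour for single-byte encodings such as ASCII.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-ins for the question's types, so the sketch compiles alone. */
typedef struct Node Node;
typedef struct Branch Branch;
struct Branch { Node *tree; Branch *next; };
struct Node   { wchar_t id; bool term; Branch *list; };

/* Read element i of a string whose elements are `width` bytes wide. */
static wchar_t char_at(const void *s, size_t i, size_t width)
{
    assert(width == sizeof(char) || width == sizeof(wchar_t));
    if (width == sizeof(char))
        return (wchar_t)((const unsigned char *)s)[i]; /* no multibyte decoding */
    return ((const wchar_t *)s)[i];
}

/* The only copy of the insertion logic; returns 0 on success, 1 on failure. */
static int put_generic(Node *t, const void *s, size_t n, size_t width)
{
    if (n == 0) return 1;
    size_t i = 0;
    for (Branch **p = &t->list;;) {
        wchar_t c = char_at(s, i, width);
        if (!*p) {
            /* No sibling matched: append a new branch for c. */
            Branch *b = calloc(1, sizeof *b);
            if (!b) return 1;
            b->tree = calloc(1, sizeof *b->tree);
            if (!b->tree) { free(b); return 1; }
            b->tree->id = c;
            *p = b;
        } else if (c != (*p)->tree->id) {
            p = &(*p)->next;          /* try the next sibling */
            continue;
        }
        if (++i == n) {               /* last character: mark end of command */
            (*p)->tree->term = true;
            return 0;
        }
        p = &(*p)->tree->list;        /* descend one level */
    }
}

/* Thin typed wrappers; neither allocates a temporary string. */
int put_narrow(Node *t, const char *s, size_t n)
{
    return put_generic(t, s, n, sizeof(char));
}

int put_wide(Node *t, const wchar_t *s, size_t n)
{
    return put_generic(t, s, n, sizeof(wchar_t));
}

/* Lookup helper, used only to check the result of the insertions. */
bool contains(const Node *t, const wchar_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const Branch *b = t->list;
        while (b && b->tree->id != s[i]) b = b->next;
        if (!b) return false;
        t = b->tree;
    }
    return n > 0 && t->term;
}
```

With a zero-initialised root (Node root = {0};), put_narrow(&root, "ls", 2) and put_wide(&root, L"ls", 2) end up at the same node. The cost is a width check on every character read and the loss of type safety inside put_generic, which is part of why this kind of answer is hard to recommend.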