I'm developing a C library to write a custom binary format to store levels for a game.
I had a problem about optimizing the usage of strings for this file, I have two use cases of strings in this file: cells on a grid for tiles and metadata.
The cells specify how the game world look like and ask 3 strings for what texture to display for a floor, walls and ceiling; then the metadata is key/value pairs. With my first attempts creating a big grid filled with the same strings for the cells ends up into a file that grows exponentially, that's not good. I need to optimize this.
My attempt
The best case would be to write the string once, so I thought of a string table placed at the end of the file, each string is stored with an unique integer key. I haven't found any stardardized implementation of a string table so I've made my own.
I've made a struct that implements a dynamic set of entries, each entry contains the string with his hash and a generated unique key. The hash algorithm is really stupid, it adds the first 50 characters into one big number and collision are handled by comparing two strings with matching hashes, it's good enough.
the structure looks like this:
struct sentry_s
{
char *value;
uint32_t key, key_hash;
size_t use_count; // when hit zero entry is removed.
}
// a SET that store unique strings with a key.
struct stable_s
{
struct sentry_s *entries;
size_t count, capacity;
}
// actual func names are longer, ofc.
struct sentry_s* tput (struct stable_s *table, char *string);
struct sentry_s* tget (struct stable_s *table, uint32_t key);
void trem (struct stable_s *table, uint32_t key);
I do really like this structure because I can use binary search to increase speed for one of these things:
I want to increase reading speed to prioritize the game loading these levels.
The question
I came here to ask if there is a better implementation for a string table, what would other devs do? Do exists papers that specify a standardized string table? Could this structure be improved to be both fast at reading and writing?
I'm interested on the topic and it's not the first time I've seen the use string tables, for example compiled code have a string table for static strings.
The actual question: Is the above diatribe from:
First, don't malloc()
each individual string. Instead, malloc()
one giant array of characters and fill it with all the strings concatenated with null terminators between them. Like this:
hello world\0foo\0bar\0
Next, use the starting address of any string as its ID:
hello world\0foo\0bar\0
01234567890 1234 5678 9
So "foo" has ID 12.
Your use case doesn't seem to need support for erasing strings, and you can append any time (use realloc()
to double the table capacity if needed).
The IDs should be stored in the smallest integer type that makes sense. Probably uint16_t or uint32_t. If you want a special value you can store a null terminator at the start and its ID will be 0.
Binary search will still work, but if you need O(1) lookup by string you can add a hash table where each entry is (string_hash, ID). Then to do lookup by string you hash the string, get its ID from the hash table, and use that ID to get the full string to compare (to confirm it's not a hash collision). This way you don't need to store all the strings a second time, and the hash table only contains integers so the entire solution only has two heap allocations in use at any time. The string hash can probably be 32 bits, using a full 64 bit hash is probably overkill if you use a standard string hash function instead of your weak one.
Don't bother writing your own hash table, there are many great existing ones.