cduplicates

How does this code find duplicate characters in a string?


Example: Given a string (in this example char *word) you want to find duplicate characters (bytes).

I wanted to know if someone can explain to me how the following works:

    int table[256] = {0};
    for (int i = 0; i < len; i++)  table[word[i]]++;

After that you can check with another loop if duplicate or not like:

    for (int i = 0; i < len; i++) if (table[word[i]] > 1) { … }

How does this work? I don't understand why duplicated chars are > 1 in the table?


Solution

  • Transferring my comments into a semi-coherent answer.

    The first loop counts the number of occurrences of each byte value in the range 0..255 (it adds one to the count of the byte for each byte value it finds); the rescan finds the byte values with more than one occurrence in the string — the very definition of a duplicate.

    The loops both assume that the string is encoded in a single-byte code set and not a multi-byte code set like UTF-8, for example. They also assume that either plain char is an unsigned type (not common; most x86 platforms have a signed plain char type) or that there are no values in the string where word[i] is negative (no accented characters).

    For safety, therefore, the code should be:

    for (int i = 0; i < len; i++)
        table[(unsigned char)word[i]]++;
    
    for (int i = 0; i < len; i++)
    {
        if (table[(unsigned char)word[i]] > 1)
        {
            …
        }
    }
    

    You could use word[i] & 0xFF in place of the cast; it even has fewer characters, but I'd argue that the cast is clearer (and the character count is a red herring — please don't chase it). Note that both of these variants (cast and mask) work correctly regardless of whether the plain char is a signed or unsigned type (though the code does make the mostly reasonable assumption that CHAR_BIT is 8 and not some larger number — it can't be smaller).

    But why does it add to every char value it finds and not to just the one on the current position in the table? When I print out the memory location I see that from an example string "abcda" that both a are the same in memory. I thought in the table array they must be different (but continuous) locations; why is it the same?

    When you find 'a' in a character string, there's a byte value, usually 97, associated with that character. So when the counting code reads word[i] and the content of word[i] is 'a', it's as if there was 97 in the array (in fact, there is 97 in the array), so the first loop increments table[97] — changing it from 0 to 1 when the first a in abcda is read, and from 1 to 2 when the second a is read.

    Characters are just numbers, when all is said and done (but there's a heck of a lot of saying and doing that can happen before all is said and done; fortunately, you don't have to go through all that most of the time).