calgorithmhashtable

Evaluating Improvements in C Program Using Hash Tables: A Beginner's Perspective


I'm new to C programming and I'm trying to understand the differences between two programs that check for duplicate elements in an array using hash tables with the uthash library. I've noticed some differences in their implementation, and I'm not sure if the changes in Program 1 are indeed improvements over Program 2. Could you help me understand these differences and their implications?

Program 1

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "uthash.h"

typedef struct {
    int key;
    UT_hash_handle hh;
} hash_table;

bool containsDuplicate(int* nums, int numsSize) {
    if (numsSize == 1) {
        return false;
    }

    hash_table *hash = NULL;
    hash_table *elem = NULL;
    bool flag = false;

    for (int i = 0; i < numsSize; i++) {
        HASH_FIND_INT(hash, &nums[i], elem);

        if (!elem) {
            elem = (hash_table *)malloc(sizeof(hash_table));
            elem->key = nums[i];
            HASH_ADD_INT(hash, key, elem);
        } else {
            flag = true;
            break;
        }
    }

    hash_table *tmp;
    HASH_ITER(hh, hash, elem, tmp) {
        HASH_DEL(hash, elem);
        free(elem);
    }

    return flag;
}

Program2

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "uthash.h"

typedef struct {
    int key;
    UT_hash_handle hh;
} hash_table;

hash_table *hash = NULL, *elem, *tmp;

bool containsDuplicate(int* nums, int numsSize){
    if (numsSize == 1) {
        return false;
    }

    bool flag = false;

    for (int i=0; i<numsSize; i++) {
        HASH_FIND_INT(hash, &nums[i], elem);

        if(!elem) {
            elem = malloc(sizeof(hash_table));
            elem->key = nums[i];
            HASH_ADD_INT(hash, key, elem);
        } else {
            flag = true;
            break;
        }
    }

    HASH_ITER(hh, hash, elem, tmp) {
        HASH_DEL(hash, elem); free(elem);
    }

    return flag;
}

From my beginner's perspective, the main differences I see are:

I'm trying to understand:


Solution

  • The only significant difference is that your 1st program uses local variables and your 2nd program uses global variables for your hash tables. Global variables make your program harder to reason about as you have to audit the whole program to figure out what reads or writes to any of those 3 global variables. Global variables make your code harder to test.

    If you have other uses for the hash table then it would make sense to keep it around rather than making it an implementation detail of containsDuplicate(). It costs O(n) to build the hash table but only O(1) to check if a new value is already present. Neither program supports this currently.

    Don't cast the void * from malloc(). It may hide type issues.

    Prefer sizeof *elem to sizeof(hash_table). The former is purely mechanical and it leads to less duplication.