Parallelising CCITT-CRC16 over via CUDA

I have been tasked by my boss to convert a sequential CRC16 algorithm that runs on the CPU into something that can run on the GPU via CUDA (that isn't just running the sequential algorithm in a single GPU thread).

Prior to this I had no CRC knowledge. Since then I have identified the CRC algorithm as CCITT-CRC16 reversed (0x8408 polynomial) and understood how the sequential algorithm works. The problem is that to calculate the CRC of element N you need to know the CRC of N-1 which isn't parallelisable.

I've read online that you can convert the sequential algorithm into a parallel agorithm by processing "partial CRC" values for each element of data and then doing a reduction to get all of them into one single CRC shift that you apply the initial CRC value to, which in our case is 0xFFFF. But the actual math behind it goes way over my head, and I have no idea how to implement it in code.

Can someone show me an example of how this is done for the reversed CCITT-CRC16 as an algorithm or code?

Thanks in advance.

Solution

The code below in C is generated by crcany for the CCITT CRC-16 (also known as the KERMIT CRC). crcany also generates fast software CRCs, which you may want to adapt to run on each GPU. You would then use crc16kermit_comb() to combine the CRCs generated in parallel in each GPU. crc16kermit_comb(crc1, crc2, len2) will return the combined CRC of sequential blocks, where crc1 is the CRC of the first block, and crc2 is the CRC of the second block, and len2 is the length of the second block. That can be repeated, using the returned CRC as crc1, and the providing the CRC and length of the third block. And so on.

#include <stdint.h>

static uint16_t multmodp(uint16_t a, uint16_t b) {
    uint16_t prod = 0;
    for (;;) {
        if (a & 0x8000) {
            prod ^= b;
            if ((a & 0x7fff) == 0)
                break;
        }
        a <<= 1;
        b = b & 1 ? (b >> 1) ^ 0x8408 : b >> 1;
    }
    return prod;
}

static uint16_t const table_comb[] = {
    0x4000, 0x2000, 0x0800, 0x0080, 0x8408, 0x0cec, 0x861d, 0x3f75, 0x9471, 0x3fc8,
    0x236c, 0x0abf, 0x7955, 0x3811, 0x1a22
};

static uint16_t x8nmodp(uintmax_t n) {
    uint16_t xp = 0x8000;
    int k = 3;
    for (;;) {
        if (n & 1)
            xp = multmodp(table_comb[k], xp);
        n >>= 1;
        if (n == 0)
            break;
        if (++k == 15)
            k = 0;
    }
    return xp;
}

uint16_t crc16kermit_comb(uint16_t crc1, uint16_t crc2, uintmax_t len2) {
    return multmodp(x8nmodp(len2), crc1) ^ crc2;
}

If the lengths of the blocks are the same (except perhaps the last block), then you can precompute x8nmodp(len2) to save some time.

By the way, the function name multmodp(a, b) means multiply the polynomials a and b modulo the CRC polynomial, and the function name x8nmodp(n) means compute the polynomial x⁸ⁿ modulo the CRC polynomial. 8n is how many bits crc1 is shifted by.