pythonprobabilitychecksumcrc32hash-collision

What is the CRC32 Collision probability of All possible ASCII strings of variable length ranging from 1 to 7


I am trying to find the CRC32 collision probability among all possible ASCII strings of variable length ranging from 1 to 7. Here an ASCII character can range from ASCII 32 to ASCII 126. I tried to run my program in my PC, but due to high requirement of CPU, the program would crash and never run to completion. Is there any other way to find this out.

my code is as below:

import binascii
from itertools import product

def crc32_all_ascii_strings(length):
  strings = []
  for chars in product(range(32, 127), repeat=length):
    strings.append(''.join(chr(c) for c in chars))

  crc32_dict = {}
  for string in strings:
    crc32 = binascii.crc32(string.encode()) & 0xFFFFFFFF
    if crc32 not in crc32_dict:
      crc32_dict[crc32] = []
    crc32_dict[crc32].append(string)

  return crc32_dict

if __name__ == "__main__":
  for length in range(1, 8):
    crc32_dict = crc32_all_ascii_strings(length)
    for crc32, string_list in crc32_dict.items():
      if len(string_list) > 1:
        print(f"CRC32: {crc32:08X}")
        for string in string_list:
          print(f"  {string}")

Solution

  • The probability of collision for strings of length 1-4 is exactly zero. A CRC-32 is a one-to-one and onto mapping of 32 bits to 32 bits.

    For length 5, the probability of a collision is very close to 2-32. 955 is about double 232, so there is plenty of coverage of the range space. For length 6 or more, it is almost exactly 2-32, whether it's ASCII or all byte values. The domain constraint has negligible effect.

    Update:

    I ran the experiments, and for length 5, restricting the five bytes to be in 32..126, the probability of collision is about 2-32.2. That's 13% less than what the probability of collision would be if the bytes were not constrained (in 0..255), which is 2-32.

    For length 6, bytes in 32..126, we indeed get a collision probability of 2-32. It will be the same for any greater length.

    Update 2:

    I was able to calculate the probability for length 5. It is: 4111856 divided by 20546909374090625 ≈ 2-32.2184. Here is the code in C++ that does the calculation:

    // crc-hits.cpp   Calculate the probability of CRC collisions
    // Mark Adler, March 5, 2024
    
    // Count the number of times each CRC value is hit by the CRC of five input
    // bytes whose range is restricted to 32..126. Use the counts to calculate the
    // probability of collision. This uses 1.5GB of memory.
    
    #include <iostream>
    #include <vector>
    #include <bitset>
    #include <math.h>
    #include <stdint.h>
    
    // Range of byte values.
    #define LO 32
    #define HI 126
    
    // Return the greatest common divisor of a and b.
    uint64_t gcd(uint64_t a, uint64_t b) {
        while (b) {
            auto t = b;
            b = a % b;
            a = t;
        }
        return a;
    }
    
    int main() {
        // Generate a byte-wise table for calculating a reflected CRC-32, using the
        // CRC-32/ISO-HDLC polynomial.
        uint32_t table[256];
        for (unsigned i = 0; i < 256; i++) {
            uint32_t crc = i;
            for (int k = 0; k < 8; k++)
                crc = crc & 1 ? (crc >> 1) ^ 0xedb88320 : crc >> 1;
            table[i] = crc;
        }
    
        // Bit vectors with a count of the number of occurrences of each CRC value.
        // Each count is broken up across the bitsets, with one bit of the count
        // per bitset. For any CRC value x, bit k of the count is count[k][x].
        // count[] is resized as needed to be able to hold the largest count. For
        // each additional bit of a count, another 512MB bitset is required.
        std::vector<std::bitset<(1ULL << 32)>> count;
    
        // Calculate the CRCs of all five-symbol sequences (symbol in LO..HI). The
        // CRCs of subsequences are retained to avoid repeated calculations. This
        // reduces the number of table lookups by almost a factor of five.
        for (unsigned a0 = LO; a0 <= HI; a0++) {
            auto c1 = table[a0];
            for (unsigned a1 = LO; a1 <= HI; a1++) {
                auto c2 = (c1 >> 8) ^ table[(c1 ^ a1) & 0xff];
                for (unsigned a2 = LO; a2 <= HI; a2++) {
                    auto c3 = (c2 >> 8) ^ table[(c2 ^ a2) & 0xff];
                    for (unsigned a3 = LO; a3 <= HI; a3++) {
                        auto c4 = (c3 >> 8) ^ table[(c3 ^ a3) & 0xff];
                        for (unsigned a4 = LO; a4 <= HI; a4++) {
                            auto c5 = (c4 >> 8) ^ table[(c4 ^ a4) & 0xff];
    
                            // Increment the count for the value c5.
                            size_t j = 0;
                            while (j < count.size() && count[j].test(c5))
                                count[j++].reset(c5);
                            if (j == count.size())
                                // Add one more bitset (512MB!) to count.
                                count.resize(j + 1);
                            count[j].set(c5);
                        }
                    }
                }
            }
        }
    
        // Calculate and show a histogram of the hit counts.
        std::vector<unsigned> hist(1 << count.size());
        for (size_t i = 0; i < count[0].size(); i++) {
            unsigned c = 0;
            auto j = count.size();
            while (j)
                c = (c << 1) | count[--j][i];
            hist[c]++;
        }
        std::cout << "# seqs hitting same CRC: count of such CRC values\n";
        for (size_t i = 0; i < hist.size(); i++)
            if (hist[i])
                std::cout << i << ": " << hist[i] << '\n';
    
        // Calculate the probability of collision. The numerator is n, the
        // denominator is d(d-1).
        uint64_t n = 0;
        for (size_t i = 2; i < hist.size(); i++)
            n += i * (i - 1) * (uint64_t)hist[i];
        uint64_t d = HI - LO + 1;  d *= d;  d *= d;  d *= HI - LO + 1;
        auto e = d - 1;
    
        // Reduce the fraction. Simplify if the denominator fits in 64 bits.
        auto g = gcd(n, d);  n /= g;  d /= g;
        g = gcd(n, e);  n /= g;  e /= g;
        if (e <= (uint64_t)-1 / d) {
            d *= e;
            e = 1;
        }
    
        // Show the probability exactly, and as an approximate power of two.
        std::cout << "p = " << n << " / ";
        if (e == 1)
            std::cout << d;
        else
            std::cout << "(" << d << " * " << e << ")";
        std::cout << " ≈ 2^" << log2(n / (d * (double)e)) << '\n';
    }
    
    // Result:
    //    # seqs hitting same CRC: count of such CRC values
    //    0: 567451345
    //    1: 1258641855
    //    2: 1264764208
    //    3: 968752448
    //    4: 133405440
    //    5: 101952000
    //    p = 4111856 / 20546909374090625 ≈ 2^-32.2184
    //    ./crc-hits  95.90s user 0.52s system 92% cpu 1:44.57 total