I've been looking for a fast algorithm that converts a 24-bit RGB bitmap to a 16-bit (RGB565) bitmap using dithering. I want something in C/C++ where I can actually control how the dithering is applied. GDI+ seems to provide some conversion methods, but I can't tell whether they dither, and if they do, what mechanism they use (Floyd-Steinberg?).
Does anyone have a good example of bitmap color-depth conversion with dithering?
As you mentioned, Floyd-Steinberg dithering is popular because it's simple and fast. For the subtle differences between 24-bit and 16-bit color, the results will be nearly optimal visually.
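For reference, Floyd-Steinberg pushes each pixel's quantization error onto the four neighbors that haven't been processed yet, using fixed weights (here * is the pixel that was just quantized, and the scan runs left to right, top to bottom):

            *    7/16
    3/16  5/16   1/16

All sixteen sixteenths of the error get redistributed, so the overall brightness of the image is preserved; that's what the division by 16 and the 7/3/5/1 factors in the code below implement.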
It was suggested that I use the sample picture Lena, but I decided against it; despite its long history as a test image, I consider it too sexist for modern sensibilities. Instead I present a picture of my own. First up is the original, followed by the conversion to dithered RGB565 (and converted back to 24-bit for display).
And the code, in C++:
#include <vector>

// BYTE is the Windows typedef for unsigned char; if you aren't including
// <windows.h>, "typedef unsigned char BYTE;" works just as well.

inline BYTE Clamp(int n)
{
    n = n > 255 ? 255 : n;
    return n < 0 ? 0 : n;
}

// Accumulated error for one pixel position, one component per channel.
struct RGBTriplet
{
    int r;
    int g;
    int b;
    RGBTriplet(int _r = 0, int _g = 0, int _b = 0) : r(_r), g(_g), b(_b) {}
};

void RGB565Dithered(const BYTE * pIn, int width, int height, int strideIn, BYTE * pOut, int strideOut)
{
    // Errors diffused onto the row below; the buffers are two entries wider
    // than the image so the writes at x, x+1 and x+2 (and the read at x+1)
    // never need bounds checks.
    std::vector<RGBTriplet> oldErrors(width + 2);
    for (int y = 0; y < height; ++y)
    {
        std::vector<RGBTriplet> newErrors(width + 2);
        RGBTriplet errorAhead;  // error pushed onto the next pixel in this row (weight 7)
        for (int x = 0; x < width; ++x)
        {
            // Add the diffused error to the source pixel. The stored errors are
            // pre-multiplied by their Floyd-Steinberg weights, so a single
            // division by 16 happens here.
            int b = (int)(unsigned int)pIn[3*x] + (errorAhead.b + oldErrors[x+1].b) / 16;
            int g = (int)(unsigned int)pIn[3*x + 1] + (errorAhead.g + oldErrors[x+1].g) / 16;
            int r = (int)(unsigned int)pIn[3*x + 2] + (errorAhead.r + oldErrors[x+1].r) / 16;
            // Quantize to 5/6/5 bits per channel.
            int bAfter = Clamp(b) >> 3;
            int gAfter = Clamp(g) >> 2;
            int rAfter = Clamp(r) >> 3;
            // Pack and store as a little-endian RGB565 pixel.
            int pixel16 = (rAfter << 11) | (gAfter << 5) | bAfter;
            pOut[2*x] = (BYTE) pixel16;
            pOut[2*x + 1] = (BYTE) (pixel16 >> 8);
            // Compute the quantization error against the reconstructed 8-bit value
            // and distribute it with the Floyd-Steinberg weights:
            // 7 (right), 3 (below left), 5 (below), 1 (below right).
            int error = r - ((rAfter * 255) / 31);
            errorAhead.r = error * 7;
            newErrors[x].r += error * 3;
            newErrors[x+1].r += error * 5;
            newErrors[x+2].r = error * 1;
            error = g - ((gAfter * 255) / 63);
            errorAhead.g = error * 7;
            newErrors[x].g += error * 3;
            newErrors[x+1].g += error * 5;
            newErrors[x+2].g = error * 1;
            error = b - ((bAfter * 255) / 31);
            errorAhead.b = error * 7;
            newErrors[x].b += error * 3;
            newErrors[x+1].b += error * 5;
            newErrors[x+2].b = error * 1;
        }
        pIn += strideIn;
        pOut += strideOut;
        oldErrors.swap(newErrors);
    }
}
I won't guarantee this code is perfect; I already had to fix one of those subtle errors that I alluded to in another comment. However, it did generate the results above. It takes 24-bit pixels in BGR order, as used by Windows, and produces R5G6B5 16-bit pixels in little-endian order.
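If it helps, here's a minimal sketch of how the function might be called. The ConvertExample wrapper and the tightly packed buffers are just for illustration; a real Windows DIB pads each row to a multiple of 4 bytes, so in practice the strides would come from the bitmap itself.

#include <vector>

// Hypothetical wrapper, just to show the calling convention: tightly packed
// BGR24 in, tightly packed little-endian RGB565 out.
void ConvertExample(const BYTE * bgr24, int width, int height, std::vector<BYTE> & out565)
{
    int strideIn  = width * 3;   // bytes per source row (no row padding assumed)
    int strideOut = width * 2;   // bytes per destination row
    out565.resize(strideOut * height);
    RGB565Dithered(bgr24, width, height, strideIn, out565.data(), strideOut);
}

Adjust the strides if your rows are padded or if you're converting in place from a larger buffer.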