Discrete Wavelet Transform integer Daub 5/3 lifting issue

I'm trying to run an integer-to-integer lifting 5/3 on an image of lena. I've been following the paper "A low-power Low-memory system for wavelet-based image compression" by Walker, Nguyen, and Chen (Link active as of 7 Oct 2015).

I'm running into issues though. The image just doesn't seem to come out quite right. I appear to be overflowing slightly in the green and blue channels which means that subsequent passes of the wavelet function find high frequencies where there ought not to be any. I'm also pretty sure I'm getting something else wrong as I am seeing a line of the s0 image at the edges of the high frequency parts.

My function is as follows:

bool PerformHorizontal( Col24* pPixelsIn, Col24* pPixelsOut, int width, int pixelPitch, int height )
{
    const int widthDiv2 = width / 2;
    int y   = 0;
    while( y < height )
    {
        int x = 0;
        while( x < width )
        {
            const int n     = (x)       + (y * pixelPitch);
            const int n2    = (x / 2)   + (y * pixelPitch);

            const int s     = n2;
            const int d     = n2 + widthDiv2;

            // Non-lifting 5 / 3
            /*pPixelsOut[n2 + widthDiv2].r  = pPixelsIn[n + 2].r - ((pPixelsIn[n + 1].r + pPixelsIn[n + 3].r) / 2) + 128;
            pPixelsOut[n2].r                = ((4 * pPixelsIn[n + 2].r) + (2 * pPixelsIn[n + 2].r) + (2 * (pPixelsIn[n + 1].r + pPixelsIn[n + 3].r)) - (pPixelsIn[n + 0].r + pPixelsIn[n + 4].r)) / 8;

            pPixelsOut[n2   + widthDiv2].g  = pPixelsIn[n + 2].g - ((pPixelsIn[n + 1].g + pPixelsIn[n + 3].g) / 2) + 128;
            pPixelsOut[n2].g                = ((4 * pPixelsIn[n + 2].g) + (2 * pPixelsIn[n + 2].g) + (2 * (pPixelsIn[n + 1].g + pPixelsIn[n + 3].g)) - (pPixelsIn[n + 0].g + pPixelsIn[n + 4].g)) / 8;

            pPixelsOut[n2   + widthDiv2].b  = pPixelsIn[n + 2].b - ((pPixelsIn[n + 1].b + pPixelsIn[n + 3].b) / 2) + 128;
            pPixelsOut[n2].b                = ((4 * pPixelsIn[n + 2].b) + (2 * pPixelsIn[n + 2].b) + (2 * (pPixelsIn[n + 1].b + pPixelsIn[n + 3].b)) - (pPixelsIn[n + 0].b + pPixelsIn[n + 4].b)) / 8;*/

            pPixelsOut[d].r = pPixelsIn[n + 1].r    - (((pPixelsIn[n].r         + pPixelsIn[n + 2].r)   >> 1) + 127);
            pPixelsOut[s].r = pPixelsIn[n].r        + (((pPixelsOut[d - 1].r    + pPixelsOut[d].r)      >> 2) - 64);

            pPixelsOut[d].g = pPixelsIn[n + 1].g    - (((pPixelsIn[n].g         + pPixelsIn[n + 2].g)   >> 1) + 127);
            pPixelsOut[s].g = pPixelsIn[n].g        + (((pPixelsOut[d - 1].g    + pPixelsOut[d].g)      >> 2) - 64);

            pPixelsOut[d].b = pPixelsIn[n + 1].b    - (((pPixelsIn[n].b         + pPixelsIn[n + 2].b)   >> 1) + 127);
            pPixelsOut[s].b = pPixelsIn[n].b        + (((pPixelsOut[d - 1].b    + pPixelsOut[d].b)      >> 2) - 64);

            x += 2;
        }
        y++;
    }
    return true;
}

There is definitely something wrong but I just can't figure it out. Can anyone with slightly more brain than me point out where I am going wrong? Its worth noting that you can see the un-lifted version of the Daub 5/3 above the working code and this, too, give me the same artifacts ... I'm very confused as I have had this working once before (It was over 2 years ago and I no longer have that code).

Any help would be much appreciated :)

Edit: I appear to have eliminated my overflow issues by clamping the low pass pixels to the 0 to 255 range. I'm slightly concerned this isn't the right solution though. Can anyone comment on this?

Solution

OK I can losslessly forward then inverse as long as I store my post forward transform data in a short. Obviously this takes up a little more space than I was hoping for but this does allow me a good starting point for going into the various compression algorithms. You can also, nicely, compress 2 4 component pixels at a time using SSE2 instructions. This is the standard C forward transform I came up with:

        const int16_t dr    = (int16_t)pPixelsIn[n + 1].r   - ((((int16_t)pPixelsIn[n].r        + (int16_t)pPixelsIn[n + 2].r)  >> 1));
        const int16_t sr    = (int16_t)pPixelsIn[n].r       + ((((int16_t)pPixelsOut[d - 1].r   + dr)                           >> 2));

        const int16_t dg    = (int16_t)pPixelsIn[n + 1].g   - ((((int16_t)pPixelsIn[n].g        + (int16_t)pPixelsIn[n + 2].g)  >> 1));
        const int16_t sg    = (int16_t)pPixelsIn[n].g       + ((((int16_t)pPixelsOut[d - 1].g   + dg)                           >> 2));

        const int16_t db    = (int16_t)pPixelsIn[n + 1].b   - ((((int16_t)pPixelsIn[n].b        + (int16_t)pPixelsIn[n + 2].b)  >> 1));
        const int16_t sb    = (int16_t)pPixelsIn[n].b       + ((((int16_t)pPixelsOut[d - 1].b   + db)                           >> 2));

        pPixelsOut[d].r = dr;
        pPixelsOut[s].r = sr;

        pPixelsOut[d].g = dg;
        pPixelsOut[s].g = sg;

        pPixelsOut[d].b = db;
        pPixelsOut[s].b = sb;

It is trivial to create the inverse of this (A VERY simple bit of algebra). Its worth noting, btw, that you need to inverse the image from right to left bottom to top. I'll next see if I can shunt this data into uint8_ts and lost a bit or 2 of accuracy. For compression this really isn't a problem.