c compression lzw

How do I get the LZW encoding results shown in the example?


Using the input from Example 1 at http://michael.dipperstein.com/lzw/#example1, I cannot reproduce the expected output:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "lzw.h"

void print_hex(unsigned char str[], int len) 
{
    int idx;

    for (idx = 0; idx < len; idx++)
        printf("%02x", str[idx]);
}

int main(void)
{
    FILE *fpIn;             /* pointer to open input file */
    FILE *fpOut;            /* pointer to open output file */
    FILE *fptr;

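    char test_str_lzw[] = "this_is_his_thing";

    /* Write the test string to the input file. */
    if ((fptr = fopen("lzw_in_test.txt", "wb")) == NULL) {
        perror("lzw_in_test.txt");
        exit(1);
    }
    fwrite(test_str_lzw, sizeof(char), strlen(test_str_lzw), fptr);
    fclose(fptr);

    /* Reopen it for reading and create the file for the encoded output. */
    fpIn = fopen("lzw_in_test.txt", "rb");
    fpOut = fopen("lzw_out.txt", "wb");
    if (fpIn == NULL || fpOut == NULL) {
        perror("fopen");
        exit(1);
    }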

    LZWEncodeFile(fpIn, fpOut);

    fclose(fpIn);
    fclose(fpOut);

    /* Read the encoded bytes back from the output file. */
    if ((fptr = fopen("lzw_out.txt", "rb")) == NULL) {
        fprintf(stderr, "Error opening lzw_out.txt\n");
        exit(1);
    }

    unsigned char lzw_out[256];
    memset(lzw_out, 0, 256);

    size_t num;
    num = fread(lzw_out, sizeof(unsigned char), 256, fptr);


    fclose(fptr);

    unsigned int lzw_size = (unsigned int)num;
    printf("LZW out size: %u\n", lzw_size);
    printf("LZW out data: \n");
    print_hex(lzw_out, lzw_size);
    printf("\n");

    return(0);
}

Expected result in Hex:

0x74 0x68 0x69 0x73 0x5F 0x102 0x5F 0x101 0x103 0x100 0x69 0x6E 0x67

Result I'm getting in Hex:

0x74 0x34 0x1A 0x4E 0x65 0xF0 0x15 0x7C 0x03 0x03 0x80 0x5A 0x4D 0xC6 0x70 0x20

How can I get an output file that matches the example?

Regards.


Solution

  • The example lists the 9-bit code words themselves, whereas the LZW encoder writes those code words packed into a sequence of 8-bit bytes, so the two cannot be compared byte for byte. The encoder also has a mechanism for signalling an increase in code word length when required, but short inputs such as this one do not need it, so it is ignored here for simplicity.

    For OP's example, the first eight 9-bit code words are (in hex):

    0x74 0x68 0x69 0x73 0x5F 0x102 0x5F 0x101

    or expressed in binary:

    001110100 001101000 001101001 001110011 001011111 100000010 001011111 100000001

    The encoder splits each 9-bit code word into two groups: bits 7 to 0 first, followed by bit 8:

    01110100 0 01101000 0 01101001 0 01110011 0 01011111 0 00000010 1 01011111 0 00000001 1

    The resulting bit stream is then regrouped into 8-bit bytes:

    01110100 00110100 00011010 01001110 01100101 11110000 00010101 01111100 00000011

    or expressed in hex, which matches the first nine bytes of the output shown in the question:

    0x74 0x34 0x1A 0x4E 0x65 0xF0 0x15 0x7C 0x03
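
To see the code words from the example, the packing described above can be reversed. The following is a minimal sketch, not part of lzw.h: `read_bit` and `unpack_codes9` are hypothetical helper names, and the sketch assumes a fixed 9-bit code width (so it only applies to short inputs that never trigger a width increase) and that each byte is filled starting from its most significant bit, as in the regrouping shown above.

```c
#include <stddef.h>

/* Return bit number `pos` of the stream, counting from the most
 * significant bit of bytes[0]. */
static unsigned int read_bit(const unsigned char *bytes, size_t pos)
{
    return (bytes[pos / 8] >> (7 - pos % 8)) & 1u;
}

/*
 * Hypothetical helper: unpack fixed-width 9-bit code words from a byte
 * stream in which each code word is stored as bits 7..0 followed by bit 8.
 * Assumes the code width never grows beyond 9 bits.
 * Writes at most max_codes code words and returns how many were written.
 */
size_t unpack_codes9(const unsigned char *bytes, size_t n_bytes,
                     unsigned int *codes, size_t max_codes)
{
    size_t bit_pos = 0;                 /* index of the next unread bit */
    size_t total_bits = n_bytes * 8;
    size_t count = 0;

    while (count < max_codes && bit_pos + 9 <= total_bits) {
        unsigned int low = 0;
        unsigned int high;
        int i;

        /* Bits 7..0 of the code word arrive first, high bit first. */
        for (i = 0; i < 8; i++)
            low = (low << 1) | read_bit(bytes, bit_pos++);

        /* Bit 8 of the code word follows. */
        high = read_bit(bytes, bit_pos++);

        codes[count++] = (high << 8) | low;
    }
    return count;
}
```

Calling `unpack_codes9(lzw_out, lzw_size, codes, 13)` on the 16 bytes from the question should yield the 13 code words listed as the expected result. The caller has to supply the number of code words (or otherwise know where the data ends), because this sketch does not distinguish trailing padding bits from real code words.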