cbit-manipulationbitwise-operators

Compress string using bits Operation in C


I want to reduce memory storage of chars using Bitwise Operations,

for Example

input: {"ACGT"}

output: {"0xE4"}

where we represtnt the number in binary then into Hexadecimal

if A=00, C=01, G=10, T=11

so ACGT = 0xE4 = 11100100b

I cant Figure the whole way so here is what I did So Far

enum NucleicAc {
    A = 0,
    C = 1,
    G = 2,
    T = 3,

} ;


struct _DNA {
    char * seq ;
};

DNAString DSCreate(char * mseq) {
    
    DNAString dna = malloc(sizeof(struct _DNA));
    if (!dna){
       exit(-1);
    }
    
    const int length = strlen(mseq);
    
    
  
    //  left shift will create the base , in the case of ACGT --> 0000,0000  
    int base   = 0 << length * 2;
    //loop and set bits
    int i = 0 ;
    int k = 0 ; // the counter where i want to modify the current bit
    //simple looping gor the string 
    for ( ; i < length ; i++ ) {
        switch (*(mseq+i)) {
            case 'A': // 00
                k++;
                break;
            case 'C': // 0 1
                  modifyBit(&base, k, 1);
                  k++;
                break;
                
            case 'G': //10
                k++;
                modifyBit(&base,k , 1);
                break;
                
            case 'T': // 11
                modifyBit(&base, k, 1);
                k++;
                modifyBit(&base, k,1);
                break;
            default:
                break;
                
        } //end of switch
        k++;
     
    }//end of for
    

    char * generatedSeq ;

    //convert the base to hex ??

    return dna;
}

void bin(unsigned n){
    unsigned i;
    for (i = 1 << 7; i > 0; i = i / 2){
        (n & i) ? printf("1") : printf("0");
    }
}

and if we print the base, the value is 11100100b as expected,

how to store the hexadecimal representation as String to the char *mseq in the struct ? any direction or exact solutions or is there any better approach?

and also later i want to get the letter using only the index for Example

DSGet(dsStruct , '0')--> will return 'A' hence dsStruct contains the "ACGT" before Encode?


Solution

  • There are several ways you can approach encoding the sequence. Your enum is fine, but for your encoded sequence, a struct that captures the bytes as unsigned char, the original sequence length and the encoded size in bytes will allow an easy decoding. You will get 4-to-1 compression (plus 1 byte if you sequence isn't evenly divisible by 4). The enum and struct could be:

    enum { A, C, G, T };
    
    typedef struct {
      unsigned char *seq;
      size_t len, size;
    } encoded;
    

    To map characters in your string to the encoded values a simple function that returns the enum value matching the character is all you need (don't forget to handle any errors)

    /* convert character to encoded value */
    unsigned char getencval (const char c)
    {
      if (c == 'A')
        return A;
      else if (c == 'C')
        return C;
      else if (c == 'G')
        return G;
      else if (c == 'T')
        return T;
      
      /* exit on anything other than A, C, G, T */
      fprintf (stderr, "error: invalid sequence character '%c'\n", c);
      exit (EXIT_FAILURE);
    }
    

    To encode the original sequence, you will populate the encoded struct with the original length (len) and number of bytes (size) needed to hold the encoded string. Don't forget that any 1-character of the next 4-characters will require another byte of storage. You can use a simple add-and-divide to account for any partial 4-character ending portion of the sequence, e.g.

    /* encode sequence of characters as 2-bit pairs (4-characters per-byte)
     * returns encoded struct with allocated .seq member, on failure the .seq
     * member is NULL. User is resposible for freeing .seq member when done.
     */
    encoded encode_seq (const char *seq)
    {
      size_t  len = strlen(seq),
              size = (len + 3) / 4;             /* integer division intentional */
      encoded enc = { .seq = calloc (1, size),  /* encoded sequence struct */
                      .len = len,
                      .size = size };
      
      if (!enc.seq) { /* validate allication */
        perror ("calloc-enc.seq");
        return enc;
      }
      
      /* loop over each char (i) with byte index (ndx) 
       * shifting each 2-bit pair by (shift * 2) amount.
       */
      for (int i = 0, ndx = 0, shift = 4; seq[i] && seq[i] != '\n'; i++) {
        if (!shift--)           /* decrement shift, reset if 0 */
          shift = 3;
        if (i && i % 4 == 0)    /* after each 4th char, increment ndx */
          ndx += 1;
        
        /* shift each encoded value (multiply by 2 for shift of 6, 4, 2, 0) */
        enc.seq[ndx] |= getencval (seq[i]) << shift * 2;
      }
      
      return enc;   /* return encoded struct with allocated .seq member */
    }
    

    To get the original sequence back from your encoded struct, the use of a lookup table (shown below with the full code) makes it a breeze. You simply loop over all stored byte values appending the corresponding strings from the lookup table until the final byte. For the final byte, you need to determine if it is a partial string and, if so, how many characters remain to copy. (that's why you store the original sequence length in your struct). Then simply use strncat to append that many characters from the final byte, e.g.

    /* decodes encoded sequence. Allocates storage for decoded sequence
     * and loops over each encoded byte using lookup-table to obtain
     * original 4-character string from byte value. User is responsible
     * for freeing returned string when done. Returns NULL on allocation
     * failure.
     */
    char *decode_seq (encoded *eseq)
    {
      char *seq = malloc (eseq->len + 1);   /* allocate storage for sequence */
      size_t i = 0, offset = 0, remain;
      
      if (!seq) { /* validate allocation */
        perror ("malloc-seq");
        return NULL;
      }
      
      /* loop appending strings from lookup table for all but last byte */
      for (; i < eseq->size - 1; i++) {
        memcpy (seq + offset, lookup[eseq->seq[i]], 4);
        offset += 4;  /* increment offset by 4 */
      }
      
      /* determine the number of characters in last byte */
      remain = eseq->len - (eseq->size - 1) * 4;
      memcpy (seq + offset, lookup[eseq->seq[i]], remain);
      seq[offset + remain] = 0;   /* nul-terminate seq */
      
      return seq;     /* return allocated sequence */
    }
    

    Adding the lookup table and putting all the pieces together, one way to approach this problem is:

    (edit: lookup table reordered to match your byte-value encoding, optimized decode_seq() to not scan for end-of-string on copy)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    enum { A, C, G, T };
    
    typedef struct {
      unsigned char *seq;
      size_t len, size;
    } encoded;
    
    const char lookup[][5] = {
        "AAAA","CAAA","GAAA","TAAA","ACAA","CCAA","GCAA","TCAA",
        "AGAA","CGAA","GGAA","TGAA","ATAA","CTAA","GTAA","TTAA",
        "AACA","CACA","GACA","TACA","ACCA","CCCA","GCCA","TCCA",
        "AGCA","CGCA","GGCA","TGCA","ATCA","CTCA","GTCA","TTCA",
        "AAGA","CAGA","GAGA","TAGA","ACGA","CCGA","GCGA","TCGA",
        "AGGA","CGGA","GGGA","TGGA","ATGA","CTGA","GTGA","TTGA",
        "AATA","CATA","GATA","TATA","ACTA","CCTA","GCTA","TCTA",
        "AGTA","CGTA","GGTA","TGTA","ATTA","CTTA","GTTA","TTTA",
        "AAAC","CAAC","GAAC","TAAC","ACAC","CCAC","GCAC","TCAC",
        "AGAC","CGAC","GGAC","TGAC","ATAC","CTAC","GTAC","TTAC",
        "AACC","CACC","GACC","TACC","ACCC","CCCC","GCCC","TCCC",
        "AGCC","CGCC","GGCC","TGCC","ATCC","CTCC","GTCC","TTCC",
        "AAGC","CAGC","GAGC","TAGC","ACGC","CCGC","GCGC","TCGC",
        "AGGC","CGGC","GGGC","TGGC","ATGC","CTGC","GTGC","TTGC",
        "AATC","CATC","GATC","TATC","ACTC","CCTC","GCTC","TCTC",
        "AGTC","CGTC","GGTC","TGTC","ATTC","CTTC","GTTC","TTTC",
        "AAAG","CAAG","GAAG","TAAG","ACAG","CCAG","GCAG","TCAG",
        "AGAG","CGAG","GGAG","TGAG","ATAG","CTAG","GTAG","TTAG",
        "AACG","CACG","GACG","TACG","ACCG","CCCG","GCCG","TCCG",
        "AGCG","CGCG","GGCG","TGCG","ATCG","CTCG","GTCG","TTCG",
        "AAGG","CAGG","GAGG","TAGG","ACGG","CCGG","GCGG","TCGG",
        "AGGG","CGGG","GGGG","TGGG","ATGG","CTGG","GTGG","TTGG",
        "AATG","CATG","GATG","TATG","ACTG","CCTG","GCTG","TCTG",
        "AGTG","CGTG","GGTG","TGTG","ATTG","CTTG","GTTG","TTTG",
        "AAAT","CAAT","GAAT","TAAT","ACAT","CCAT","GCAT","TCAT",
        "AGAT","CGAT","GGAT","TGAT","ATAT","CTAT","GTAT","TTAT",
        "AACT","CACT","GACT","TACT","ACCT","CCCT","GCCT","TCCT",
        "AGCT","CGCT","GGCT","TGCT","ATCT","CTCT","GTCT","TTCT",
        "AAGT","CAGT","GAGT","TAGT","ACGT","CCGT","GCGT","TCGT",
        "AGGT","CGGT","GGGT","TGGT","ATGT","CTGT","GTGT","TTGT",
        "AATT","CATT","GATT","TATT","ACTT","CCTT","GCTT","TCTT",
        "AGTT","CGTT","GGTT","TGTT","ATTT","CTTT","GTTT","TTTT"};
    
    /* convert character to encoded value */
    unsigned char getencval (const char c)
    {
      if (c == 'A')
        return A;
      else if (c == 'C')
        return C;
      else if (c == 'G')
        return G;
      else if (c == 'T')
        return T;
      
      /* exit on anything other than A, C, G, T */
      fprintf (stderr, "error: invalid sequence character '%c'\n", c);
      exit (EXIT_FAILURE);
    }
    
    /* encode sequence of characters as 2-bit pairs (4-characters per-byte)
     * returns encoded struct with allocated .seq member, on failure the .seq
     * member is NULL. User is resposible for freeing .seq member when done.
     */
    encoded encode_seq (const char *seq)
    {
      size_t  len = strlen(seq),
              size = (len + 3) / 4;             /* integer division intentional */
      encoded enc = { .seq = calloc (1, size),  /* encoded sequence struct */
                      .len = len,
                      .size = size };
      
      if (!enc.seq) { /* validate allication */
        perror ("calloc-enc.seq");
        return enc;
      }
      
      /* loop over each char (i) with byte index (ndx) 
       * shifting each 2-bit pair by (shift * 2) amount.
       */
      for (int i = 0, ndx = 0, shift = 0; seq[i] && seq[i] != '\n'; i++, shift++) {
        if (shift == 4)         /* reset to 0 */
          shift = 0;
        if (i && i % 4 == 0)    /* after each 4th char, increment ndx */
          ndx += 1;
        
        /* shift each encoded value (multiply by 2 for shift of 0, 2, 4, 6) */
        enc.seq[ndx] |= getencval (seq[i]) << shift * 2;
      }
      
      return enc;   /* return encoded struct with allocated .seq member */
    }
    
    /* decodes encoded sequence. Allocates storage for decoded sequence
     * and loops over each encoded byte using lookup-table to obtain
     * original 4-character string from byte value. User is responsible
     * for freeing returned string when done. Returns NULL on allocation
     * failure.
     */
    char *decode_seq (encoded *eseq)
    {
      char *seq = malloc (eseq->len + 1);   /* allocate storage for sequence */
      size_t i = 0, offset = 0, remain;
      
      if (!seq) { /* validate allocation */
        perror ("malloc-seq");
        return NULL;
      }
      
      /* loop appending strings from lookup table for all but last byte */
      for (; i < eseq->size - 1; i++) {
        memcpy (seq + offset, lookup[eseq->seq[i]], 4);
        offset += 4;  /* increment offset by 4 */
      }
      
      /* determine the number of characters in last byte */
      remain = eseq->len - (eseq->size - 1) * 4;
      memcpy (seq + offset, lookup[eseq->seq[i]], remain);
      seq[offset + remain] = 0;   /* nul-terminate seq */
      
      return seq;     /* return allocated sequence */
    }
    
    /* short example program that takes string to encode as 1st argument
     * using "ACGT" if no argument is provided by default
     */
    int main (int argc, char **argv) {
      
      char *seq = NULL;
      encoded enc = encode_seq(argc > 1 ? argv[1] : "ACGT");
      
      if (!enc.seq)   /* validate encoded allocation */
        return 1;
      
      /* output original string, length and encoded size */
      printf ("encoded str  : %s\nencoded len  : %zu\nencoded size : %zu\n", 
              argc > 1 ? argv[1] : "ACGT", enc.len, enc.size);
      
      /* loop outputting byte-values of encoded string */
      fputs ("encoded seq  :", stdout);
      for (size_t i = 0; i < enc.size; i++)
         printf (" 0x%02x", enc.seq[i]);
      putchar ('\n');
      
      seq = decode_seq (&enc);              /* decode seq from byte values */
      printf ("decoded seq  : %s\n", seq);  /* output decoded string */
      
      free (seq);         /* don't forget to free what you allocated */
      free (enc.seq);
    }
    
    

    In most cases a lookup-table provides a great deal of efficiency advantage compared to computing and building each 4-character string during decoding. This is enhanced by the lookup table staying resident in cache for most cases.

    The length of the DNA sequence you can encode and decode is limited only by the amount of virtual memory you have available.

    Example Use/Output

    The program takes the sequence to encode and decode as the first argument (default "ACGT"). So the default output is:

    $ ./bin/dnaencodedecode
    encoded str  : ACGT
    encoded len  : 4
    encoded size : 1
    encoded seq  : 0xe4
    decoded seq  : ACGT
    

    4-byte encoded in 1-byte. Note the byte value of 0x1b and not 0xe4 due to the table ordering.

    A longer example:

    ./bin/dnaencodedecode ACGTGGGTCAGACTTA
    encoded str  : ACGTGGGTCAGACTTA
    encoded len  : 16
    encoded size : 4
    encoded seq  : 0xe4 0xea 0x21 0x3d
    decoded seq  : ACGTGGGTCAGACTTA
    

    16-character encoded in 4-bytes.

    Finally, what of a sequence that isn't divisible by 4 so you have a partial number of characters in the last encoded byte? That is handled as well, e.g.

    $ ./bin/dnaencodedecode ACGTGGGTCAGACTTAG
    encoded str  : ACGTGGGTCAGACTTAG
    encoded len  : 17
    encoded size : 5
    encoded seq  : 0xe4 0xea 0x21 0x3d 0x02
    decoded seq  : ACGTGGGTCAGACTTAG
    

    17 characters encoded in 5-bytes. (not the pure 4-to-1 compression, but as the sequence size increases, the significance of any partial group of characters in the last byte becomes negligible)

    As far a perfomance, for a sequence of 100,000 characters and output of the the byte values and strings replaced with a simple loop that compares the decoded seq to the original argv[1] it only takes a few thousandth of a second (on an old i7 Gen2 laptop with SSD) to encode and decode and validate, e.g.

    $ 
    time ./bin/dnaencodedecodet2big $(< dat/dnaseq100k.txt)
    encoded len  : 100000
    encoded size : 25000
    all tests passed
    
    real    0m0.014s
    user    0m0.012s
    sys     0m0.003s
    

    There are a lot of ways to do this, but given your description, this was what came to my mind that you were trying to accomplish. There is a lot here, so take your time going through it.

    Look things over (the code is commented), and let me know if you have further questions. Just drop a comment below.