
US-ASCII string (de-)compression into/from a byte array (7 bits/character)


As we all know, ASCII uses 7 bits to encode each character, so the number of bytes needed to represent a text should be smaller than the number of characters in it.

For example:

    StringBuilder text = new StringBuilder();
    IntStream.range(0, 160).forEach(x -> text.append("a")); // generate a 160-character text
    int letters = text.length();
    int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
    System.out.println(letters); // expected  160,  actual 160
    System.out.println(bytes); //   expected  140,  actual 160

The result is always letters = bytes, but I expected letters > bytes.

The main problem: in the SMPP protocol, an SMS body must be <= 140 bytes. With 7-bit ASCII encoding, 160 characters (= 140 * 8 / 7) should fit into that, so I would like to encode the text in 7-bit ASCII. We are using the JSMPP library.

Can anyone explain this to me and point me in the right direction? Thanks in advance (:


Solution

  • Here is a quick & dirty solution without any libraries, i.e. using only what the JRE provides. It is not optimised for efficiency and does not check whether the message is indeed US-ASCII, it just assumes so. It is only a proof of concept:

    package de.scrum_master.stackoverflow;
    
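    import java.util.Arrays;
    import java.util.BitSet;
    
    public class ASCIIConverter {
      public byte[] compress(String message) {
        BitSet bits = new BitSet(message.length() * 7);
        int currentBit = 0;
        for (char character : message.toCharArray()) {
          // Copy the 7 low-order bits of each character, least significant bit first
          for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
            if ((character & 1 << bitInCharacter) > 0)
              bits.set(currentBit);
            currentBit++;
          }
        }
        // BitSet.toByteArray() omits trailing zero bytes, so pad to the full packed length
        return Arrays.copyOf(bits.toByteArray(), (message.length() * 7 + 7) / 8);
      }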
    
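      public String decompress(byte[] compressedMessage) {
        BitSet bits = BitSet.valueOf(compressedMessage);
        // Largest multiple of 7 bits that fits into the given bytes
        int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
        StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
        for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
          // Rebuild the character bit by bit; this also works for all-zero groups,
          // for which BitSet.toByteArray() would return an empty array
          int character = 0;
          for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
            if (bits.get(currentBit + bitInCharacter))
              character |= 1 << bitInCharacter;
          }
          decompressedMessage.append((char) character);
        }
        // If the byte count is a multiple of 7, the last 7 bits may be padding that
        // decodes to NUL. Assumption: real messages never end with NUL, so drop it.
        int length = decompressedMessage.length();
        if (length > 0 && compressedMessage.length % 7 == 0
            && decompressedMessage.charAt(length - 1) == 0)
          decompressedMessage.setLength(length - 1);
        return decompressedMessage.toString();
      }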
    
      public static void main(String[] args) {
        String[] messages = {
          "Hello world!",
          "This is my message.\n\tAnd this is indented!",
          " !\"#$%&'()*+,-./0123456789:;<=>?\n"
            + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
            + "`abcdefghijklmnopqrstuvwxyz{|}~",
          "1234567890123456789012345678901234567890"
            + "1234567890123456789012345678901234567890"
            + "1234567890123456789012345678901234567890"
            + "1234567890123456789012345678901234567890"
        };
    
        ASCIIConverter asciiConverter = new ASCIIConverter();
        for (String message : messages) {
          System.out.println(message);
          System.out.println("--------------------------------");
          byte[] compressedMessage = asciiConverter.compress(message);
          System.out.println("Number of ASCII characters = " + message.length());
          System.out.println("Number of compressed bytes = " + compressedMessage.length);
          System.out.println("--------------------------------");
          System.out.println(asciiConverter.decompress(compressedMessage));
          System.out.println("\n");
        }
      }
    }
    

    The console log looks like this:

    Hello world!
    --------------------------------
    Number of ASCII characters = 12
    Number of compressed bytes = 11
    --------------------------------
    Hello world!
    
    
    This is my message.
        And this is indented!
    --------------------------------
    Number of ASCII characters = 42
    Number of compressed bytes = 37
    --------------------------------
    This is my message.
        And this is indented!
    
    
     !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    `abcdefghijklmnopqrstuvwxyz{|}~
    --------------------------------
    Number of ASCII characters = 97
    Number of compressed bytes = 85
    --------------------------------
     !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    `abcdefghijklmnopqrstuvwxyz{|}~
    
    
    1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
    --------------------------------
    Number of ASCII characters = 160
    Number of compressed bytes = 140
    --------------------------------
    1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
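
The byte counts in the console log follow from simple arithmetic: n characters at 7 bits each need 7n bits, rounded up to whole bytes. Here is a minimal sketch of that calculation. The class name `SeptetMath` and both method names are made up for illustration; they are not part of JSMPP or of the converter above.

```java
public class SeptetMath {
  // Bytes needed to hold the given number of 7-bit characters (rounded up)
  static int packedLength(int characters) {
    return (characters * 7 + 7) / 8;
  }

  // Maximum number of 7-bit characters that fit into the given number of bytes
  static int maxCharacters(int bytes) {
    return bytes * 8 / 7;
  }

  public static void main(String[] args) {
    // Same message lengths as in the console log above
    for (int characters : new int[] {12, 42, 97, 160}) {
      System.out.println(characters + " characters -> " + packedLength(characters) + " bytes");
    }
    // The SMPP limit from the question: 140 bytes hold 160 characters
    System.out.println("140 bytes -> " + maxCharacters(140) + " characters");
  }
}
```

For 12, 42, 97 and 160 characters this gives 11, 37, 85 and 140 bytes, matching the log, and 140 bytes give room for 160 characters, which is the limit mentioned in the question.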