javabitbitset

Java BitSet.size() is not returning the size I gave at the constructor


I have the following code:

 private BitSet arrayListToBitSet(ArrayList<Boolean> code) {
                // The problem is here, the bitset automatically is created using 64 bits even tho we are clearly 
                    // giving it a max size in constructor, and in the test we are running,
                    // we are only using 9 bits, leading to the rest of the bits being automatically 0, so we need to
                    // search for a way that the size of the bitset is exactly the size of the array list "code"
                int x = code.size(); // With the test I'm running, x = 9;
                BitSet bitset = new BitSet(x);
                System.out.println(bitset.size());
        
                for (int i = 0; i < x; i++)
                {
                    bitset.set(i, code.get(i));
                }
        
                return bitset;
        }

The System.out.println() is showing 64 on console, even tho it should show only 9.

And even when I hard coded 9 into the constructor of BitSet, it still returned that the size of the bitset is 64. As I am making a program that compress .txt files, is is crucial for me to use BitSets intead of booleans, or if anyone knows another way to manipulate single bits in java, it can also be helpful.

In the documentation of BitSet.size() just says the following

public int size()
Returns the number of bits of space actually in use by this BitSet to represent bit values. The maximum element in the set is the size - 1st element.
Returns:
the number of bits currently in this bit set 

And this really does not help much to my problem. Anyone knows the reason this happens? and how to fix it? Thanks a lot.

More context: What I am doing is an implementation of the huffman algorithm to compress .txt files given, to a new class called "CompressedFile" that is going to be stored in the pc memory.

What happens before the above method is that we count the chars in the text and save them to a hashmap, with this hashmap we then create the Huffman tree, once we have the Huffman tree, we use it for getting the huffman enconding for each one of the chars in the text, each of the code for a given char are saved in another HashMap with this value pairs:

(char, ArrayList), the code is an ArrayList of booleans for now because of the problem with the BitSets.

Once we have this, we give this HashMap to another function that creates the "encodedText", this is an ArrayList that stores the encoded version of the tex, we do this by going by every char in the .txt, and adding the encoded value of that char to the "encodedText" variable (recall that the encoded value of a char is saved also as an ArrayList in a HashMap).

Once we have the complete Huffman code, is where this arrayListToBitSet method comes into play, it converts the encodedText in ArrayList version, to a BitSet version, that is the one that is going to be stored in the CompressedText class.

The .txt file we are using for test purposes is just this: "aaabbc"

After the getting the Huffman tree and getting each character encoded version, the value pairs look like this (im gonna use 0 and 1, as to not write True and False):

(a, 0) (b, 11) (c, 10)

Therefore the encodedVersion of the text will look like this: 000111110

when this ArrayList is converted to BitSet, the output (I print the Bitset to console) looks like this: 0001111100000000000... (another 42 ceros) ... 00

So the first 9 bits are correct, but the following 53 bits shouldn't be there, and are cero. If I were to leave it like this this is what the decoded text would look like:

aaabbcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...

very different from the original. One way to fix this would be to include the correct size of the encoded message in the CompressFile class, so when I decompress it, I only check the exact number of bits, but I want to know if there is a way to not have to save this integer in the class (this would save me 8 bytes, not much in the grand scheme of things, coming to be 8 chars, but the purpose is to use as less memory as possible). Still if there is no other way, I will put that original size into the CompressedFile class.

Thanks in advance for the help :)


Solution

  • Notice what the constructor's documentation says:

    Creates a bit set whose initial size is large enough to explicitly represent bits with indices in the range 0 through nbits-1.

    The phrase "large enough" is key here. This constructor is not supposed to create a bit set that is exactly the size you specified - just large enough to store the number of bits you specified.

    Also note this paragraph in the class documentation:

    Every bit set has a current size, which is the number of bits of space currently in use by the bit set. Note that the size is related to the implementation of a bit set, so it may change with implementation.

    BitSet, at least in OpenJDK, is implemented by using a long[]. This explains why you see 64. The least it can do to create a bit set that is "large enough" to store 9 bits, is to allocate a long[] with length 1. A single long is 64 bits.

    I don't think it is even possible in the JVM to implement a bit set that can store exactly 9 bits. Even if you use a boolean[] to back it, each boolean is still one byte (as you may know already).

    so having extra ceros might lead to adding further characters in the decompressed version of the file

    If that is the case, you can check the length of the BitSet. It gives you how many bits there are in the bit set, minus the leading zero bits.