Tags: c++, machine-learning, data-mining, bitset, bitvector

Best way to transform an array of ints into a bitset representation in C++?


I've seen some similar questions on the topic, but I'm relatively new to programming and couldn't make sense of some of the language used in the solutions.

Suppose I have two finite sets A and B represented as arrays:

int A[2] = {1, 3};
int B[2] = {1, 2};

I want the bitsets (column vectors v1 and v2) that represent A and B:

        v1  v2
    (1)  1   1
    (2)  0   1
    (3)  1   0

This way I can easily sum row (k) and get the count of appearances for the value k across all of my sets A_1 to A_n.

I am looking for the fastest possible way to do this. I can roughly imagine how I might first initialize a matrix of bit vectors (setting each entry to 0) and then loop through each set A_i, setting the corresponding entries of my matrix to 1, but this approach seems wasteful because I still have to loop through every element of every set A_i.

I am trying to avoid looping through every element of every set by instead getting the counts of appearances from sums over rows of bits, but I can't figure out how to make this transformation elegantly and in a time-efficient manner.
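
For concreteness, here is roughly the direct construction I can imagine (MAX_VALUE here is just a placeholder upper bound on the elements, and this still touches every element of every set once):

#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t MAX_VALUE = 63;  // assumed upper bound on set elements

int main() {
  std::vector<std::vector<int>> sets = {{1, 3}, {1, 2}};  // A_1 .. A_n

  // One bitset column per set: columns[i][k] == 1 iff k is in A_i.
  std::vector<std::bitset<MAX_VALUE + 1>> columns(sets.size());
  for (std::size_t i = 0; i < sets.size(); ++i)
    for (int k : sets[i])
      columns[i].set(k);

  // "Summing row k" = adding up bit k of every column.
  int count = 0;
  for (const auto& col : columns)
    count += col.test(1);  // k = 1 appears in both sets, so count == 2
}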

Motivation: I'm trying to implement the ID3 decision tree algorithm and am trying to use bit vectors to calculate proportions of labels for entropy calculation.


Solution

  • The key point is that you don't explicitly form the sets just to build bitsets from them; you construct the bitsets directly, in place of the sets.

    In short, you have

    std::vector<double> unsortedDataInRow = ...;  // numDataInRow values
    std::vector<int> unsortedLabels = ...;        // one label in [0, numLabels) per value
    

    You then obtain

    std::vector<unsigned> sortedIndices = getSortedIndices(unsortedDataInRow);
    

    so that unsortedDataInRow[sortedIndices[i]] increases with i (a sketch of getSortedIndices follows the fill loop below). But rather than building a std::vector<int> sortedLabels from them, you fill a

    std::vector<std::vector<bool>> bitsets(numLabels, std::vector<bool>(numDataInRow));
    // this gets zero-initialized
    

    in such a way that bitsets[label][i] == (unsortedLabels[sortedIndices[i]] == label):

    for (std::size_t i = 0; i < sortedIndices.size(); ++i)
      bitsets[unsortedLabels[sortedIndices[i]]][i] = true;
    
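    getSortedIndices isn't defined above; a minimal sketch of the usual argsort it presumably stands for (my assumption, not code from the question) is:

    #include <algorithm>  // std::sort
    #include <numeric>    // std::iota
    #include <vector>

    std::vector<unsigned> getSortedIndices(const std::vector<double>& data) {
      std::vector<unsigned> indices(data.size());
      std::iota(indices.begin(), indices.end(), 0u);  // fill with 0, 1, 2, ...
      std::sort(indices.begin(), indices.end(),
                [&](unsigned a, unsigned b) { return data[a] < data[b]; });
      return indices;
    }
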

    Building the bitsets this way helps performance because you (presumably) do the label counting in InfoGain, i.e. determining P(c), much more often than you do the construction above; with the bitsets in hand, that counting can be done much faster through popcnt than through counts[labels[i]]++;.

    Note that this is just a sketch - std::vector<bool> doesn't have an inbuilt way to get a popcnt, so you'd have to hope that your compiler recognizes a handwritten counting loop. Alternatively, use boost::dynamic_bitset (its count() member does exactly this), some other library, or a handwritten word-based bitset.
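
    For illustration, here is a minimal handwritten word-based bitset with a popcount; it assumes C++20 for std::popcount, and Bitset64 is a placeholder name of mine, not a library type:

    #include <bit>      // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Bitset64 {
      std::vector<std::uint64_t> words;

      explicit Bitset64(std::size_t bits) : words((bits + 63) / 64, 0) {}

      void set(std::size_t i) { words[i / 64] |= std::uint64_t(1) << (i % 64); }

      // Total number of set bits - the popcnt that replaces counts[labels[i]]++.
      std::size_t count() const {
        std::size_t n = 0;
        for (std::uint64_t w : words)
          n += std::popcount(w);
        return n;
      }
    };

    With optimizations enabled, compilers typically lower that counting loop to hardware popcnt instructions.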