Tags: algorithm, set, compression, similarity, hamming-distance

Most efficient way to find intersections between many sets of numbers


I am trying to efficiently compress sets of numbers which look like this (one set per line):

19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 179 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 179 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387 45392
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45392
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45392 144554
19 20 23 24 27 29 32 35 69 97 99 119 122 129 130 134 136 137 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 45392 144554
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 45392 144554

You could easily have ~10K sets, each with ~10K entries. However, as you can see from the sample data, most of the data in the sets is redundant - with a few removals and a few additions for each new set. (Occasionally there's a big change, but this is rare).

I'd like to compress this data substantially, while still being able to expand any individual set cheaply (random access is needed when decompressing, as clarified in the edit below).

To keep the CPU cost of expansion minimal, I'm trying to build each set out of a set of common subsets - i.e. factoring out the most frequently recurring subsets of data, one level deep (no recursion).

To determine the common subsets to factor out, I've tried considering the sets line by line, and looking at what items are added and what items are removed. The additions are considered as new subsets, and as these build up over time, equal-sized subsets are coalesced together pairwise into new subsets. For instance, for the simple case of the Nth set being the integers 0 to N, you get:

({0}),
({0, 1}),
({0, 1}),({2}),
({0, 1, 2, 3}),
({0, 1, 2, 3}),({4}),
({0, 1, 2, 3}),({4, 5}),
({0, 1, 2, 3}),({4, 5}),({6}),
({0, 1, 2, 3, 4, 5, 6, 7}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9}),({10}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12, 13}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12, 13}),({14}),
({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}),

Then, if you keep track of the 'parent' components of each subset, when an item is removed, you can explode the given subset apart into its components (which subsequently will get coalesced again as time moves on). For instance, removing item 4 would produce something like:

({0, 1, 2, 3}),({5, 6, 7}),({8, 9, 10, 11}),({12, 13}),({14}),

...which would then coalesce to...

({0, 1, 2, 3, 8, 9, 10, 11}),({5, 6, 7}),({12, 13}),({14}),
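
For concreteness, here's a rough Python sketch of this coalesce/explode bookkeeping (names and details are illustrative, not the actual implementation):

    def coalesce(subsets, parents):
        # Repeatedly merge any two equal-sized subsets into one of twice the
        # size, recording in `parents` which pair each merged subset came from.
        merged = True
        while merged:
            merged = False
            by_size = {}
            for s in subsets:
                by_size.setdefault(len(s), []).append(s)
            for same_size in by_size.values():
                if len(same_size) >= 2:
                    a, b = same_size[0], same_size[1]
                    combined = a | b
                    parents[combined] = (a, b)
                    subsets.remove(a)
                    subsets.remove(b)
                    subsets.append(combined)
                    merged = True
                    break
        return subsets


    def remove_item(subsets, parents, item):
        # Explode the subset containing the removed item back into its parent
        # components, drop the item from whichever component holds it, then
        # let everything coalesce again.
        for s in list(subsets):
            if item not in s:
                continue
            subsets.remove(s)
            for part in parents.get(s, (s,)):
                part = part - {item} if item in part else part
                if part:
                    subsets.append(part)
            break
        return coalesce(subsets, parents)


    # Removing item 4 from ({0..7}), ({8..11}), ({12,13}), ({14}) as above:
    parents = {frozenset(range(8)): (frozenset(range(4)), frozenset(range(4, 8)))}
    subsets = [frozenset(range(8)), frozenset(range(8, 12)),
               frozenset({12, 13}), frozenset({14})]
    subsets = remove_item(subsets, parents, 4)
    # subsets now contains {0,1,2,3,8,9,10,11}, {5,6,7}, {12,13} and {14},
    # matching the coalesced result shown above.

Keeping the parent pairs around is what lets a removal undo only the merges involving the removed item, while every other shared subset stays intact.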

Empirically this works fairly well (roughly 5x improvement in disk space), but I'm worried that I'm missing a more obvious way to spot which subsets can be factored out most efficiently across the overall dataset.

I've also tried building a prefix trie to track which prefixes recur the most, and then factoring these out - except this uses quite a lot of storage, and doesn't help compress subsets which aren't prefixes. It also doesn't exploit the fact that the sets are unordered.

I've also tried looking at Signature Trees (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.7315&rep=rep1&type=pdf), but these seem to use a huge amount of disk storage when your datasets are large and not that sparse.

I could also do an O(N^2) search to compare the intersection of each set with every other, and track a histogram of which subsets recur the most, but O(N^2) will be painful for large datasets, and it's not obvious how to tune out the noise when comparing the intersections to spot the underlying common subsets.

TL;DR: what's the best way to spot structural similarity across a large number of sets in order to factor out recurring subsets?

Edit: I've clarified that random access is needed when decompressing. Also, I've published a real dataset to http://matrix.org/~matthew/expanded.out.xz. Warning: this 2MB .xz expands to 4.9GB of actual data... which illustrates the problem quite well, and why it's frustrating that I haven't found an approach which does better than 5x compression so far :/


Solution

  • We can combine three simple ideas:

    1. Encode the symmetric differences between successive sets (I think this is what Mark is suggesting).

    2. Encoding deltas is compact, but makes random access to a given set expensive. To fix this, emit the whole set periodically. One heuristic is to do so whenever we've emitted approximately as much in deltas as the whole set -- in theory, this costs only a constant factor more in storage, while limiting the total size of the deltas we have to scan to a constant factor more than the size of the set.

    3. Within each emitted record, sort the elements and delta-encode them as varints (i.e. write the gaps between consecutive values). This is a common encoding for posting lists, so there should be optimized implementations floating around.
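
    As a quick worked example of the gap + varint layout (my own numbers, purely illustrative):

    # Encoding the set {19, 20, 23, 300}: sort it and take the gaps between
    # consecutive values, then write each gap as a varint.
    values = sorted({19, 20, 23, 300})
    gaps = [v - p for p, v in zip([0] + values, values)]   # [19, 1, 3, 277]
    # As varints these gaps become the bytes 0x13 0x01 0x03 0x95 0x02 --
    # only the gap of 277 needs a second byte, since 277 > 127.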

    Here is a Python 3 encoder that compresses the given input to less than 5 MB. We also need an index recording where each full-set checkpoint starts (to support random access), but this won't add much.

    import fileinput
    import re
    import sys

    output = open("output", "wb")


    def emit_varint(n):
        # Write n as a varint: 7 bits per byte, least-significant group first,
        # with the high bit set on every byte except the last.
        buffer = []
        mask = 127
        while n > mask:
            buffer.append(128 | (n & mask))
            n >>= 7
        buffer.append(n)
        output.write(bytes(buffer))


    def emit_indices(delta):
        # Write a set as its size followed by its elements in ascending order,
        # gap-encoded: each element is written as its difference from the
        # previous one, which keeps the varints small.
        emit_varint(len(delta))
        prev = 0
        for x in sorted(delta):
            emit_varint(x - prev)
            prev = x


    delta_counter = 0        # total size of deltas emitted since the last full set
    delta_from = 0           # line number of the last full set (for the index)
    previous_indices = set()
    for i, line in enumerate(fileinput.input()):
        if i % 1000 == 0:
            print(i, file=sys.stderr)   # progress
        m = re.match(r"[^{}]*\{(\d+(,\d+)*)\}", line)
        if not m:
            continue
        indices = set(map(int, re.findall(r"\d+", m.group(1))))
        delta = indices ^ previous_indices
        if delta_counter + len(delta) > 2 * len(indices):
            # The deltas accumulated since the last checkpoint would exceed
            # twice the current set size: emit the whole set and start over.
            emit_indices(indices)
            delta_counter = 0
            delta_from = i
        else:
            emit_indices(delta)
            delta_counter += len(delta)
        previous_indices = indices
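
    For completeness, here is a minimal sketch of the random-access decoding side. This is an illustration rather than part of the answer's code: it assumes every input line produced a record, and that the index stores, for each requested line, the byte offset and line number of the most recent full-set checkpoint (which the encoder could record whenever it resets delta_from; line 0 also counts, since its first delta from the empty set is the full set).

    def read_varint(f):
        # Inverse of emit_varint: accumulate 7-bit groups, lowest group first;
        # a clear high bit marks the final byte.
        n, shift = 0, 0
        while True:
            b = f.read(1)[0]
            n |= (b & 127) << shift
            if b < 128:
                return n
            shift += 7


    def read_indices(f):
        # Inverse of emit_indices: a length prefix, then gap-encoded values.
        count = read_varint(f)
        result, prev = set(), 0
        for _ in range(count):
            prev += read_varint(f)
            result.add(prev)
        return result


    def decode_line(f, checkpoint_offset, checkpoint_line, target_line):
        # Seek to the most recent full-set checkpoint, then XOR in each
        # symmetric-difference delta until we reach the requested line.
        f.seek(checkpoint_offset)
        indices = read_indices(f)           # the checkpoint record is a full set
        for _ in range(target_line - checkpoint_line):
            indices ^= read_indices(f)      # each subsequent record is a delta
        return indices

    Thanks to the checkpoint heuristic in point 2, the deltas scanned here total at most a constant factor more than the size of the set being reconstructed.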