algorithmvector-search

Finding matching vector from millions of vectors


I have millions of unique 10-dimensional vectors. In these vectors, if a dimension is 0, it means this vector can match any value in this dimension. Now, I have a vector, and I want to find which of these million vectors matches it. No dimension in this vector can be 0. Could you please help me design an algorithm for this? For example.

vectors:
[
  [1, 1, 1, 1, ... 1],
  [2, 2, 0, 2, ... 2],  //In the third dimension any value can be matched
  [3, 0, 3, 3, ... 3],  //In the second dimension any value can be matched
  ...
]

Search:
input vector: [1, 1, 1, 1, ... 1] , output: [1, 1, 1, 1, ... 1]
input vector: [2, 2, 4, 2, ... 2] , output: [2, 2, 0, 2, ... 2]
input vector: [1, 2, 3, 4, ... 10], output: null

I tried using the KD-tree, but it doesn't satisfy my particular matching rules (maybe I don't know how to use it).

Any programming language will be OK.


Solution

  • Here is a simple approach. Memory shouldn't be too bad.

    def make_trie (vectors):
        trie = {}
        for vector in vectors:
            node = trie
            for i in range(len(vector) - 1, -1, -1):
                if vector[i] not in node:
                    node[vector[i]] = {}
                node = node[vector[i]]
        return trie
    
    def search_vector (trie, target, i=None):
        if i is None:
            i = len(target) - 1
        if i == -1:
            yield []
        else:
            if 0 in trie:
                for v in search_vector(trie[0], target, i-1):
                    v.append(0)
                    yield v
            if target[i] in trie:
                for v in search_vector(trie[target[i]], target, i-1):
                    v.append(target[i])
                    yield v
    
    trie = make_trie([
        [1, 1, 1, 1],
        [2, 2, 0, 2],
        [3, 0, 3, 3],
        [3, 3, 3, 3],
    ])
    
    for v in [
        [1, 1, 1, 1],
        [2, 2, 4, 2],
        [3, 8, 3, 3],
        [3, 3, 3, 3],
        [4, 4, 4, 4],
    ]:
        print(v, list(search_vector(trie, v)))