I would like to use Lucene to run a nearest neighbour search. I'm using Lucene 9.0.0 on JVM 11. I did not find much documentation and mainly tried to piece things together using the existing tests.
I wrote a small test which prepares a HnswGraph
but so far the search does not yield the expected result. I setup a set of random vectors and add a final vector (0.99f,0.01f) which is very close to my search target.
The search unfortunately never returns the expected value. I'm not sure where my error is. I assume it may be related with the insert and document id order.
Maybe someone who is more familar with lucene might be able to provide some feedback. Is my approach correct? I'm using Documents only for persistence.
HnswGraphBuilder builder = new HnswGraphBuilder(vectors, similarityFunction, maxConn, beamWidth, seed);
HnswGraph hnsw = builder.build(vectors);
// Run a search
NeighborQueue nn = HnswGraph.search(
new float[] { 1, 0 },
10,
10,
vectors.randomAccess(), // ? Why do I need to specify the graph values again?
similarityFunction, // ? Why can I specify a different similarityFunction for search. Should that not be the same that was used for graph creation?
hnsw,
null,
new SplittableRandom(RandomUtils.nextLong()));
The whole test source can be found here: https://gist.github.com/Jotschi/cea21a72412bcba80c46b967e9c52b0f
Since this is a top search hit for people like me looking for lucene hnsw examples (there just aren't many out there), here's what this looks like as of Lucene 9.6, without bringing in higher-level Lucene classes:
// Create a random vector universe
var vectorDimensions = 1500;
var universeSize = 2_000;
var universe = new ArrayList<float[]>(universeSize);
for (var i = 0; i < universeSize; i++) {
universe.add(randomVector(vectorDimensions));
}
// construct a HNSW graph of the universe
System.out.println("Constructing HNSW graph...");
var ravv = new ListRandomAccessVectorValues(universe, vectorDimensions);
var builder = HnswGraphBuilder.create(ravv, VectorEncoding.FLOAT32, similarityFunction, 16, 100, random.nextInt());
var hnsw = builder.build(ravv.copy());
// search for the nearest neighbors of a random vector
var queryVector = randomVector(vectorDimensions);
System.out.println("Searching for top 10 neighbors of a random vector");
var nn = HnswGraphSearcher.search(queryVector, 10, ravv.copy(), VectorEncoding.FLOAT32, similarityFunction, hnsw, null, Integer.MAX_VALUE);
for (var i : nn.nodes()) {
var neighbor = universe.get(i);
var similarity = similarityFunction.compare(queryVector, neighbor);
System.out.printf(" ordinal %d (similarity: %s)%n", i, similarity);
}
The implementation of ListRandomAccessVectorValues
is straightforward and can be found here: https://github.com/jbellis/hnswdemo
As you note, the API is a bit wonky, and while you can specify different encodings or similarities to the search than from the builder, you will obviously get nonsense results.
The reason you have to pass the vector values in to the search function is that for efficiency, the index itself does not store a copy of the vectors themselves, but only an int offset into the values provider.