I am trying to implement semantic/vector search for images.
To do that, I am using gpt-4o-mini to analyze each image and generate data from it with this prompt:
Your job is to generate json data from a given image.
Return your output in the following format:
{
description: "A description of the image. Only use relevant keywords.",
text: "If the image contains text, include that here, otherwise remove this field",
keywords: "Keywords that describe the image",
artstyle: "The art style of the image",
text_language: "The language of the text in the image, otherwise remove this field",
design_theme: "If the image has a theme (hobby, interest, occupation etc.), include that here, otherwise remove this field",
}
The data I am getting back is pretty accurate (in my eyes). I am then embedding the json with the "text-embedding-3-small" model.
The problem is that the search results are pretty bad.
For example: I have 2 images with only text. One says "straight outta knee surgery" and one says "straight outta valhalla".
When I search for "straight outta", I have to lower the similarity threshold to 0.15 to get both results.
This is my postgres search function:
CREATE OR REPLACE FUNCTION search_design_items (
  query_embedding vector(1536),
  match_threshold FLOAT,
  match_count INT
) RETURNS TABLE (
  id BIGINT
) AS $$
BEGIN
  -- The id column is table-qualified because a bare "id" is ambiguous
  -- with the "id" output column declared in RETURNS TABLE.
  RETURN QUERY
  SELECT design_management_items.id
  FROM public.design_management_items
  WHERE 1 - (design_management_items.description_vector <=> query_embedding) > match_threshold
  ORDER BY (design_management_items.description_vector <=> query_embedding) ASC
  LIMIT match_count;
END;
$$ LANGUAGE plpgsql;
When I raise the threshold (to 0.5, for example), there are pretty much no results at all. This seems wrong, because every tutorial I have seen uses a threshold of 0.7 or higher.
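For reference, pgvector's `<=>` operator is cosine distance, so `1 - (a <=> b)` in the function above is cosine similarity. The following is a minimal in-memory sketch of the same filter, sort, and limit logic, which can help when checking what a given threshold actually admits. The `Item` type and the `searchItems` helper are hypothetical names made up for illustration; they are not part of pgvector or of the schema above.

```typescript
// Hypothetical stand-in for one row of the table: an id and its embedding.
interface Item {
  id: number;
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the SQL: keep rows whose similarity exceeds the threshold,
// order by similarity (highest first), and return at most matchCount ids.
function searchItems(
  items: Item[],
  queryEmbedding: number[],
  matchThreshold: number,
  matchCount: number
): number[] {
  return items
    .map((item) => ({
      id: item.id,
      similarity: cosineSimilarity(item.vector, queryEmbedding),
    }))
    .filter((row) => row.similarity > matchThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount)
    .map((row) => row.id);
}
```

Logging the raw similarity values for a few known-good pairs, rather than only the ids that pass the filter, is one way to see where a sensible threshold sits for this particular model and input format.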
What do I need to change in order to improve the accuracy of my search results?
Try a hybrid search. Many vector databases offer hybrid search functionality.
As stated in the official Weaviate blog:
Hybrid search is a technique that combines multiple search algorithms to improve the accuracy and relevance of search results. It uses the best features of both keyword-based search algorithms with vector search techniques. By leveraging the strengths of different algorithms, it provides a more effective search experience for users.
In simple terms, performing a hybrid search means that you search with both keywords and embedding vectors, where you set the alpha parameter to weight the two. For example, setting alpha to 0 means keyword search only, while setting alpha to 1 means embedding vector search only.
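To make the role of alpha concrete, here is a simplified sketch of a weighted blend of two scores. This is an illustration only: Weaviate's actual fusion algorithms work differently under the hood, and the `ScoredDoc` type, `hybridScore`, and `rankHybrid` are hypothetical names. It assumes both scores are already normalized to the range 0 to 1.

```typescript
// Hypothetical document with a keyword score and a vector score,
// both assumed to be normalized to the range [0, 1].
interface ScoredDoc {
  id: string;
  keywordScore: number;
  vectorScore: number;
}

// alpha = 0 uses the keyword score only, alpha = 1 the vector score only.
function hybridScore(
  keywordScore: number,
  vectorScore: number,
  alpha: number
): number {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError("alpha must be between 0 and 1");
  }
  return alpha * vectorScore + (1 - alpha) * keywordScore;
}

// Rank documents by their blended score, highest first.
function rankHybrid(docs: ScoredDoc[], alpha: number): string[] {
  return docs
    .map((doc) => ({
      id: doc.id,
      score: hybridScore(doc.keywordScore, doc.vectorScore, alpha),
    }))
    .sort((a, b) => b.score - a.score)
    .map((doc) => doc.id);
}
```

In the "straight outta" case from the question, an item whose text contains the exact phrase would get a high keyword score even when its vector score is low, so a lower alpha would tend to push it up the ranking.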
I've created a project with a hybrid search in the past where you can search for Lex Fridman's podcast insights without watching the full episodes. See the demonstration.
Here's the weaviateHybridSearch.ts file:
"use server";

import weaviate from "weaviate-client";
import { PodcastType } from "@/app/types/podcast";

/**
 * Queries the Podcast collection based on a search term and alpha value.
 *
 * @param {string} searchTerm - The search term to query for.
 * @param {number} alpha - The alpha value to use for the hybrid search
 *   (0 = keyword search only, 1 = vector search only).
 * @return {Promise<PodcastType[]>} - The array of PodcastType objects representing the search results.
 */
export async function queryPodcasts(
  searchTerm: string,
  alpha: number
): Promise<PodcastType[]> {
  // Connect to the local Weaviate instance
  const client = await weaviate.connectToLocal();

  // Get the Podcast collection
  const podcastCollection = await client.collections.get<
    Omit<PodcastType, "distance">
  >("Podcast");

  // Perform the hybrid search on the Podcast collection
  const { objects } = await podcastCollection.query.hybrid(searchTerm, {
    limit: 10,
    alpha: alpha,
    returnMetadata: ["score"],
    returnProperties: ["number", "guest", "title", "transcription"],
  });

  // Copy the properties and store the hybrid score in the `distance` field,
  // falling back to 0 if no score was returned
  const podcasts: PodcastType[] = objects.map((podcast: any) => ({
    ...podcast.properties,
    distance: podcast.metadata?.score ?? 0,
  }));

  // Return the podcasts
  return podcasts;
}