typescript postgresql openai-api supabase openaiembeddings

How can I optimize the data I am embedding to increase vector search result quality?


I am trying to implement semantic/vector search for images.

To do that, I am using gpt-4-mini to analyze an image and create data from it with this prompt:

Your job is to generate json data from a given image.
          
            Return your output in the following format:
            {
            description: "A description of the image. Only use relevant keywords.",
            text: "If the image contains text, include that here, otherwise remove this field",
            keywords: "Keywords that describe the image",
            artstyle: "The art style of the image",
            text_language: "The language of the text in the image, otherwise remove this field",
            design_theme : "If the image has a theme (hobby, interest, occupation etc.), include that here, otherwise remove this field",
            }

The data I am getting back is pretty accurate (in my eyes). I am then embedding the json with the "text-embedding-3-small" model.

The problem is that the search results are pretty bad.

For example: I have 2 images with only text. One says "straight outta knee surgery" and one says "straight outta valhalla".

When I search for "straight outta", I have to turn down the similarity threshold to 0.15 to get both results.

This is my postgres search function:

CREATE OR REPLACE FUNCTION search_design_items (
  query_embedding vector (1536),
  match_threshold FLOAT,
  match_count INT
) RETURNS TABLE (
  id BIGINT
) AS $$
BEGIN
    RETURN QUERY
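    -- Qualify id with the table name; a bare "id" is ambiguous with the output column
    SELECT design_management_items.id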
    FROM public.design_management_items
    WHERE 1 - (design_management_items.description_vector <=> query_embedding) > match_threshold
    ORDER BY (design_management_items.description_vector <=> query_embedding) asc
    LIMIT match_count;
END;
$$ LANGUAGE plpgsql;

When I go to higher values (0.5), there are pretty much no results at all. This seems wrong, because every tutorial I have seen uses a threshold of 0.7 or higher.

What do I need to change in order to improve the accuracy of my search results?


Solution

  • Try performing a hybrid search. Most vector databases offer hybrid search functionality.

    As stated in the official Weaviate blog:

    Hybrid search is a technique that combines multiple search algorithms to improve the accuracy and relevance of search results. It uses the best features of both keyword-based search algorithms with vector search techniques. By leveraging the strengths of different algorithms, it provides a more effective search experience for users.

    In simple terms, performing a hybrid search means that you search with both keywords and embedding vectors, and the alpha parameter sets the relative weight of the two. For example, setting alpha to 0 means keyword search only, while setting alpha to 1 means embedding vector search only.

    I've created a project with a hybrid search in the past where you can search for Lex Fridman's podcast insights without watching the full episodes. See the demonstration.

    Here's the weaviateHybridSearch.ts file:

    "use server";
    
    import weaviate from "weaviate-client";
    import { PodcastType } from "@/app/types/podcast";
    
    /**
     * Queries the Podcast collection based on a search term and alpha value.
     *
     * @param {string} searchTerm - The search term to query for.
     * @param {number} alpha - The alpha value to use for the hybrid search.
     * @return {Promise<PodcastType[]>} - The array of PodcastType objects representing the search results.
     */
    export async function queryPodcasts(searchTerm: string, alpha: number) {
    
      // Connect to the local Weaviate instance
      const client = await weaviate.connectToLocal();
    
      // Get the Podcast collection
      const podcastCollection = await client.collections.get<
        Omit<PodcastType, "distance">
      >("Podcast");
    
      // Perform the hybrid search on the Podcast collection
      const { objects } = await podcastCollection.query.hybrid(searchTerm, {
        limit: 10,
        alpha: alpha,
        returnMetadata: ["score"],
        returnProperties: ["number", "guest", "title", "transcription"],
      });
    
      // Process the results; note that "distance" holds the hybrid score here
      const podcasts: PodcastType[] = objects.map((podcast: any) => ({
        ...podcast.properties,
        distance: podcast.metadata?.score ?? 0,
      }));
    
      // Return the podcasts
      return podcasts;
    }
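
To make the alpha weighting described above more concrete, here is a minimal sketch of how keyword and vector scores could be blended. It is only an illustration of the idea, not Weaviate's actual implementation; `ScoredItem`, `normalize`, and `hybridRank` are hypothetical names, and min-max normalization is an assumption chosen to keep the example simple.

```typescript
// Hypothetical types and helpers for illustration; not part of any library.
interface ScoredItem {
  id: string;
  score: number;
}

// Min-max normalize scores into [0, 1] so both result lists are comparable.
function normalize(items: ScoredItem[]): Record<string, number> {
  const result: Record<string, number> = {};
  if (items.length === 0) return result;
  let min = Infinity;
  let max = -Infinity;
  for (const item of items) {
    min = Math.min(min, item.score);
    max = Math.max(max, item.score);
  }
  for (const item of items) {
    // If all scores are equal, treat every item as a full match.
    result[item.id] = max === min ? 1 : (item.score - min) / (max - min);
  }
  return result;
}

// Blend the two lists: alpha = 0 is keyword only, alpha = 1 is vector only.
function hybridRank(
  keywordResults: ScoredItem[],
  vectorResults: ScoredItem[],
  alpha: number,
): ScoredItem[] {
  const keyword = normalize(keywordResults);
  const vector = normalize(vectorResults);
  const ids = Object.keys({ ...keyword, ...vector });
  return ids
    .map((id) => ({
      id,
      // An item missing from one list contributes 0 for that side.
      score: alpha * (vector[id] ?? 0) + (1 - alpha) * (keyword[id] ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}
```

For a query like "straight outta", both images would likely score well on the keyword side, so a lower alpha should let them surface even when their embedding similarity is low.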