amazon-s3 weaviate vector-database

Optimizing Weaviate for Image Embedding Search without Storing Images


I’m currently working on a project where I’m using Weaviate as a vector database to store and search for images based on their embeddings. The images themselves are stored in an S3 bucket. My goal is to leverage Weaviate’s capabilities solely for storing and searching image embeddings, while keeping the actual image files in the S3 bucket.

As of now, I’ve successfully configured Weaviate to store both the image embeddings and the images themselves, but I’m interested in optimizing this setup to conserve storage space and streamline the search process. I’ve been through the documentation, but I couldn’t find a way to disable the storage of image files in Weaviate.

Could anyone guide me on how to configure Weaviate to store only the embeddings and utilize it purely as a search engine for images without storing the actual image files? Your insights and suggestions would be greatly appreciated!

Thanks in advance for your help!


Currently, I am using the following schema:

const schemaConfig = {
    "class": "Product",
    "description": "Product images",
    "moduleConfig": {
        "img2vec-neural": {
            "imageFields": [
                "image"
            ]
        }
    },
    "properties": [
        {
            "dataType": [
                "blob"
            ],
            "description": "Product image",
            "name": "image"
        },
        {
            "dataType": [
                "text"
            ],
            "description": "label name (description) of the given image.",
            "name": "labelName"
        }
    ],
    "vectorIndexType": "hnsw",
    "vectorizer": "img2vec-neural"
}

Docker compose file:

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.20.5
    ports:
    - 8080:8080
    restart: on-failure:0
    environment:
      IMAGE_INFERENCE_API: 'http://i2v-neural:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'img2vec-neural'
      ENABLE_MODULES: 'img2vec-neural'
      CLUSTER_HOSTNAME: 'node1'
  i2v-neural:
    image: semitechnologies/img2vec-pytorch:resnet50
    environment:
      ENABLE_CUDA: '0'
...

One approach might be to use the custom vector API. In that case I thought of using the Weaviate img2vec embedder, but I couldn't figure out how to use it separately.

Alternatively, I could host the image embedder myself, but I would prefer to use a ready-made solution.


Solution

  • TL;DR: I solved this by exposing the API of the img2vec-neural container. It has a /vectors endpoint that accepts a POST request with the body {id: TEMP_FILENAME, image: BASE64_IMG} and returns an object with a vector attribute. After that, I followed the custom vector API tutorial.


    Remove the img2vec-neural module from the Weaviate environment, and expose port 8080 of the i2v-neural container (here it is mapped to host port 8081).

    weaviate:
        ...
        environment:
          QUERY_DEFAULTS_LIMIT: 25
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
          PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
          CLUSTER_HOSTNAME: 'node1'
    i2v-neural:
        image: semitechnologies/img2vec-pytorch:resnet50
        ports:
        - 8081:8080
        ...
    

    Here is a custom JavaScript function that produces a vector for an image:

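    // Sends a base64-encoded image to the img2vec-neural container and returns its embedding.
    export const vectorizeImage = async (b64Img) => {

        const res = await fetch('http://localhost:8081/vectors', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify({
                id: "image.jpg", // temporary filename
                image: b64Img,
            }),
        });

        const data = await res.json();

        // Throw instead of returning undefined, so a failed call never stores an object without a vector.
        if (!res.ok || data.error) {
            throw new Error(`Vectorization failed: ${data.error ?? res.status}`);
        }

        return data.vector;
    }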
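
    vectorizeImage expects the image as a base64 string. When the image sits behind a URL (an S3 object URL, for instance), a small helper can download and encode it first. The sketch below is one possible version of the getImageUrlBase64 helper used in the item-creation step. The original answer does not show that helper, so the implementation, the optional fetchImpl parameter (added only to make it easy to test), and the reliance on Node 18+ for a global fetch are all assumptions.

```javascript
// Hypothetical helper: downloads an image and returns it base64-encoded.
// fetchImpl defaults to the global fetch, which assumes Node 18 or newer.
const getImageUrlBase64 = async (imageUrl, fetchImpl = fetch) => {
    const res = await fetchImpl(imageUrl);
    if (!res.ok) {
        throw new Error(`Could not download ${imageUrl}: HTTP ${res.status}`);
    }
    // arrayBuffer() yields the raw bytes; Buffer (Node-only) encodes them as base64.
    const bytes = Buffer.from(await res.arrayBuffer());
    return bytes.toString('base64');
};
```

    For private buckets the URL would need to be a presigned one, or the download would have to go through the S3 SDK instead.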
    

    Create a schema. Notice that here you don't specify a vectorizer.

    const schemaConfig = {
        class: "Product",
        vectorIndexType: "hnsw",
    }
    
    await client
        .schema
        .classCreator()
        .withClass(schemaConfig)
        .do();
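
    If you would rather have the class state outright that vectors come from outside Weaviate, one option is to set the vectorizer to "none" and declare the properties up front. This is a sketch, not part of the original setup: the property names mirror the ones used when creating items, and the text data type and descriptions are assumptions.

```javascript
// Sketch: a class definition that opts out of server-side vectorization.
// Assumption: item_no and image_url are stored as plain text properties.
const explicitSchemaConfig = {
    class: "Product",
    description: "Products searched by externally computed image vectors",
    vectorizer: "none", // vectors are supplied by the client via withVector()
    vectorIndexType: "hnsw",
    properties: [
        { name: "item_no", dataType: ["text"], description: "Product item number" },
        { name: "image_url", dataType: ["text"], description: "URL of the image file, e.g. in S3" },
    ],
};
```

    This object can be passed to withClass() in place of the shorter one. No blob property is declared, so no image data ends up in Weaviate.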
    

    Here is how to create an item in the DB with some properties.

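    // Only the fields that are stored in Weaviate are needed here.
    const { item_no, image_url } = product;
    // getImageUrlBase64 is a helper that returns the image at image_url as a base64 string.
    const b64 = await getImageUrlBase64(image_url);
    // Compute the embedding outside Weaviate, then attach it with withVector() below.
    const vector = await vectorizeImage(b64);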
    
    await client.data.creator()
        .withClassName('Product')
        .withProperties({
            item_no: item_no,
            image_url: image_url,
        })
        .withVector(vector)
        .do();
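
    Searching works the same way in reverse: vectorize the query image, then ask Weaviate for the nearest stored vectors. The sketch below builds a GraphQL Get query with a nearVector argument and posts it to Weaviate's /v1/graphql endpoint. The function names, the localhost:8080 address, and the selected fields are assumptions for illustration. The JavaScript client's client.graphql.get() builder with withNearVector({ vector }) should give the same result if you prefer it over raw HTTP.

```javascript
// Sketch: build a GraphQL query that returns the products closest to a vector.
const buildNearVectorQuery = (vector, limit = 5) => `{
  Get {
    Product(nearVector: {vector: ${JSON.stringify(vector)}}, limit: ${limit}) {
      item_no
      image_url
      _additional { distance }
    }
  }
}`;

// Assumption: Weaviate is reachable at localhost:8080, as in the compose file.
const searchByVector = async (vector, limit = 5, baseUrl = 'http://localhost:8080') => {
    const res = await fetch(`${baseUrl}/v1/graphql`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query: buildNearVectorQuery(vector, limit) }),
    });
    const { data, errors } = await res.json();
    if (errors) {
        throw new Error(`Search failed: ${JSON.stringify(errors)}`);
    }
    // Each hit carries image_url, so the image itself can be loaded from S3.
    return data.Get.Product;
};
```

    A query image would then be handled with something like searchByVector(await vectorizeImage(b64)), and the image_url of each hit points back to the file in the bucket.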