python · large-language-model · ollama

Constrain Llama3.2-vision output to a list of options


I have several images of animals in the same directory as the script. How can I modify the following script to process an image but force the output to only be a single selection from a list:

from pathlib import Path
import base64
import requests

def encode_image_to_base64(image_path):
    """Convert an image file to base64 string."""
    return base64.b64encode(image_path.read_bytes()).decode('utf-8')

def extract_text_from_image(image_path):
    """Send image to local Llama API and get text description."""
    base64_image = encode_image_to_base64(image_path)

    payload = {
        "model": "llama3.2-vision",
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": (
                    "With just one word, classify this image into one of these exact categories:\n"
                    "- dog\n"
                    "- cat\n"
                    "- butterfly\n"
                ),
                "images": [base64_image]
            }
        ]
    }

    response = requests.post(
        "http://localhost:11434/api/chat",
        json=payload,
        headers={"Content-Type": "application/json"}
    )

    return response.json().get('message', {}).get('content', 'No text extracted')

def process_directory():
    """Process all images in current directory and create text files."""
    for image_path in Path('.').glob('*'):
        if image_path.suffix.lower() in {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp'}:
            print(f"\nProcessing {image_path}...")

            text = extract_text_from_image(image_path)
            image_path.with_suffix('.txt').write_text(text, encoding='utf-8')
            print(f"Created {image_path.with_suffix('.txt')}")

process_directory()

However, despite various attempts at prompt engineering, I sometimes get answers that do more than just select from the list. For example, it may occasionally output "From the image, there is a winged insect, therefore my guess is "butterfly." ANSWER: Butterfly." If I define the list as allowed_options = ['dog', 'cat', 'butterfly'], I want it to output only a single string from that list and nothing else.


Solution

  • Ollama supports structured outputs for llama3.2-vision via the format field of the request: you pass a JSON schema for the response, and the model is constrained to follow it. Your request should look like this:

    payload = {
        "model": "llama3.2-vision",
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": (
                    "Classify this image into one of these exact categories:\n"
                    "- dog\n"
                    "- cat\n"
                    "- butterfly\n"
                ),
                "images": [base64_image]
            }
        ],
        "format": {
            "type": "object",
            "properties": {
                "animal": {
                    "type": "string",
                    "enum": ["dog", "cat", "butterfly"]
                }
            },
            "required": ["animal"]
        }
    }
    

    The message content will now be a JSON string with an animal attribute. For example:

    {
      "animal": "dog"
    }
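
    Since the content comes back as a JSON string rather than a bare word, you still need to parse it and extract the value before writing the .txt files. One possible sketch, reusing the allowed_options list from the question and adding a fallback to "unknown" (my own choice of sentinel, not part of the API) in case the reply is malformed:

    ```python
    import json

    allowed_options = ['dog', 'cat', 'butterfly']

    def parse_animal(content):
        """Parse the model's JSON reply and return one allowed option.

        Falls back to 'unknown' if the reply is not valid JSON or the
        value is not in allowed_options - belt-and-braces validation,
        since the schema's enum should already enforce this.
        """
        try:
            animal = json.loads(content).get('animal', '')
        except json.JSONDecodeError:
            return 'unknown'
        return animal if animal in allowed_options else 'unknown'
    ```

    In extract_text_from_image you would then return parse_animal(response.json().get('message', {}).get('content', '')), so each .txt file contains only a single word from the list.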