I have a search on my site that does both traditional full-text search and embedding-based search. So, for example, when you search 'red balloon' I want both the text and image results. The problem is that not all search terms make sense for object detection (like, say, 'William', or an identifier such as a driver's license number). I know there are libraries that will tell me if a word is a noun, but is there anything that tells me whether a phrase is searchable? So like this:
An idea to start with:
Before running the scripts, install the dependencies:
pip install spacy nltk pywsd
Then install the small spaCy model:
python -m spacy download en_core_web_sm
Before the first run, download the necessary NLTK packages:
init.py
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt_tab')
When this all is done :
main.py:
import spacy
from pywsd.lesk import simple_lesk

nlp = spacy.load("en_core_web_sm")

def is_visually_searchable(phrase):
    doc = nlp(phrase)
    # Reject phrases containing numbers or numeric identifiers
    if any(token.like_num or token.is_digit for token in doc):
        return False
    # Require at least one noun chunk
    if not list(doc.noun_chunks):
        return False
    # Disambiguate each noun in context and reject abstract senses
    for token in doc:
        if token.pos_ in ["NOUN", "PROPN"]:
            synset = simple_lesk(phrase, token.text)
            if synset and synset.lexname() in ["noun.attribute", "noun.cognition", "noun.communication"]:
                return False
    return True
phrases = [
    "red apple",
    "big idea",
    "driver's license",
    "suspended license",
    "Veronica",
    "DL 1234",
    "flying plane",
    "available items",
]

for phrase in phrases:
    print(f"{phrase}: {'YES' if is_visually_searchable(phrase) else 'NO'}")
Results:
> red apple: YES
> big idea: NO
> driver's license: YES
> suspended license: YES
> Veronica: YES
> DL 1234: NO
> flying plane: YES
> available items: NO
>
You can see that 'Veronica' and 'suspended license' are still YES.
Additional custom filters:
import requests

def load_names_from_url(url: str) -> set:
    try:
        response = requests.get(url)
        response.raise_for_status()
        return {line.strip().lower() for line in response.text.splitlines()}
    except requests.RequestException:
        return set()

url = 'https://raw.githubusercontent.com/dominictarr/random-name/refs/heads/master/first-names.txt'
NAMES_SET = load_names_from_url(url)
Then add this check inside the is_visually_searchable function:
if any(token.text.lower() in NAMES_SET for token in doc if not token.like_num):
    return False
Results:
> red apple: YES
> big idea: NO
> driver's license: YES
> suspended license: YES
> Veronica: NO
> DL 1234: NO
> flying plane: YES
> available items: NO
>
Check lexical categories to tune the results:
You can inspect the lexical categories of a word like this:
from nltk.corpus import wordnet as wn

word = "suspended"
synsets = wn.synsets(word, pos=wn.VERB)
for syn in synsets:
    print(f"Word: {word}, Lexname: {syn.lexname()}, Definition: {syn.definition()}")
Results:
> Word: suspended, Lexname: verb.contact, Definition: hang freely
> Word: suspended, Lexname: verb.change, Definition: cause to be held in suspension in a fluid
> Word: suspended, Lexname: verb.social, Definition: bar temporarily; from school, office, etc.
> Word: suspended, Lexname: verb.change, Definition: stop a process or a habit by imposing a freeze on it
> Word: suspended, Lexname: verb.change, Definition: make inoperative or stop
> Word: suspended, Lexname: verb.stative, Definition: render temporarily ineffective
>
You can pass wn.NOUN, wn.VERB, wn.ADJ, or wn.ADV as the pos argument.
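Once you have inspected the lexnames that matter for your data, you can collect them in one helper instead of an inline list, and reuse it for nouns, verbs, and adjectives alike. The blocklist below is an assumption to tune against your own queries, not a definitive set:

```python
# Hypothetical blocklist of abstract (hard-to-depict) lexical categories;
# extend it as you inspect lexnames for your own search terms.
ABSTRACT_LEXNAMES = {
    "noun.attribute", "noun.cognition", "noun.communication",
    "noun.state", "verb.social", "verb.stative",
}

def sense_is_visual(lexname: str) -> bool:
    """True when a WordNet lexname is not in the abstract blocklist."""
    return lexname not in ABSTRACT_LEXNAMES

print(sense_is_visual("noun.food"))    # a red apple is depictable
print(sense_is_visual("verb.social"))  # 'suspended' (bar temporarily) is not
```

Inside is_visually_searchable you would then call sense_is_visual(synset.lexname()) instead of comparing against an inline list, and widen the POS check beyond NOUN/PROPN so that a word like 'suspended' gets inspected too.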