I have a search on my site that does both traditional full-text search and embedding-based search. So, for example, when you search 'red balloon' I want both the text and image results. The problem is that not all search terms make sense for object detection (like, say, 'William', or an identifier such as a driver's license number). I know there are libraries that will tell me if a word is a noun, but is there anything that tells me whether a phrase is searchable? So like this:
An idea to start with:
Before running the scripts, install the dependencies:
pip install spacy nltk pywsd
Then install the small spaCy model:
python -m spacy download en_core_web_sm
Before the first run, download the necessary NLTK packages:
init.py
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt_tab')
When this all is done :
main.py:
import spacy
from pywsd.lesk import simple_lesk

nlp = spacy.load("en_core_web_sm")

def is_visually_searchable(phrase):
    doc = nlp(phrase)
    # Reject phrases containing numbers or numeric identifiers
    if any(token.like_num or token.is_digit for token in doc):
        return False
    # Require at least one noun chunk
    if not list(doc.noun_chunks):
        return False
    # Disambiguate each noun in context and reject abstract senses
    for token in doc:
        if token.pos_ in ["NOUN", "PROPN"]:
            synset = simple_lesk(phrase, token.text)
            if synset and synset.lexname() in ["noun.attribute", "noun.cognition", "noun.communication"]:
                return False
    return True
phrases = [
    "red apple",
    "big idea",
    "driver's license",
    "suspended license",
    "Veronica",
    "DL 1234",
    "flying plane",
    "available items",
]

for phrase in phrases:
    print(f"{phrase}: {'YES' if is_visually_searchable(phrase) else 'NO'}")
Results:
> red apple: YES
> big idea: NO
> driver's license: YES
> suspended license: YES
> Veronica: YES
> DL 1234: NO
> flying plane: YES
> available items: NO
>
You can see that 'Veronica' and 'suspended license' are still YES.
Additional custom filters:
import requests

def load_names_from_url(url: str) -> set:
    try:
        response = requests.get(url)
        response.raise_for_status()
        return {line.strip().lower() for line in response.text.splitlines()}
    except requests.RequestException:
        return set()

url = 'https://raw.githubusercontent.com/dominictarr/random-name/refs/heads/master/first-names.txt'
NAMES_SET = load_names_from_url(url)
Then add this check inside the is_visually_searchable function:
if any(token.text.lower() in NAMES_SET for token in doc if not token.like_num):
    return False
Results:
> red apple: YES
> big idea: NO
> driver's license: YES
> suspended license: YES
> Veronica: NO
> DL 1234: NO
> flying plane: YES
> available items: NO
>
Check lexical categories to tune the results:
You can inspect the lexical categories of a word like this:
from nltk.corpus import wordnet as wn

word = "suspended"
synsets = wn.synsets(word, pos=wn.VERB)
for syn in synsets:
    print(f"Word: {word}, Lexname: {syn.lexname()}, Definition: {syn.definition()}")
Results:
> Word: suspended, Lexname: verb.contact, Definition: hang freely
> Word: suspended, Lexname: verb.change, Definition: cause to be held in suspension in a fluid
> Word: suspended, Lexname: verb.social, Definition: bar temporarily; from school, office, etc.
> Word: suspended, Lexname: verb.change, Definition: stop a process or a habit by imposing a freeze on it
> Word: suspended, Lexname: verb.change, Definition: make inoperative or stop
> Word: suspended, Lexname: verb.stative, Definition: render temporarily ineffective
>
You can pass wn.NOUN, wn.VERB, wn.ADJ, or wn.ADV as the pos argument.
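Once you have inspected the lexnames that matter for your data, you can collect them in one helper instead of an inline list, and reuse it for nouns, verbs, and adjectives alike. The blocklist below is an assumption to tune against your own queries, not a definitive set:

```python
# Hypothetical blocklist of abstract (hard-to-depict) lexical categories;
# extend it as you inspect lexnames for your own search terms.
ABSTRACT_LEXNAMES = {
    "noun.attribute", "noun.cognition", "noun.communication",
    "noun.state", "verb.social", "verb.stative",
}

def sense_is_visual(lexname: str) -> bool:
    """True when a WordNet lexname is not in the abstract blocklist."""
    return lexname not in ABSTRACT_LEXNAMES

print(sense_is_visual("noun.food"))    # a red apple is depictable
print(sense_is_visual("verb.social"))  # 'suspended' (bar temporarily) is not
```

Inside is_visually_searchable you would then call sense_is_visual(synset.lexname()) instead of comparing against an inline list, and widen the POS check beyond NOUN/PROPN so that a word like 'suspended' gets inspected too.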