I’m working with LlamaIndex in Python and ran into an issue with metadata filtering.
I have a TextNode that includes a metadata field explicitly set to None.
When I try to retrieve it with a metadata filter whose value is None, no nodes are returned.
I expected that nodes whose metadata value is None would match such a filter.
Here's an MRE:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)
node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": None},
)
doc_index = VectorStoreIndex([node_01])
# Debug: Check what's actually stored
print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])
filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value=None)
filters = MetadataFilters(filters=[filter_null_start_date])
retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
nodes = retriever.retrieve("this")
print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])
Output:
Index nodes:
[{'start_date': None}]
Retrieved nodes:
[]
So even though the metadata is stored as {'start_date': None}, filtering with EQ value=None does not return the node.
My questions:
Any clarification or workaround would be appreciated.
Correct, None is not filterable in LlamaIndex; that's the expected behavior. You can try one of the following workarounds.
Instead of None, store the value as a string, i.e. "None":
node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": "None"},
)
filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value=str(None))
Alternatively, store an empty string "" and filter on that:
node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": ""},
)
filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value="")
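If many nodes can carry None values, it may be easier to normalize the metadata in one place before building the nodes. Below is a minimal sketch of that idea in plain Python. The names `NULL_SENTINEL`, `encode_null_metadata`, and `decode_null_metadata` are hypothetical helpers made up for this illustration; they are not part of LlamaIndex.

```python
from typing import Any, Dict

# Hypothetical sentinel: the same string that str(None) produces.
NULL_SENTINEL = "None"


def encode_null_metadata(
    metadata: Dict[str, Any], sentinel: str = NULL_SENTINEL
) -> Dict[str, Any]:
    """Return a copy of metadata with every None value replaced by the sentinel."""
    return {key: sentinel if value is None else value for key, value in metadata.items()}


def decode_null_metadata(
    metadata: Dict[str, Any], sentinel: str = NULL_SENTINEL
) -> Dict[str, Any]:
    """Return a copy of metadata with every sentinel value turned back into None."""
    return {key: None if value == sentinel else value for key, value in metadata.items()}


raw = {"start_date": None, "author": "alice"}
stored = encode_null_metadata(raw)
print(stored)                               # {'start_date': 'None', 'author': 'alice'}
print(decode_null_metadata(stored) == raw)  # True
```

You would pass `encode_null_metadata(...)` as the `metadata` argument of `TextNode` and filter with `value=NULL_SENTINEL`. One caveat: a field that legitimately contains the string "None" becomes indistinguishable from a missing value, so pick a sentinel that cannot occur in your data.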
Full example:
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import (
MetadataFilter,
MetadataFilters,
FilterOperator,
)
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
embed_model = OllamaEmbedding(
model_name="llama3.2",
base_url="http://localhost:11434"
)
# 2) Tell LlamaIndex to use this embedder globally
Settings.embed_model = embed_model
# using metadata={"start_date": "None"}
node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": "None"},
)
node_02 = TextNode(
    text="This document has start date in the metadata",
    id_="node_02",
    metadata={"start_date": "20/03/2023"},
)
doc_index = VectorStoreIndex([node_01, node_02])
# Debug: Check what's actually stored
print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])
filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value=str(None))
filters = MetadataFilters(filters=[filter_null_start_date])
retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
nodes = retriever.retrieve("this")
print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])
Output:
Index nodes:
[{'start_date': 'None'}, {'start_date': '20/03/2023'}]
Retrieved nodes:
[('node_01', {'start_date': 'None'})]
# using metadata={"start_date": ""}
node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": ""},
)
node_02 = TextNode(
    text="This document has start date in the metadata",
    id_="node_02",
    metadata={"start_date": "20/03/2023"},
)
doc_index = VectorStoreIndex([node_01, node_02])
# Debug: Check what's actually stored
print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])
filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value="")
filters = MetadataFilters(filters=[filter_null_start_date])
retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
nodes = retriever.retrieve("this")
print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])
Output:
Index nodes:
[{'start_date': ''}, {'start_date': '20/03/2023'}]
Retrieved nodes:
[('node_01', {'start_date': ''})]