Tags: python, llama-index

LlamaIndex Python: Metadata filter with `None` value does not retrieve documents


I’m working with LlamaIndex in Python and ran into an issue with metadata filtering.

I have a TextNode whose metadata includes a field explicitly set to None. When I try to retrieve it with a metadata filter whose value is None, no nodes are returned. I expected nodes with a None metadata value to match such a filter.

Here's an MRE:

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

node_01 = TextNode(
    text="This document has None in the metadata",
    id_="node_01",
    metadata={"start_date": None},
)

doc_index = VectorStoreIndex([node_01])

# Debug: Check what's actually stored
print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])

filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value=None)
filters = MetadataFilters(filters=[filter_null_start_date])
retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
nodes = retriever.retrieve("this")

print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])

Output:

Index nodes:
 [{'start_date': None}]
Retrieved nodes:
 []

So even though the metadata is stored as {'start_date': None}, filtering with EQ value=None does not return the node.

My question: is this the expected behavior? Any clarification or workaround would be appreciated.


Solution

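  • Correct: None is not filterable in LlamaIndex, and that is the expected behavior. As a workaround, store a placeholder value instead of None, such as the string "None" or an empty string, and filter on that placeholder: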

    example:

    from llama_index.core.schema import TextNode
    from llama_index.core.vector_stores import (
        MetadataFilter,
        MetadataFilters,
        FilterOperator,
    )
    from llama_index.core import VectorStoreIndex, Settings
    from llama_index.embeddings.ollama import OllamaEmbedding
    
    # Embedding model served by a local Ollama instance
    embed_model = OllamaEmbedding(
        model_name="llama3.2",
        base_url="http://localhost:11434",
    )
    
    # Tell LlamaIndex to use this embedder globally
    Settings.embed_model = embed_model
    
    # using metadata={"start_date": "None"}
    node_01 = TextNode(
        text="This document has None in the metadata",
        id_="node_01",
        metadata={"start_date": "None"},
    )
    node_02 = TextNode(
        text="This document has start date in the metadata",
        id_="node_02",
        metadata={"start_date": "20/03/2023"},
    )
    
    doc_index = VectorStoreIndex([node_01, node_02])
    
    # Debug: Check what's actually stored
    print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])
    
    filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value=str(None))
    filters = MetadataFilters(filters=[filter_null_start_date])
    retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
    nodes = retriever.retrieve("this")
    
    print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])
    

    output:

    Index nodes:
     [{'start_date': 'None'}, {'start_date': '20/03/2023'}]
    Retrieved nodes:
     [('node_01', {'start_date': 'None'})]
    
    # using metadata={"start_date": ""} (same imports and embed_model as in the first example)
    node_01 = TextNode(
        text="This document has None in the metadata",
        id_="node_01",
        metadata={"start_date": ""},
    )
    node_02 = TextNode(
        text="This document has start date in the metadata",
        id_="node_02",
        metadata={"start_date": "20/03/2023"},
    )
    
    doc_index = VectorStoreIndex([node_01, node_02])
    
    # Debug: Check what's actually stored
    print("Index nodes:\n", [node.metadata for node in doc_index.docstore.docs.values()])
    
    filter_null_start_date = MetadataFilter(key="start_date", operator=FilterOperator.EQ, value="")
    filters = MetadataFilters(filters=[filter_null_start_date])
    retriever = doc_index.as_retriever(filters=filters, similarity_top_k=1)
    nodes = retriever.retrieve("this")
    
    print("Retrieved nodes:\n", [(node.node_id, node.metadata) for node in nodes])
    

    output:

    Index nodes:
     [{'start_date': ''}, {'start_date': '20/03/2023'}]
    Retrieved nodes:
     [('node_01', {'start_date': ''})]
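
If many nodes can carry None values, the substitution above can be applied in one place before the nodes are built. The following is only a sketch of that idea in plain Python: the helper names `encode_missing` and `decode_missing` and the `MISSING` constant are hypothetical and not part of LlamaIndex, and the sketch assumes the sentinel string never occurs as a real metadata value.

```python
from typing import Any, Dict

# Hypothetical sentinel (an assumption, not a LlamaIndex constant).
# Pick a string that cannot collide with a real metadata value.
MISSING = "None"


def encode_missing(metadata: Dict[str, Any], sentinel: str = MISSING) -> Dict[str, Any]:
    """Return a copy of the metadata with None values replaced by the sentinel."""
    return {key: sentinel if value is None else value for key, value in metadata.items()}


def decode_missing(metadata: Dict[str, Any], sentinel: str = MISSING) -> Dict[str, Any]:
    """Return a copy of the metadata with sentinel values turned back into None."""
    return {key: None if value == sentinel else value for key, value in metadata.items()}


raw = {"start_date": None, "author": "alice"}
stored = encode_missing(raw)        # what would be passed as TextNode(metadata=...)
restored = decode_missing(stored)   # what the application sees after retrieval

print(stored)
print(restored == raw)
```

With this in place, a node would be created with `metadata=encode_missing(raw)` and the filter would use `value=MISSING`, as in the examples above. Retrieved metadata can be passed through `decode_missing` if the rest of the application expects real None values.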