elasticsearch, aws-elasticsearch, opensearch

OpenSearch: how to properly compute k-NN index size


I am using the OpenSearch service in AWS for my research.

Task: I want to compute the index size for N records in the index.

Input: I have only one node in AWS [r6g.4xlarge.search] with 128 GB RAM. The index definition is:

{
    "settings": {
        "index": {
            "knn":                           true,
            "knn.space_type":                "cosinesimil",
            "number_of_replicas":            0,
            "refresh_interval":              -1,
            "translog.flush_threshold_size": "10gb"
        }
    },
    "mappings": {
        "properties": {
            "vector": {
                "type":      "knn_vector",
                "dimension": 512
            },
            "keyword1": {
                "type": "keyword"
            },
            "keyword2": {
                "type": "keyword"
            }
        }
    }
}
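
For completeness, a minimal sketch of how such an index is created from the Python client (the host, credentials, and INDEX_NAME are placeholders):

import elasticsearch

es = elasticsearch.Elasticsearch(hosts=["host"], http_auth=("admin", "admin"))

# The exact settings/mappings shown above
body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil",
            "number_of_replicas": 0,
            "refresh_interval": -1,
            "translog.flush_threshold_size": "10gb",
        }
    },
    "mappings": {
        "properties": {
            "vector": {"type": "knn_vector", "dimension": 512},
            "keyword1": {"type": "keyword"},
            "keyword2": {"type": "keyword"},
        }
    },
}
es.indices.create(index="INDEX_NAME", body=body)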

I see that after force merge + refresh I have 5 segments.
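
The calls I mean, sketched with the es client from the snippet above (max_num_segments=1 is an illustrative choice, not necessarily what I ran):

# Reduce the segment count (max_num_segments is a target, not a guarantee)
es.indices.forcemerge(index="INDEX_NAME", max_num_segments=1)
# Make the merged segments visible to search
es.indices.refresh(index="INDEX_NAME")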

The KNN stats look like:

{"_nodes":                    {"total": 1, "successful": 1, "failed": 0}, "cluster_name": "NAME",
     "circuit_breaker_triggered": false, "nodes": {
        "ID": {"miss_count":             7, "graph_memory_usage_percentage": 34.527355,
                                   "graph_query_requests":   475, "graph_memory_usage": 16981999,
                                   "cache_capacity_reached": false, "graph_index_requests": 5,
                                   "load_exception_count":   0, "load_success_count": 7, "eviction_count": 0,
                                   "indices_in_cache":       {
                                       "INDEX_NAME": {"graph_memory_usage_percentage": 34.527355,
                                                        "graph_memory_usage":            16981999,
                                                        "graph_count":                   5}},
                                   "script_query_errors":    0, "script_compilations": 0,
                                   "script_query_requests":  0, "graph_query_errors": 0, "hit_count": 468,
                                   "graph_index_errors":     0, "knn_query_requests": 95,
                                   "total_load_time":        57689947272, "script_compilation_errors": 0}}}

I found that the required amount of memory per vector in my case is 1.1 * (4 * dimension + 8 * M) bytes (the OpenSearch HNSW estimate). With dimension = 512 and M = 16 this gives 2393.6 bytes ≈ 0.0000023936 GB per record. Now I have 7885767 documents in the index, and graph_memory_usage is 16981999 KB ≈ 16 GB, i.e. ~34.5% of the available graph memory in use. So the actual usage is even a bit less than the formula predicts (7885767 × 2393.6 bytes ≈ 17.6 GiB).
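
Spelled out as a quick sketch (just the numbers above; I take graph_memory_usage to be reported in KB):

# Per-vector HNSW estimate: 1.1 * (4 * dimension + 8 * M) bytes
dimension, M = 512, 16
per_record_bytes = 1.1 * (4 * dimension + 8 * M)    # 2393.6 bytes
docs = 7885767
predicted_gib = per_record_bytes * docs / 1024**3   # ≈ 17.6 GiB
observed_gib = 16981999 * 1024 / 1024**3            # ≈ 16.2 GiB
print(predicted_gib, observed_gib)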

But if I back the available memory out of those numbers (≈16 GB at ~34.5% in use), it gives me ~50 GB for an instance with 128 GB RAM. According to the AWS docs (check the bottom line), OpenSearch itself takes up to 32 GB for the JVM heap, so 96 GB should be left. Can you explain to me how to write a formula that properly estimates the number of documents that fit in the index?
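
For reference, the capacity arithmetic this implies, as a sketch: I assume the k-NN circuit breaker default of 50% applies to the RAM left after the 32 GB heap (the 50% default is documented for the k-NN plugin; the exact accounting is my reading):

ram_gb = 128          # r6g.4xlarge.search
heap_gb = 32          # JVM heap cap on AWS OpenSearch
breaker = 0.5         # default knn.memory.circuit_breaker.limit

graph_budget_gb = (ram_gb - heap_gb) * breaker      # 48 GB, matching the ~50 GB seen
per_record_bytes = 1.1 * (4 * 512 + 8 * 16)         # 2393.6 bytes (formula above)
max_records = graph_budget_gb * 1024**3 / per_record_bytes
print(f"{max_records:,.0f}")                        # ≈ 21,531,672 records per node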


Solution

  • The answer is quite simple. If you are using only k-NN search (like me), you can simply increase the knn.memory.circuit_breaker.limit setting to utilize the maximum RAM of your machine.

    You can change it from Python (or via the Elasticsearch API):

    import elasticsearch
    
    # Connect to the cluster (replace the host and credentials with your own)
    es = elasticsearch.Elasticsearch(
        hosts=["host"],
        http_auth=(
            "admin",
            "admin",
        ),
        timeout=3600,
    )
    # Let the k-NN graphs use all RAM not reserved for the JVM heap
    res = es.cluster.put_settings(
        body={"persistent": {"knn.memory.circuit_breaker.limit": "100%"}}
    )
    print(res)
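
    Afterwards you can read the setting back and re-check the stats from the question; a small sketch with the same client (the stats path below is the OpenSearch k-NN plugin endpoint; older Open Distro clusters use /_opendistro/_knn/stats):

    # Confirm the persistent setting was applied
    print(es.cluster.get_settings()["persistent"])
    # Re-fetch the k-NN stats shown in the question
    print(es.transport.perform_request("GET", "/_plugins/_knn/stats"))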