elasticsearchopensearchamazon-opensearch

What is the ideal structure for storing and querying nested data in elasticsearch?


I'm running a query on elasticsearch using function_score where the value of a nested field is used for the calculation (price in this case). Which of the following is a better way to index the data?

{
      "name": "John",
      "age": 50,
      "country": "US",
      "subscription": {
        "Plan1": {
          "price": 100,
          "date": "June 5th"
        },
        "Plan2": {
          "price": 50,
          "date": "June 6th"
        }
      }
    }

OR

{
  "name": "John",
  "age": 50,
  "country": "US",
  "subscription": [
    {
      "name": "Plan1",
      "price": 100,
      "date": "June 5th"
    },
    {
      "name": "Plan2"
      "price": 50,
      "date": "June 6th"
    }
  ]
}

The query would be filtering on the "plan name" and "price", and "price" will be used for score calculation. The number of plans maybe upwards of 20.

Edit 1: Sample query for approach 2

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "createdatutc": {
                  "gte": "2022-11-01T00:00:00.000Z",
                  "lt": "2023-05-06T00:00:00.000Z",
                  "format": "strict_date_optional_time"
                }
              }
            },
            {
              "terms": {
                "country": [
                  "US"
                ]
              }
            },
            {
              "term": {
                "subscription.name": {
                  "value": "Plan1"
                } 
              }
            }
          ]
        }
      },
      "functions": [
        {
          "filter": {
            "query_string": {
              "default_field": "name",
              "query": "\"john\""
            }
          },
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "for (item in params._source.subscription) {if (item.name == 'Plan1') {return item.price}}"
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}

Solution

  • It depends on how many plans you have. If it's just two or a handful then the first option is better, otherwise you need to make subscription a nested object and nested object are less optimal in terms of query performance.

    With the first option, filtering on the plan name and price can be done with a single condition on subscription.Plan1.price: 100, while with the second option, you need two conditions (hence why subscription needs to be nested), one on subscription.name: Plan1 and another on subscription.price: 100

    UPDATE 1: using option 1

    {
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "createdatutc": {
                      "gte": "2022-11-01T00:00:00.000Z",
                      "lt": "2023-05-06T00:00:00.000Z",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "terms": {
                    "country": [
                      "US"
                    ]
                  }
                },
                {
                  "exists": {
                    "field": "subscription.Plan1.price"
                  }
                }
              ]
            }
          },
          "functions": [
            {
              "filter": {
                "query_string": {
                  "default_field": "name",
                  "query": "\"john\""
                }
              },
              "field_value_factor": {
                "field": "subscription.Plan1.price",
                "factor": 1.2
              }
            }
          ],
          "score_mode": "sum",
          "boost_mode": "replace"
        }
      }
    }