amazon-web-services, opensearch, amazon-opensearch

illegal_argument_exception "Invalid JSON in payload" from OpenSearch ingest pipeline although the model's _predict works


I have deployed a text embedding model in AWS OpenSearch using a connector. The following request in Dev Tools returns a valid response containing an embedding.

POST /_plugins/_ml/models/<model_id>/_predict
{
  "parameters": {
    "inputs": ["hello"]
  }
}

The response looks like this:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "embedding": [
              0.18129970133304596,
              -0.05622033402323723, ...

However, when I simulate indexing a document through my ingest pipeline that uses this model, like so

POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate?verbose=true
{
  "docs": [
    {
      "_index": "product-index",
      "_id": "1",
      "_source":{
        "product_text": "hello"
      }
    }
  ]
}

I get this error (response shortened):

            "root_cause": [
              {
                "type": "illegal_argument_exception",
                "reason": "Invalid JSON in payload"
              }
            ],

This is the processor in my pipeline; a sketch of the full create-pipeline request follows below.

"processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "product_text": "product_embedding"
        }
      }
    }
  ]
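
For reference, a full create-pipeline request with this processor would look roughly like the following (a sketch; the description text is illustrative):

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Generate embeddings for product_text with the deployed model",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "product_text": "product_embedding"
        }
      }
    }
  ]
}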

I suspect the pipeline is unable to parse my model endpoint's response. Where can I see what response format the ingest pipeline expects from the model?


Solution

  • The ingest pipeline expects the model response to be of the form

    {
      "inference_results": [
        {
          "output": [
            {
              "name": "sentence_embedding",
              "data_type": "FLOAT32",
              "shape": [
                768
              ],
              "data": [
                0.18129970133304596, ...
              ], ...
    

    This can be achieved with pre_process_function and post_process_function in the Connector API. Both functions are written in the Painless scripting language.

    I use the following connector actions, as shown in the reference blog post.

    "actions": [
          {
             "action_type": "predict",
             "method": "POST",
             "headers": {
                "content-type": "application/json"
             },
             "url": "<inference_url>",
             "pre_process_function": "\n    StringBuilder builder = new StringBuilder();\n    builder.append(\"\\\"\");\n    String first = params.text_docs[0];\n    builder.append(first);\n    builder.append(\"\\\"\");\n    def parameters = \"{\" +\"\\\"inputs\\\":[\" + builder + \"]}\";\n    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";",
             "post_process_function": "\n      def name = \"sentence_embedding\";\n      def dataType = \"FLOAT32\";\n      if (params.<YOUR ENDPOINT JSON OUTPUT LABEL> == null || params.<YOUR ENDPOINT JSON OUTPUT LABEL>.length == 0) {\n        return params.message;\n      }\n      def shape = [params.<YOUR ENDPOINT JSON OUTPUT LABEL>.length];\n      def json = \"{\" +\n                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n                 \"\\\"shape\\\":\" + shape + \",\" +\n                 \"\\\"data\\\":\" + params.<YOUR ENDPOINT JSON OUTPUT LABEL> +\n                 \"}\";\n      return json;\n    ",
             "request_body": "{\"inputs\": ${parameters.inputs}}"
          }
       ]
    

    Since my inference endpoint returns {"embedding": [...]}, I use embedding in place of <YOUR ENDPOINT JSON OUTPUT LABEL>.
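
    For readability, here are the two Painless scripts unescaped, with embedding already substituted for <YOUR ENDPOINT JSON OUTPUT LABEL>:

    // pre_process_function: wraps the first text document as {"parameters": {"inputs": ["..."]}}
    StringBuilder builder = new StringBuilder();
    builder.append("\"");
    String first = params.text_docs[0];
    builder.append(first);
    builder.append("\"");
    def parameters = "{" + "\"inputs\":[" + builder + "]}";
    return "{" + "\"parameters\":" + parameters + "}";

    // post_process_function: reshapes {"embedding": [...]} into the format the ingest pipeline expects
    def name = "sentence_embedding";
    def dataType = "FLOAT32";
    if (params.embedding == null || params.embedding.length == 0) {
      return params.message;
    }
    def shape = [params.embedding.length];
    def json = "{" +
               "\"name\":\"" + name + "\"," +
               "\"data_type\":\"" + dataType + "\"," +
               "\"shape\":" + shape + "," +
               "\"data\":" + params.embedding +
               "}";
    return json;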


    Additional context

    My sample payload for model/_predict is

    {
      "parameters": {
        "inputs": ["hello"]
      }
    }
    

    Reference - AWS Community Blog