amazon-redshift

JSON parsing errors when using the json_extract_path_text() function in a where clause


I have a Redshift table that contains columns with JSON objects. I am running into JSON parser failures while trying to execute queries against this table that apply filters on the JSON object content. While I am able to use json_extract_path_text() in the select list, the same function fails when used in the where clause.

Following is the error I see: Amazon Invalid operation: JSON parsing error;

When I look at the STL_ERROR table for more details, this is what I see:

    errcode: 8001
    context: JSON parsing error
    error: invalid json object null

Following is an example of the content in one such json column:

{"att1":"att1_val","att2":"att2_val","att3":[{"att3_sub1_1":"att3_sub1_1_val","att3_sub1_2":"att3_sub1_2_val"},{"att3_sub2_1":"att3_sub2_1_val","att3_sub2_2":"att3_sub2_2_val"}],"att4":"att4_val","att5":"att5_val"}

Now when I run the following query, it executes without any issues:

select
json_extract_path_text(col_with_json_obj,'att4') as required_val
from table_with_json_data;
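For reference, json_extract_path_text(col_with_json_obj, 'att4') behaves roughly like this Python sketch on the sample document above (the helper name here is illustrative, not a Redshift API):

```python
import json

# Sample JSON document from the question
doc = ('{"att1":"att1_val","att2":"att2_val",'
       '"att3":[{"att3_sub1_1":"att3_sub1_1_val","att3_sub1_2":"att3_sub1_2_val"},'
       '{"att3_sub2_1":"att3_sub2_1_val","att3_sub2_2":"att3_sub2_2_val"}],'
       '"att4":"att4_val","att5":"att5_val"}')

def extract_path_text(j, key):
    """Rough local analogue of json_extract_path_text for a single
    top-level key: parse the document and return the value as text."""
    return json.loads(j)[key]

print(extract_path_text(doc, 'att4'))  # att4_val
```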

But when I use json_extract_path_text() in the where clause, it fails with the error above:

select
json_extract_path_text(col_with_json_obj,'att4') as required_val
from table_with_json_data
where json_extract_path_text(col_with_json_obj,'att4') = 'att4_val';

Is there anything that I am using incorrectly or missing here?

P.S: I have another table with a similar schema, and the same queries run just fine on it. The only difference between the two tables is how the data is loaded - one uses a jsonpaths file in the copy options and the other uses json 'auto'.


Solution

  • This is the error you would receive if table_with_json_data contained even a single row in which the value of col_with_json_obj was the four-character string "null".

    To avoid errors like this in general, I'd recommend creating a Redshift UDF for validating JSON. The is_json() function described at http://discourse.snowplowanalytics.com/t/more-robust-json-parsing-in-redshift-with-python-udfs/197 has worked well for me:

    create or replace function is_json(j varchar(max))
      returns boolean
      stable as $$
        import json
        try:
          json.loads(j)
        except ValueError:
          return False
        return True
      $$ language plpythonu;
    

    Then you could add `and is_json(col_with_json_obj)` to the where clause of your query, and this class of errors can be avoided entirely.
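
    Putting it together (a sketch, assuming the is_json() UDF above has been created), the failing query becomes:

    ```sql
    select
    json_extract_path_text(col_with_json_obj,'att4') as required_val
    from table_with_json_data
    where is_json(col_with_json_obj)
    and json_extract_path_text(col_with_json_obj,'att4') = 'att4_val';
    ```

    With the guard in place, rows whose col_with_json_obj does not parse as JSON are filtered out instead of triggering the parsing error.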