python json string google-gemini google-generativeai

Using REGEX to Handle Nested Double Quotes in JSON Strings in Python


I'm using a generative AI API to return text responses as JSON strings, which I intend to feed into an application in real time. The problem is that the JSON returned by the API often contains small errors, most commonly with double quotes. These syntax errors raise exceptions in my Python code when I parse the string as JSON.

For instance, I have the following JSON string:
'{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

As you can see, the value for "test" contains several unescaped double quotes, so parsing the string as JSON raises an error. What I want to do is use a regex to convert the inner double quotes to single quotes, so that the result looks as follows:
'{"test":"this is 'test' of 'a' test'", "result": "your result is 'out' in our website"}'

The best I can do is as follows:

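import re

text = '{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

def repl_call(m):
    preq = m.group(1)   # structural character (and whitespace) before the opening quote
    qbody = m.group(2)  # string body, possibly containing stray double quotes
    qbody = re.sub(r'"', "'", qbody)
    return preq + '"' + qbody + '"'

print(re.sub(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_call, text))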

The code above successfully returns the intended result. However, if I add a comma inside the value, such as
{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}

...the code breaks and returns the following:
'{"test":"this is 'test' of 'a", test"","result":"your result is 'out' in our website"}'

:(

So far I have tried improving my prompt (prompt engineering) to avoid the double quotes and return only a valid JSON string. This works to some degree, but I still hit enough syntax errors that I have to retry the same prompt multiple times, which incurs unnecessary delays and costs.

My question is: is there a common function or regex pattern I can apply in Python to fix my JSON string so that syntax errors, specifically misplaced double quotes, are cleaned up?

I'm open to a variety of suggestions, including Python packages that deal with JSON string cleansing, and any advice on GenAI tools that enforce JSON output. I currently use Gemini, which I like a lot, but it doesn't seem to allow JSON enforcement as explicitly as OpenAI's API does.


Solution

  • If you are requesting JSON back, you should set response_mime_type in the generation config; you should then not have these issues when parsing the JSON.

    from dotenv import load_dotenv
    import google.generativeai as genai
    import os
    
    load_dotenv()
    genai.configure(api_key=os.environ['API_KEY'])
    MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']
    
    model = genai.GenerativeModel(
        model_name=MODEL_NAME_LATEST,
        # Set the `response_mime_type` to output JSON
        generation_config={"response_mime_type": "application/json"})
    
    prompt = """
      List 5 popular cookie recipes.
      Using this JSON schema:
        Recipe = {"recipe_name": str}
      Return a `list[Recipe]`
      """
    
    response = model.generate_content(prompt)
    print(response.text)
    

    Just make sure that the JSON schema you give it in the prompt is itself valid JSON, with all commas where they should be, or the model may build its output incorrectly.
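With JSON mode on, the text of the response can be handed to `json.loads` directly. The sketch below is one way to parse and sanity-check it. The helper name `parse_recipes` and the `sample` string are assumptions for illustration; in real use the input would be `response.text` from the call above.

```python
import json


def parse_recipes(raw_text):
    """Parse JSON-mode output and return the recipe names it contains."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Surface a clear error so the caller can decide whether to retry.
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of recipes")
    names = []
    for item in data:
        if not isinstance(item, dict) or "recipe_name" not in item:
            raise ValueError(f"unexpected list entry: {item!r}")
        names.append(item["recipe_name"])
    return names


# Hypothetical stand-in for response.text.
sample = '[{"recipe_name": "Chocolate Chip"}, {"recipe_name": "Oatmeal Raisin"}]'
print(parse_recipes(sample))
```

Checking the shape after parsing keeps a malformed or unexpected reply from propagating into the application.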

    response schema

    Another option would be to use response schema.

    from dotenv import load_dotenv
    import google.generativeai as genai
    import os
    import typing_extensions as typing
    
    load_dotenv()
    genai.configure(api_key=os.environ['API_KEY'])
    MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']
    
    
    class Recipe(typing.TypedDict):
        recipe_name: str
    
    
    model = genai.GenerativeModel(
        model_name=MODEL_NAME_LATEST,
        # Set the `response_mime_type` to output JSON
        # Pass the schema object to the `response_schema` field
        generation_config={"response_mime_type": "application/json",
                           "response_schema": list[Recipe]})
    
    prompt = "List 5 popular cookie recipes"
    
    response = model.generate_content(prompt)
    print(response.text)
    

    See "JSON mode" in the Gemini API documentation.
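For output that still arrives malformed (for instance from a model or endpoint without JSON mode), the regex from the question can be swapped for a small character scanner. This is a heuristic sketch, not a general JSON repairer; the name `repair_inner_quotes` and the closing rule are assumptions for illustration. A double quote inside a string is treated as the closing quote only when what follows looks like JSON structure (end of input, `:`, `}`, `]`, or a comma followed by the start of a new value). Every other inner quote is rewritten as a single quote.

```python
import json
import re

# What may legally follow a closing quote: end of input, ':', '}', ']',
# or ',' followed by the start of another JSON value.
_CLOSER = re.compile(r'\s*(?:\Z|[:}\]]|,\s*(?:["{\[]|-?\d|true\b|false\b|null\b))')


def repair_inner_quotes(text):
    """Replace stray double quotes inside JSON string values with single quotes."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True  # opening quote of a key or value
            out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            out.append(text[i:i + 2])  # keep existing escape sequences intact
            i += 1
        elif ch == '"':
            if _CLOSER.match(text, i + 1):
                in_string = False  # followed by structure: a real closing quote
                out.append(ch)
            else:
                out.append("'")  # stray quote inside the value
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}'
print(json.loads(repair_inner_quotes(broken)))
```

A value that itself contains text resembling JSON structure, such as a quote followed by a comma and another quote, will still fool this heuristic, so JSON mode remains the more reliable fix.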