python json string replace

Formatting JSON for Python - need to remove \"


I've got some JSON that looks like this:

{"name": "John",
 "description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
 "remark": "\"This is a test\" he mentioned."}

And the \" instances are breaking json.loads().

import json

json_string = '''{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}'''
data = json.loads(json_string)

print(data)

raises: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 27 (char 43)

I feel like I've tried every regex under the sun to target these instances (but leave all the other double quotes, not preceded by a backslash) and replace them with an empty string (functionally just strip them). If anyone has tips I'd appreciate it.

My implementation right now is something like:

import re

# Define a regular expression pattern to match \" within a string
pattern = r'\\"'

# Use re.sub to replace all occurrences of the pattern with an empty string
cleaned_string = re.sub(pattern, '', json_string)

print(cleaned_string)

But when I run this in a REPL, nothing changes.

For reference, I'd just like the output to be:

{"name": "John",
 "description": "I'm just A BOY okay? He said Hello, World! to everyone.",
 "remark": "This is a test he mentioned."}

Edit: for clarity, this is just an example of the nature of the input data I'm working with. It's coming from AWS CloudWatch logs, so I don't have an easy way to manipulate the input before pulling it into Python. For example, part of the payload is something like

"\"Girl Let's Talk\" Virtual 90s Kickback"

In context:

{"search_ads": [ {"event_id": "4838383", "ad_id": "1112", "budget_amount": 5.0, "currency": "USD", "marketplace": "Online_US", "score": 18.205433, "p_click": 0.0, "p_order": 0.0, "goal": 2, "category_id": 113, "subcategory_id": 13999, "format": null, "is_paid": false, "online_event": true, "event_start_date": "2024-06-28T00:00:00Z", "latitude": null, "longitude": null, "name": "\"Girl Let's Talk\" Virtual 90s Kickback", "vip_status": false, "is_participant": true}]}

so the \" characters are really the only problem - if I copy all that input into VS Code and just search for/delete that pattern, json.loads() works great as is.

As one commenter mentioned, I think what I'm looking for is a regex that will match and strip the pattern \", but I've had no luck with that so far! I've only been able to strip either the backslashes, which leaves me with double quotes that break json.loads() (expecting a delimiter, i.e. it thinks this is another JSON key/value pair), or all the double quotes, which of course breaks it completely.


Solution

  • You do not need to remove \". It's part of the data.*

    What you're having a problem with is Python's interpretation of string literals. The sequence \" is an escape sequence that turns into just ".

    >>> '\"'
    '"'
    

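To see why the regex in the question has nothing to replace, it can help to compare the two kinds of literal side by side. This is a minimal sketch; the names plain and raw are only for illustration and are not from the question.

```python
import re

# Without the r prefix, Python turns each \" into a bare " while parsing the literal,
# so no backslash ever reaches the string.
plain = '{"remark": "\"This is a test\" he mentioned."}'

# With the r prefix, the backslashes are kept as written.
raw = r'{"remark": "\"This is a test\" he mentioned."}'

print('\\' in plain)  # False
print('\\' in raw)    # True

# The pattern from the question matches a backslash followed by a double quote,
# so it can only find matches in the raw version.
pattern = r'\\"'
print(len(re.findall(pattern, plain)))  # 0
print(len(re.findall(pattern, raw)))    # 2
```

In other words, the pattern itself is fine; the non-raw literal simply never contained the backslashes it was looking for.
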
    This can be solved with a raw string (r prefix).

    import json
    
    json_string = r'''
    {"name": "John",
     "description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
     "remark": "\"This is a test\" he mentioned."}
    '''
    
    data = json.loads(json_string)
    
    print(data['description'])
    

    Output:

    I'm just "A BOY" okay? He said "Hello, World!" to everyone.
    

    However, you might prefer to put the JSON in a separate file and use json.load(), to avoid having to muck around with string literals at all.
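
A rough sketch of the file-based approach follows. The file name payload.json and the temporary directory are assumptions made so the example runs on its own; in practice the file would be whatever you exported from the logs, and the write step would not exist.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for a payload saved from the logs.
    path = Path(tmp) / "payload.json"
    path.write_text(
        r'''{"name": "\"Girl Let's Talk\" Virtual 90s Kickback"}''',
        encoding="utf-8",
    )

    # json.load() reads the bytes as they are on disk, so Python's
    # string-literal escaping never gets involved.
    with path.open(encoding="utf-8") as f:
        data = json.load(f)

print(data["name"])  # "Girl Let's Talk" Virtual 90s Kickback
```

The same applies to text that arrives from an API call or a log stream at runtime: the backslashes are already in the string, so it can usually be passed to json.loads() unchanged.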


    * To be more precise, it's part of the JSON. In a JSON string, \" represents ", which is the raw data.
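
If you really do want the quotes gone from the values, as in the desired output in the question, one option is to parse first and strip afterwards, rather than editing the JSON text. This is only a sketch; strip_quotes is a hypothetical helper, not part of the json module.

```python
import json

json_string = r'''{"name": "John",
 "description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
 "remark": "\"This is a test\" he mentioned."}'''


def strip_quotes(value):
    """Recursively remove double quotes from every decoded string value."""
    if isinstance(value, str):
        return value.replace('"', '')
    if isinstance(value, list):
        return [strip_quotes(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_quotes(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


cleaned = strip_quotes(json.loads(json_string))

print(cleaned["description"])  # I'm just A BOY okay? He said Hello, World! to everyone.
print(cleaned["remark"])       # This is a test he mentioned.
```

Working on the decoded values means the structural quotes around keys and strings are never at risk, which is the part a regex over the raw text tends to get wrong.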