python ndjson

How to get an unknown substring between two known substrings, within a giant string/file


I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...

One complication: the ndjson file is 1071552 characters long and consists of a single line, so "for line in file:" is pointless.

The best I found was this: How to find a substring of text with a known starting point but unknown ending point in python

but if I use that, the result obviously doesn't stop at Month and keeps writing out the whole remainder of the file, the same as with partition()[2].

Note that Month is only an example: customLabel has about 300 variants, and they are not listed anywhere (I'm actually doing this in order to list them...).

To give some details here's my script so far:

with open("file.ndjson","rt", encoding='utf-8') as ndjson:
    filedata = ndjson.read()
    x="customLabel"
    count=filedata.count(x)
    for i in range (count):
        if filedata.find(x)>0:
            print("Found "+str(i+1))

So right now it correctly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, so I can locate them far more easily and use replace() for translations later on.

I'd guess regex is the solution, but I'm pretty new to it, so I'm posting this question while I learn about it...


Solution

  • If you want to search for all (even nested) customLabel values like this:

    {"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
    

    you can use RegEx patterns with the re module

    import re
    
    label_values = []
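    # Capture either a quoted string (spaces allowed) or a bare value such as a number.
    # The original class [1-9a-zA-z] missed the digit 0 and matched stray punctuation via A-z.
    regex_pattern = r'"customLabel"\s*:\s*("[^"]*"|[^,}\s]+)'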
    with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
        for line in ndjson:
            values = re.findall(regex_pattern, line)
            label_values.extend(values)
    
    print(label_values) # ['"Month"', '23525235']
    
    # If you don't want the items to keep their surrounding quotation marks
    label_values = [i.strip('"') for i in label_values]
    print(label_values) # ['Month', '23525235']
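
If you would rather avoid regex for now, the same idea can be sketched with plain str.find() calls. The point is to pass a start position to find(), so each search continues after the previous match instead of returning the first occurrence every time, and to search for the closing delimiter as well so the slice stops at the end of the value. The helper name values_between and the sample string are made up for illustration, and the sketch assumes the file always writes "customLabel":" with no spaces around the colon and that values contain no escaped quotes.

```python
def values_between(text, start_marker, end_marker):
    """Collect every substring that sits between start_marker and end_marker."""
    found = []
    pos = 0
    while True:
        # Look for the next opening marker, starting where the last match ended.
        start = text.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        # The value ends at the first closing marker after the opening one.
        end = text.find(end_marker, start)
        if end == -1:
            break
        found.append(text[start:end])
        pos = end + len(end_marker)
    return found


# Hypothetical sample shaped like the fragment in the question.
sample = '{"customLabel":"Month"},"schema":"metric"},{"customLabel":"Year"}'
labels = values_between(sample, '"customLabel":"', '"')
print(labels)  # ['Month', 'Year']
```

Wrapping the result in set() or sorted(set(...)) would give the list of distinct labels, which is probably what you want when there are about 300 variants.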
    

    Note: If the file really is ndjson (one JSON object per line) and you only need customLabel as a top-level key rather than a nested search, it is better to parse each line with the json module and read the value of that key directly.

    import json
    
    label = "customLabel"
    label_values = []
    with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
        for line in ndjson:
            line_json = json.loads(line)
            # Look the key up once; skip lines that don't have it.
            value = line_json.get(label)
            if value is not None:
                label_values.append(value)
    
    
    print(label_values) # ['Month']
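
The json approach can also be extended to nested values by walking the parsed structure recursively, which avoids depending on the exact spacing or quoting in the raw text. The following is a minimal sketch, not a drop-in solution: find_key is a hypothetical helper name, and the sample line is the nested example from above rather than real data from the file.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the value itself may contain more matches.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


line = '{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}'
label_values = list(find_key(json.loads(line), "customLabel"))
print(label_values)  # ['Month', 23525235]
```

Unlike the regex version, the values keep their JSON types here, so the number comes back as an int rather than a string.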