python regex regex-replace

Regex optimization for converting legacy data to JSON


I have a legacy data format which looks similar to JSON. An example is shown in the code block below:

1  {
2    base = {
3      attributeNumericName = V1D,
4      attributeName = hello,
5      attributeFloat = 1.0,
6      attributeQuotes = "a\b",
7      canContainLists = [
8        1.0,
9        2.0
10      ],
11     canContainStackedData = [
12       {
13         stackedAttribute = 1.0
14       }
15     ]
16   }
17 }

To read this efficiently in Python, I convert the text using regex replacements. I use the following three replacements:

Action                                     | Source                      | Replacement | Affects line(s)
Replace all attributes with numeric values | ([\w]{1,}) = ([\d\.]{1,},?) | "\1": \2    | 5, 13
Replace all attributes with string values  | ([\w]{1,}) = ([\w]{1,})     | "\1": "\2"  | 3, 4
Replace all remaining attributes           | ([\w\d]{1,}) =              | "\1":       | 2, 6, 7, 11
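
For reference, the three replacements might be applied in Python roughly as follows. This is only a sketch: the patterns and replacement strings are taken from the table, while `REPLACEMENTS`, `legacy_to_json`, and `SAMPLE` are names made up for illustration, and the sample is the example above without its line numbers.

```python
import json
import re

# Patterns and replacement strings come from the table; the names used here
# (REPLACEMENTS, legacy_to_json, SAMPLE) are illustrative only.
REPLACEMENTS = [
    (re.compile(r'([\w]{1,}) = ([\d\.]{1,},?)'), r'"\1": \2'),  # numeric values
    (re.compile(r'([\w]{1,}) = ([\w]{1,})'), r'"\1": "\2"'),    # bare-word values
    (re.compile(r'([\w\d]{1,}) ='), r'"\1":'),                  # remaining keys
]


def legacy_to_json(text):
    """Apply each substitution in turn: one full pass over the text per pattern."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


# Raw string so that the backslash in "a\b" reaches the converter unchanged.
SAMPLE = r'''{
  base = {
    attributeNumericName = V1D,
    attributeName = hello,
    attributeFloat = 1.0,
    attributeQuotes = "a\b",
    canContainLists = [
      1.0,
      2.0
    ],
    canContainStackedData = [
      {
        stackedAttribute = 1.0
      }
    ]
  }
}'''

data = json.loads(legacy_to_json(SAMPLE))
print(data["base"]["attributeNumericName"])
print(data["base"]["canContainLists"])
```
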

Because these searches have to pass through the contents three times, they are quite slow. In total I need to investigate several terabytes of files. Therefore, I am looking to optimize this. I've been trying to figure out how to combine these three replacements in one, but without any success so far. Any help would be very much appreciated.
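
One possible shape for a combined replacement is a single pattern with an optional value group and a replacement function that decides how to quote. The sketch below is an assumption about what "one pass" could look like, not a measured improvement: `ASSIGNMENT`, `_rewrite`, and `legacy_to_json_single_pass` are hypothetical names, and it assumes every value is a number, a bare word, or something that starts with a quote, bracket, or brace.

```python
import json
import re

# One pattern for all three cases: a key, " = ", then optionally a number
# (group 2) or a bare word (group 3). Quoted strings, "[" and "{" match
# neither group and are left untouched after the key.
ASSIGNMENT = re.compile(r'(\w+) = (?:([\d.]+)|(\w+))?')


def _rewrite(match):
    """Build the JSON form of one `key = value` occurrence."""
    key, number, word = match.groups()
    if number is not None:
        return f'"{key}": {number}'    # numeric value stays unquoted
    if word is not None:
        return f'"{key}": "{word}"'    # bare word becomes a JSON string
    return f'"{key}": '                # value is a string, list, or object


def legacy_to_json_single_pass(text):
    """Convert the legacy text with a single regex pass."""
    return ASSIGNMENT.sub(_rewrite, text)


sample = r'''{
  base = {
    attributeNumericName = V1D,
    attributeFloat = 1.0,
    attributeQuotes = "a\b",
    canContainLists = [
      1.0,
      2.0
    ]
  }
}'''

converted = json.loads(legacy_to_json_single_pass(sample))
print(converted["base"]["attributeNumericName"])
print(converted["base"]["attributeFloat"])
```

A replacement function adds a Python call per match, so a single pass is not guaranteed to beat three plain substitutions; it would need to be timed on real files.
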


Solution

  • After looking at your sample

    {
      base = {
        attributeNumericName = V1D,
        attributeName = hello,
        attributeFloat = 1.0,
        canContainLists = [
          1.0,
          2.0
        ],
        canContainStackedData = [
          {
            stackedAttribute = 1.0
          }
        ]
      }
    }
    

    it looks YAML-like to me, except that it uses = rather than :. I saved it as file.txt and ran

    import yaml  # pip install pyyaml
    with open("file.txt", "r") as f:
        document = f.read().replace(" = ", ": ")
    data = yaml.safe_load(document)
    print(data['base']['attributeName'])
    print(data['base']['attributeFloat'])
    print(type(data['base']['attributeFloat']))
    

    gives output

    hello
    1.0
    <class 'float'>
    

    Compared to your original approach, this requires an external dependency, namely PyYAML, but does not use the re module at all. I cannot say which solution will be faster based on just one case, so before adopting this approach, please select a reasonable number of examples from your data and measure the run time against the alternative solutions.

    (tested in PyYAML 5.4.1)

    EDIT: after learning that this solution is too slow for your use case, I investigated how to ameliorate that and found fastyaml, which can be used in the following way

    import fastyaml  # pip install fastyaml
    with open("file.txt", "r") as f:
        document = f.read().replace(" = ", ": ")
    data = fastyaml.loads(document)
    print(data['base']['attributeName'])
    print(data['base']['attributeFloat'])
    print(type(data['base']['attributeFloat']))
    

    gives the same output as above, and my brief experiments showed that fastyaml is two orders of magnitude faster than PyYAML.

    (fastyaml 0.1.0)
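
    To compare run times on a sample of your own files, a small timeit harness along these lines may help. It is a sketch only: `time_candidates` is a hypothetical helper, and the two converters below are stdlib stand-ins so that it runs without extra packages; in practice you would register functions that call the YAML loaders or your regex replacements instead.

```python
import re
import timeit


def time_candidates(candidates, documents, repeat=5):
    """Return the best time in seconds for each candidate over all documents."""
    timings = {}
    for name, convert in candidates.items():
        def run(convert=convert):
            # One timed run converts every document once.
            for document in documents:
                convert(document)
        timings[name] = min(timeit.repeat(run, repeat=repeat, number=1))
    return timings


# Stand-in data and converters (assumptions for illustration only).
documents = ["{\n  base = {\n    attributeFloat = 1.0\n  }\n}\n"] * 100
key_pattern = re.compile(r"(\w+) = ")
candidates = {
    "str.replace": lambda doc: doc.replace(" = ", ": "),
    "re.sub": lambda doc: key_pattern.sub(r'"\1": ', doc),
}

timings = time_candidates(candidates, documents)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

    Taking the minimum of several repeats reduces noise from other processes; with terabytes of input, file reading may dominate, so it is worth timing that separately as well.
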