I have a legacy data format which looks similar to JSON. An example is shown in the code block below:
```
 1  {
 2    base = {
 3      attributeNumericName = V1D,
 4      attributeName = hello,
 5      attributeFloat = 1.0,
 6      attributeQuotes = "a\b",
 7      canContainLists = [
 8        1.0,
 9        2.0
10      ],
11      canContainStackedData = [
12        {
13          stackedAttribute = 1.0
14        }
15      ]
16    }
17  }
```
To be able to read this in efficiently in Python, I am converting the text using regex replacements. I am using the following three replacements:
| Action | Source | Replacement | Affects line(s) |
|---|---|---|---|
| Replace all attributes with numeric values | `([\w]{1,}) = ([\d\.]{1,},?)` | `"\1": \2` | 5, 13 |
| Replace all attributes with string values | `([\w]{1,}) = ([\w]{1,})` | `"\1": "\2"` | 3, 4 |
| Replace all remaining attributes | `([\w\d]{1,}) = ` | `"\1": ` | 2, 6, 7, 11 |
Because these searches have to pass through the contents three times, they are quite slow. In total I need to process several terabytes of files, so I am looking to optimize this. I've been trying to figure out how to combine the three replacements into one, but without any success so far. Any help would be very much appreciated.
After looking at your sample
```
{
  base = {
    attributeNumericName = V1D,
    attributeName = hello,
    attributeFloat = 1.0,
    canContainLists = [
      1.0,
      2.0
    ],
    canContainStackedData = [
      {
        stackedAttribute = 1.0
      }
    ]
  }
}
```
it looks YAML-like to me, but using `=` rather than `:`. I saved it as `file.txt` and then ran
```python
import yaml  # pip install pyyaml

with open("file.txt", "r") as f:
    document = f.read().replace(" = ", ": ")
data = yaml.safe_load(document)
print(data['base']['attributeName'])
print(data['base']['attributeFloat'])
print(type(data['base']['attributeFloat']))
```
which gives the output
```
hello
1.0
<class 'float'>
```
Compared to your original approach this requires an external dependency, namely PyYAML, but does not use the `re` module at all. I am unable to say which solution will be faster based on just one case, so before adopting this approach, please select a reasonable number of examples from your data heap and measure run time against the alternative solutions.
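One way that measurement could be set up is with `timeit` on an in-memory sample, comparing the three-pass regex conversion (fed into `json.loads`, an assumption about your downstream parser) against the PyYAML route. The sample string and iteration count below are placeholders; substitute real files from your data set:

```python
import json
import re
import timeit

import yaml  # pip install pyyaml

# Hypothetical miniature sample; replace with real files from the data set.
sample = """{
base = {
attributeName = hello,
attributeFloat = 1.0,
canContainLists = [
1.0,
2.0
]
}
}"""

def via_regex(text=sample):
    # The three-pass regex conversion from the question, fed into json.loads.
    text = re.sub(r'(\w+) = ([\d.]+,?)', r'"\1": \2', text)
    text = re.sub(r'(\w+) = (\w+)', r'"\1": "\2"', text)
    text = re.sub(r'(\w+) = ', r'"\1": ', text)
    return json.loads(text)

def via_yaml(text=sample):
    # The PyYAML-based conversion from this answer.
    return yaml.safe_load(text.replace(" = ", ": "))

assert via_regex() == via_yaml()  # both routes must parse to the same data

for name, fn in [("regex+json", via_regex), ("yaml", via_yaml)]:
    print(name, timeit.timeit(fn, number=1000))
```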
(tested in PyYAML 5.4.1)
EDIT: after learning that this solution is too slow for your use case, I investigated how to ameliorate that and found `fastyaml`, which can be used the following way:
```python
import fastyaml  # pip install fastyaml

with open("file.txt", "r") as f:
    document = f.read().replace(" = ", ": ")
data = fastyaml.loads(document)
print(data['base']['attributeName'])
print(data['base']['attributeFloat'])
print(type(data['base']['attributeFloat']))
```
This gives the same output as above, and my brief experiments showed that `fastyaml` is two orders of magnitude faster than PyYAML.
(fastyaml 0.1.0)