I've been searching the internet for how to build a regular expression that captures text the way I need it. I found some StackOverflow questions, but none of them cover what I want. If you have already seen something similar to my issue, please feel free to point me to it.
I tried to use recursion, but I have not been able to get anything to work.
Some notes:
1) I can't use a parser, because the program that consumes this data captures it with regular expressions. It is a "general purpose" program that captures whatever data is needed; all I have to do is supply a proper regular expression for the information it needs. I also need to keep it as compact as possible, so I can't use third-party or external programs.
2) The number of 'key': 'value' pairs can vary; it is not always the same. I believe that is what makes this difficult.
3) The program that will use this regex is written in Python 2.7.3. How it works: it uses a JSON config file where I set up the command I want to run to produce the data I need, and then I specify a regex that tells the program what to capture and how to handle it, i.e. what to do with the captured groups. That is why I can't use a parser. The program uses Fabric to run the configured collector (with the regex) on remote hosts and gather all the data.
4) The program gathers data and posts it to a web server to produce metrics, graphs, monitoring alarms, and so on.
I have been able to capture almost all the data I was planning to capture, but I got stuck when trying to create a collector for this one.
The following data repeats exactly as shown below, but with different server names; the values will of course change too:
Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
How I want to capture it:
Server: Omega-X
transfer_data: 0
factor_a: 0
slow: 0
factor_b: 0
score_retry: 0
damage_factor_c: 0
voice_ud: 0
alarm_factors_bl: 0
telemetry_x: 0
endstream: 0
celery: 0
awl: 0
prs: 0
score: 0
feature_factors_xf: 0
feature_factors_dc: 0
Server: Alfa-X
transfer_data: 0
factor_a: 0
slow: 0
factor_b: 0
score_retry: 0
damage_factor_c: 0
voice_ud: 0
alarm_factors_bl: 0
telemetry_x: 0
endstream: 0
celery: 0
awl: 0
prs: 0
score: 0
feature_factors_xf: 0
feature_factors_dc: 0
If only a single server is shown, it is not so difficult. Using the regex below I'm able to capture everything (except the server name):
'([a-z_]+)':\s'(\d+)'
This regex gives only the second part, the list of variables and values, but not the server name. So if the same output contains several servers with the same data, it is impossible to know which server the values come from.
I tried to add support for the server name with the following regex. It works, but it only captures the server name and the first pair of parameters:
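For reference, a minimal sketch of how this pattern behaves with re.findall. The sample string is a shortened copy of one data line from above, not the full output:

```python
import re

# Shortened sample of one celery.queue_length line (abbreviated from the data above).
sample = "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0'}"

# Each match yields a (key, value) tuple; the server name is not part of it.
pairs = re.findall(r"'([a-z_]+)':\s'(\d+)'", sample)
print(pairs)
# [('transfer_data', '0'), ('factor_a', '0'), ('slow', '0')]
```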
Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*
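A minimal sketch of what this pattern returns, run with re.search on a shortened sample (abbreviated from the data above). It suggests why only one pair comes back: the ", " between pairs is not part of the repeated group, so the repetition stops after the first pair.

```python
import re

# Shortened sample (abbreviated from the data above).
sample = ("Server: Omega-X\n"
          "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0'}")

pattern = r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*"
match = re.search(pattern, sample)

# Groups: server name, the whole first pair, its key, its value.
captured = match.groups()
print(captured)
# ('Omega-X', "'transfer_data': '0'", 'transfer_data', '0')
```

Note that even if the separator were included in the repeated group, Python's re module keeps only the last repetition of a repeated capturing group, so a single pattern of this shape cannot return every pair.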
I have tried multiple recursion features, but I've failed to achieve what I want.
Can anyone point me in the right direction?
Thanks.
You want key-value pairs? With Python I would use a dictionary.
First, get the server name and the string containing the data:
Server: ([^\n]*)(?:[^{]*)\{(.*)\}
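As a rough illustration of this first step, the sketch below runs the pattern over a shortened two-server sample (abbreviated from the question's data). Each match is a (server name, text between the braces) tuple:

```python
import re

# Shortened two-server sample (abbreviated from the question's data).
sample = ("Server: Omega-X\n"
          "celery.queue_length: {'transfer_data': '0', 'factor_a': '0'}\n"
          "Server: Alfa-X\n"
          "celery.queue_length: {'transfer_data': '0', 'factor_a': '0'}")

# Group 1: server name (rest of the "Server:" line).
# Group 2: everything between the braces on the data line.
blocks = re.findall(r"Server: ([^\n]*)(?:[^{]*)\{(.*)\}", sample)
for server, data in blocks:
    print(server + " -> " + data)
```

Because "." does not match a newline by default, the second group stays within a single data line.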
Then build a dict from the string containing the data for each server.
With Python (you only need the import re statement):
import re

raw_output = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}"""

# Group 1 is the server name, group 2 is the text between the braces.
for match in re.findall(r'Server: ([^\n]*)(?:[^{]*)\{(.*)\}', raw_output):
    server = match[0]
    data = match[1]
    # Split the "'key': 'value'" items on commas, then on the colon, and strip the quotes.
    datadict = dict((k.strip().replace("'", ""), v.strip().replace("'", "")) for k, v in (item.split(':') for item in data.split(',')))
    datadict['serveur'] = server
Then you can store each datadict (e.g. in a list) and use them as you want. You can cast the values from strings to integers to manipulate them easily.
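A minimal sketch of that last step, under a few assumptions: the function name parse_queue_lengths and the "server"/"queues" layout of each entry are hypothetical choices for illustration, the inner pairs are extracted with the pair regex from the question instead of split(), and the sample values are made up so that the integer cast is visible.

```python
import re

# Outer pattern from the answer: server name and the text between the braces.
SERVER_RE = re.compile(r"Server: ([^\n]*)(?:[^{]*)\{(.*)\}")
# Inner pattern from the question: one 'key': 'value' pair.
PAIR_RE = re.compile(r"'([a-z_]+)':\s'(\d+)'")

def parse_queue_lengths(text):
    """Return one dict per server, with queue lengths cast to int (hypothetical helper)."""
    results = []
    for server, data in SERVER_RE.findall(text):
        queues = dict((key, int(value)) for key, value in PAIR_RE.findall(data))
        results.append({"server": server, "queues": queues})
    return results

# Made-up sample values, in the same layout as the question's data.
sample = ("Server: Omega-X\n"
          "celery.queue_length: {'transfer_data': '3', 'slow': '0'}\n"
          "Server: Alfa-X\n"
          "celery.queue_length: {'transfer_data': '1', 'slow': '7'}")

servers = parse_queue_lengths(sample)
for entry in servers:
    total = sum(entry["queues"].values())
    print("%s: total queue length %d" % (entry["server"], total))
```

Keeping the server name outside the queue dictionary avoids mixing a string entry in with the integer values.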