pythonregexregex-recursion

Regex Expression to capture repeated patterns


I've been running around internet trying to find out how to build a regular expression to capture text in the way I need it; so I saw some StackOverflow questions but none of them express what I want, but if you already saw something similar to my issue here, pelase feel free to pointme to that article...

I tried to use recursion but it seems I'm not good enough to get something to work

Some notes:

1) I can't use a parse program because the program that will use this data will use regular expression to capture it, and this program is a "general purpose" program that in fact is capturing any data that is needed, only thing I need to do is give proper regular expression to get information it needs, also I need to keep it as copact as possible, so I can't use third party or external programs.

2) The pair 'key': 'value' can vary, they are not always the same number of pairs... that is what make it difficult I believe.

3) Program that is going to use this regex is created in Python 2.7.3: How this program works: it uses a Json config file where I can setup command I want to run that will give to me data I need, then I specify a regex to teach to the program what need to be captured and how to handle it ie: what to do with the groups that get captured... so that is why I can't use a parser. This program uses fabric to run configued collector(with the regex) to remote hosts and gather all data...

4) Program is used to gather data to post them into a webserver and get metrics and other stuff like graphs and monitor alarms etc

I have been able to capture almost all data I was planing to capture but when I was trying to create a collector for this then I got stuck..

The following data repeats exactly like below but with different server names, of course values will change too:

Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}


Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}

How I want to capture it:

Server: Omega-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 trx: 0
 points: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

Server: Alfa-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 trx: 0
 points: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

If a unique server is shown, then is not so difficult, using the below regex I'm able to capture all (except name of server):

'([a-z_]+)':\s'(\d+)'

This regex will give only the second part, which is the list of variables and values, but not the Server name... so if I get on same output several servers with the same data, then will be impossible to know from which server the values are coming from...

If I try to add support for the server name: I've tried follwoing regex, it works but only capture Server name, and first pair of parameters:

Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*

I had tried multiple recursion features but I've failed to achieve what I want.

Can anyone point me to right direction here...?

Thanks.


Solution

  • You want key-value ? with python I would use the dictionary.

    1. get the server name and the string containing the data:
      Server: ([^\n]*)(?:[^{]*)\{(.*)\}

    2. build a dict with the string containing the data for each server:

    With python (you only need import re statement):

    input = """Server: Omega-X
    celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}
    
    Server: Alfa-X
    celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}"""
    
    
    for match in re.findall(r'Server: ([^\n]*)(?:[^{]*)\{(.*)\}', input):
        server = match[0]
        data = match[1]
        datadict = dict((k.strip().replace("'", ""), v.strip().replace("'", "")) for k,v in (item.split(':') for item in data.split(',')))
        datadict['serveur'] = server
    

    Then you can store each datadict (e.g. in a list) and use then as you want. You can cast the values from string to integer to manipulate them easily.