python beautifulsoup loggly

How to selectively ignore strings in a Python regex?


I have written a fairly basic system monitor for my router to track when the signal drops (and all the stats recorded at that time), because the otherwise excellent routerstatslite doesn't gather everything that I need.

Here's the Gist. Before I upload the data to Loggly, I want to sanitize it by removing the dB and Mbps suffixes wherever they appear.

https://gist.github.com/scottharman/6ca07a7c46ca09de3e3b2f0a5094d86e

script =  stats.findAll('script')[1]
pattern = re.compile('(\w+)="(.*?)Mbps\|dB"')
fields = dict(re.findall(pattern, script.text))
clean_fields = { k:v.strip() for k, v in fields.iteritems()}
if old_fields != clean_fields:
    logger.info(json.dumps(clean_fields))
old_fields = clean_fields
print clean_fields
sleep(5)

As I'm putting it straight into a dict, I want to discard Mbps or dB when found, but obviously what I've got isn't going to work. It would be tidier to strip those two strings from the 70-80 odd status lines while extracting the fields. Is that possible in a single regex?

Cheers

Sample input from script tag:

var conn_down="    13.35 Mbps";
var conn_up="     0.82 Mbps";
var line_down="    34.60 dB";
var line_up="    19.70 dB";
var noise_down="     6.10 dB";
var noise_up="     6.50 dB";

var sys_uptime="74523";
var lan_status="Link up";
var lan_txpkts="1294024";
var lan_rxpkts="2256747";
var lan_collisions="0";
var lan_txbs="10004";
var lan_rxbs="35259";
var lan_systime="74523";

Then the processed data looks like this:

u'noise_up': u'6.50 dB', u'lan_rxbs': u'35259', u'an_rxpkts': u'2857867', u'bgn_status': u'600M', u'lan_status0': u'100M/Full', 
u'lan_status3': u'1000M/Full', u'lan_status2': u'100M/Full', u'conn_up': u'0.82 Mbps',

Solution

  • You could use an optional non-capturing group to match ' Mbps' or ' dB':

    import re
    import pprint
    
    s = '''var conn_down="    13.35 Mbps";
    var conn_up="     0.82 Mbps";
    var line_down="    34.60 dB";
    var line_up="    19.70 dB";
    var noise_down="     6.10 dB";
    var noise_up="     6.50 dB";
    
    var sys_uptime="74523";
    var lan_status="Link up";
    var lan_txpkts="1294024";
    var lan_rxpkts="2256747";
    var lan_collisions="0";
    var lan_txbs="10004";
    var lan_rxbs="35259";
    var lan_systime="74523";'''
    
    pattern = re.compile(r'(\w+)=\"\s*(.*?)(?:\sMbps|\sdB)?\"')
    fields = dict(re.findall(pattern, s))
    pprint.pprint(fields)
    

    Output:

    {'conn_down': '13.35',
     'conn_up': '0.82',
     'lan_collisions': '0',
     'lan_rxbs': '35259',
     'lan_rxpkts': '2256747',
     'lan_status': 'Link up',
     'lan_systime': '74523',
     'lan_txbs': '10004',
     'lan_txpkts': '1294024',
     'line_down': '34.60',
     'line_up': '19.70',
     'noise_down': '6.10',
     'noise_up': '6.50',
     'sys_uptime': '74523'}
    

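    In the pattern above, (\w+)= captures one or more word characters (letters, digits or underscore) that are followed by =. \"\s* matches the opening quotation mark followed by zero or more whitespace characters, which drops the leading padding. (.*?) non-greedily captures the value, and (?:\sMbps|\sdB)? is an optional non-capturing group that matches ' Mbps' or ' dB' just before the closing quotation mark, so the unit is consumed without ending up in the captured value. See regex101 demo.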
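If more units turn up among the other status lines, the same idea can be wrapped in a small helper that builds the optional group from a list of suffixes. The following is only a sketch: the name parse_fields and its units parameter are made up for illustration and are not part of the original script, and it uses \s* before the unit (rather than a single \s) on the assumption that the spacing between number and unit may vary.

```python
import re


def parse_fields(script_text, units=("Mbps", "dB")):
    """Return {name: value} for each name="value" pair, dropping a trailing unit.

    parse_fields and units are hypothetical names used only for this sketch.
    """
    # Escape each unit so it is matched literally, then join with | for alternation.
    unit_alternatives = "|".join(re.escape(u) for u in units)
    # Same structure as the pattern above, with the unit group generated from the list.
    pattern = re.compile(
        r'(\w+)="\s*(.*?)(?:\s*(?:' + unit_alternatives + r'))?"'
    )
    return dict(pattern.findall(script_text))


sample = '''var conn_down="    13.35 Mbps";
var line_up="    19.70 dB";
var lan_status="Link up";
var sys_uptime="74523";'''

fields = parse_fields(sample)
print(fields)
# {'conn_down': '13.35', 'line_up': '19.70', 'lan_status': 'Link up', 'sys_uptime': '74523'}
```

In the monitoring loop, the result of parse_fields(script.text) could stand in for the fields dict. The leading whitespace is already consumed by \s*, so the separate strip() step would mostly become redundant.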