I have a code below where I'm parsing a hostname text file with pd.read_csv
and getting them matched based on the prefix
, that works fine. However, as now there is a requirement where in sj12
I need to look for fourth character as an alphabet, example sj12 should match sh12[a-z]
ie sj12a001
, sj12u003
etc.
I'm looking if there is way in Pandas to do this:
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
#To drop if all values in the row are nan
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 sj124000 cr000011 crn00001 euk000011 eu5000011
sj000002 sj125000 cr000012 crn00002 eu0000012 eu5000013
sj000003 sj12at00 cr000013 crn00003 eu0000013 eu5000014
sj000004 sj12bt00 cr000014 crn00004 eu0000014 eu5000015
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 sj12at00 cr000011 crn00001 euk000011 eu5000011
sj000002 sj12bt00 cr000012 crn00002 eu0000012 eu5000013
sj000003 cr000013 crn00003 eu0000013 eu5000014
sj000004 cr000014 crn00004 eu0000014 eu5000015
Above in expected output you see sj124000
and sj125000
removed.
I got it resolved with str.extract
method.
df['sj12'] = df['sj12'].str.extract('(\w\w\d\d\w\*)', expand=True)
OR
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)