python-3.xpandasgroup-by

Python Pandas to match a string based on the prefix


I have a code below where I'm parsing a hostname text file with pd.read_csv and getting them matched based on the prefix , that works fine. However, as now there is a requirement where in sj12 I need to look for fourth character as an alphabet, example sj12 should match sh12[a-z] ie sj12a001 , sj12u003 etc.

I'm looking if there is way in Pandas to do this:

#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
import numpy as np

prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']

df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)

#To drop if all values in the row are nan
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)

Current output with the above code

sj00        sj12        cr00        cr08        eu00        eu50
sj000001    sj124000    cr000011    crn00001    euk000011   eu5000011
sj000002    sj125000    cr000012    crn00002    eu0000012   eu5000013
sj000003    sj12at00    cr000013    crn00003    eu0000013   eu5000014
sj000004    sj12bt00    cr000014    crn00004    eu0000014   eu5000015

Expected output

    sj00        sj12        cr00        cr08        eu00        eu50
    sj000001    sj12at00    cr000011    crn00001    euk000011   eu5000011
    sj000002    sj12bt00    cr000012    crn00002    eu0000012   eu5000013
    sj000003                cr000013    crn00003    eu0000013   eu5000014
    sj000004                cr000014    crn00004    eu0000014   eu5000015

Above in expected output you see sj124000 and sj125000 removed.


Solution

  • I got it resolved with str.extract method.

    df['sj12'] = df['sj12'].str.extract('(\w\w\d\d\w\*)', expand=True)
    

    OR

    df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)