python pandas regex

How to correctly extract Numbers from String using regex in Python?


I am trying to extract a number from each string, but only when the string ends with an integer or a decimal.

import pandas as pd

df = pd.DataFrame({'Names': ["Absolute Neutrophil Count", "Absolute Lymphocyte Count 2.9",
                             "Absolute Neutrophil Count 10.2", "ESR (Modified Westergren) 8",
                             "Free Triiodothyronine (FT3) 3.59",
                             "Free Triiodothyronine FT4 4.53"]})

df

                         Names
0   Absolute Neutrophil Count
1   Absolute Lymphocyte Count 2.9
2   Absolute Neutrophil Count 10.2
3   ESR (Modified Westergren) 8
4   Free Triiodothyronine (FT3) 3.59
5   Free Triiodothyronine FT4 4.53

Desired Extraction Results:

0  Missing/None
1  2.9
2  10.2
3  8
4  3.59
5  4.53

I tried the code below, but it does not give the desired results:

df.iloc[:,0].str.extract(r'^(.*?)\s*(\d\.?\d*)?$') #  '\d+\.\d+'
                           0      1
0   Absolute Neutrophil Count     NaN
1   Absolute Lymphocyte Count     2.9
2   Absolute Neutrophil Count 1   0.2
3   ESR (Modified Westergren)     8
4   Free Triiodothyronine (FT3)   3.59
5   Free Triiodothyronine FT4     4.53

Please give the answer in DataFrame form using .str.extract. Answers that use re on plain strings often work there, but they behave differently when I try to apply them to a DataFrame.


Solution

  • The following regex pattern works for this data:

    (\d+(?:\.\d+)?)$
    

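    This matches an integer or decimal at the end of the string. In the Pandas code the capture group is additionally made optional (the trailing ?), so the pattern matches every row, and rows with no ending number return NaN. The updated Pandas code would look like: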

    df.iloc[:,0].str.extract(r'(\d+(?:\.\d+)?)?$')
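
For completeness, here is a minimal end-to-end sketch on the sample data from the question. It makes a few assumptions that are not in the original post: the column name 'Value' is hypothetical, expand=False is used so that .str.extract returns a Series rather than a one-column DataFrame, and pd.to_numeric is added to turn the extracted strings into floats. It uses the non-optional form of the pattern, which should give the same result here because .str.extract returns NaN when nothing matches.

```python
import pandas as pd

df = pd.DataFrame({'Names': [
    "Absolute Neutrophil Count",
    "Absolute Lymphocyte Count 2.9",
    "Absolute Neutrophil Count 10.2",
    "ESR (Modified Westergren) 8",
    "Free Triiodothyronine (FT3) 3.59",
    "Free Triiodothyronine FT4 4.53",
]})

# Capture an integer or decimal only if it sits at the very end of the string.
# expand=False returns a Series because there is a single capture group.
values = df['Names'].str.extract(r'(\d+(?:\.\d+)?)$', expand=False)

# 'Value' is a hypothetical column name chosen for this sketch.
# pd.to_numeric converts the extracted strings to floats and keeps NaN
# for rows where nothing was extracted.
df['Value'] = pd.to_numeric(values)

print(df)
```

Row 0 has no trailing number, so its 'Value' is NaN; the digit inside "(FT3)" and "FT4" is ignored because it is not at the end of the string.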
    

    Here is an explanation of the regex pattern being used: