python regex pandas data-cleaning

Clean pandas series using regex


I am trying to clean a column called 'historical_rank' in a pandas dataframe. It contains string data. Here is a sample of the content:

       historical_rank
...    ...
122    1908
123    O'   
124 
125    1911  
126    1912  
127    1913 * * * 2010 * * *  
128
129    1914  
130    1915
131  
132
133    1918  
134    (First served 1989 to 1999)
...    ...

The data I want to retain are the four-digit numbers in rows 122, 125, 126, 127, 129, 130, and 133. Elsewhere in the series that number (the historical rank) may be one, two, or three digits. It always begins the string, and there is always a space after it. I want to use regex to keep the desired pattern -- r'\d{1,4}(?=\s)' -- and remove everything else throughout the series. What is the correct code to achieve this? Thank you.


Solution

  • IIUC

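    # Raw string avoids invalid-escape warnings; expand=False returns a Series (NaN where nothing matches)
    df['historical_rank_new'] = df['historical_rank'].str.extract(r'^(\d{1,4})', expand=False)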
    df
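
For anyone who wants to try this without the original data, here is a minimal sketch. The sample values are made up to mirror the rows in the question, and the pattern adds the asker's `(?=\s)` lookahead on the assumption that the rank really is always followed by whitespace. If the strings have already been stripped, drop the lookahead and use the plain `^(\d{1,4})` pattern from the answer above.

```python
import pandas as pd

# Hypothetical sample mirroring the rows shown in the question.
df = pd.DataFrame({
    "historical_rank": [
        "1908 ",
        "O'",
        "",
        "1911 ",
        "1913 * * * 2010 * * * ",
        "(First served 1989 to 1999)",
        "7 ",
    ]
})

# Anchor at the start of the string, capture 1 to 4 digits,
# and require a whitespace character immediately after them.
pattern = r"^(\d{1,4})(?=\s)"

# expand=False returns a Series; rows without a match become NaN.
df["historical_rank_new"] = df["historical_rank"].str.extract(pattern, expand=False)

print(df)
```

Rows that do not start with a rank (such as `O'` or the parenthesised note) end up as NaN in the new column, so they can be dropped afterwards with `dropna(subset=["historical_rank_new"])` if they are not needed.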