I am trying to clean a column called 'historical_rank' in a pandas DataFrame. It contains string data. Here is a sample of the content:
historical_rank
... ...
122 1908
123 O'
124
125 1911
126 1912
127 1913 * * * 2010 * * *
128
129 1914
130 1915
131
132
133 1918
134 (First served 1989 to 1999)
... ...
The data I want to retain are the four-digit numbers in rows 122, 125, 126, 127, 129, 130, and 133. Elsewhere in the series that number (the historical rank) may be one, two, or three digits. It always begins the string, and there is always a space after it. I want to use regex to keep the desired pattern -- r'\d{1,4}(?=\s)' -- and remove everything else throughout the series. What is the correct code to achieve this? Thank you.
IIUC:
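# Raw string so \d is not treated as a Python string escape; expand=False returns a Series.
df['historical_rank_new'] = df['historical_rank'].str.extract(r'^(\d{1,4})', expand=False)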
df
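
For anyone who wants to try this without the original data, here is a minimal, self-contained sketch. The sample rows are made up to resemble the ones in the question, and the `historical_rank_int` column is my own addition, not something the question asked for. I also added a lookahead, `(?=\s|$)`, so a rank counts when it is followed by whitespace or by the end of the string; drop the `|$` part if the trailing space really is always present.

```python
import pandas as pd

# Hypothetical sample data resembling the rows shown in the question.
df = pd.DataFrame({
    "historical_rank": [
        "1908",
        "O'",
        "",
        "1913 * * * 2010 * * *",
        "(First served 1989 to 1999)",
        "7 ",
    ]
})

# ^          anchor at the start of the string
# (\d{1,4})  capture one to four digits
# (?=\s|$)   require whitespace or end of string immediately after
pattern = r"^(\d{1,4})(?=\s|$)"

# expand=False returns a Series; rows with no match become NaN.
df["historical_rank_new"] = df["historical_rank"].str.extract(pattern, expand=False)

# Optional: convert to a nullable integer column so missing ranks stay missing.
df["historical_rank_int"] = pd.to_numeric(df["historical_rank_new"]).astype("Int64")

print(df)
```

Rows that do not start with a rank (the stray `O'`, blanks, the parenthetical note) end up as missing values rather than being dropped, so the index still lines up with the original frame. You can follow with `df.dropna(subset=["historical_rank_new"])` if you want them gone.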