pythonpandasdataframe

extract first sequence of numbers from a pandas column


I have imported a CSV into a pandas data frame; however, the column I need to use is freeform and in bad shape.

I need to extract the first series of numbers after the word NBU or the first series of numbers in the string. See some examples below:-

nbu 123456
NBU-123456
nbu/ 123456 blah12
123456
123456_123

All of the above should be cleaned to produce 123456. Note that the number of integers returned is based on how many are in a continual sequence; i.e., nbu12 3455 should only return 12.

I will then use something like this to fix the data:-

df['col'] = df['col'].str.

Solution

  • A possible solution, which works as follows: It uses str.extract method to extract digits from the col column of df, where in r'(\d+)' the parentheses denote a capturing group, with \d+ meaning one or more digits.

    df['col'].str.extract(r'(\d+)')
    

    Output:

            0
    0  123456
    1  123456
    2  123456
    3  123456
    4  123456