python pandas string split

Extract first digit sequence from string containing digits, non-digits and then digits


I have a column in a Pandas dataframe that contains values as follows:

111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA

I need to extract just the first sequence of digits in each row - not all of the digits in the row. So the output would be like this:

111042345
111042345 
110374217 
109202817

I thought the best way to achieve that would be to split the strings by digits and return that, but that would also give me the unwanted digits after the non-digit characters.


Solution

  • Use str.extract with a regex: \d matches a digit, {,5} limits the match to at most 5 digits, and + matches the whole run of digits:

    df['first_5_digits'] = df['Col'].str.extract(r'(\d{,5})')
    df['all_digits'] = df['Col'].str.extract(r'(\d+)')
    print(df)
                           Col first_5_digits all_digits
    0              111042345--          11104  111042345
    1                111042345          11104  111042345
    2    110374217dclid=CA-R3K          11037  110374217
    3  109202817lciz@MM10082IA          10920  109202817
    

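    As @Jon Clements pointed out, it is also possible to take the first N digits by slicing the extracted string (in current pandas versions, expand=False is needed so that str.extract returns a Series, which has the .str accessor):

    df['first_5_digits'] = df['Col'].str.extract(r'(\d+)', expand=False).str[:5]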
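
For reference, here is a minimal self-contained sketch of the approach that can be run as-is. It rebuilds the sample data from the question; the column name `Col` follows the answer, and the variable name `first_digits` is only illustrative.

```python
import pandas as pd

# Sample values copied from the question; 'Col' is the column name used in the answer.
df = pd.DataFrame({'Col': ['111042345--',
                           '111042345',
                           '110374217dclid=CA-R3K',
                           '109202817lciz@MM10082IA']})

# r'(\d+)' captures the first run of consecutive digits in each string.
# expand=False returns a Series (not a one-column DataFrame) for a single group.
first_digits = df['Col'].str.extract(r'(\d+)', expand=False)

df['all_digits'] = first_digits
# Slicing the extracted string keeps only its first 5 characters.
df['first_5_digits'] = first_digits.str[:5]

print(df)
```

One difference worth keeping in mind: `\d+` finds the first run of digits wherever it occurs in the string, whereas `\d{,5}` can also match zero digits, so for a value that does not start with a digit it would return an empty string rather than the digits that appear later.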