python-3.x pandas python-regex

Extract substrings from a column of strings and place them in a list


I have the following data frame:

     a            b   x
0  id1   abc 123 tr   2
1  id2  abd1 124 tr   6
2  id3  abce 126 af   9
3  id4   abe 128 nm  12

From column b, I need to extract the substring before the first space in each row. The result should be:

list_of_strings = ['abc', 'abd1', 'abce', 'abe']

Please advise


Solution

  • Use str.extract with the regex ^(\S+), which captures one or more non-whitespace characters anchored to the start of the string:

    df['b'].str.extract(r'^(\S+)', expand=False)
    

    Output:

    0     abc
    1    abd1
    2    abce
    3     abe
    Name: b, dtype: object
    

    For a list:

    list_of_strings = df['b'].str.extract(r'^(\S+)', expand=False).tolist()
    # ['abc', 'abd1', 'abce', 'abe']
    

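For anyone who wants to run this end to end, here is a minimal, self-contained sketch. It rebuilds the data frame from the question and applies the str.extract call from the answer. The second variant, using str.split, is not part of the original answer; it is shown only as a possible regex-free alternative, and the name via_split is made up for illustration.

```python
import pandas as pd

# Rebuild the data frame from the question.
df = pd.DataFrame({
    "a": ["id1", "id2", "id3", "id4"],
    "b": ["abc 123 tr", "abd1 124 tr", "abce 126 af", "abe 128 nm"],
    "x": [2, 6, 9, 12],
})

# Approach from the answer: capture the leading run of non-whitespace characters.
# expand=False returns a Series rather than a one-column DataFrame.
list_of_strings = df["b"].str.extract(r"^(\S+)", expand=False).tolist()

# Alternative (an assumption, not from the answer): split once on whitespace
# and keep the first piece.
via_split = df["b"].str.split(n=1).str[0].tolist()

print(list_of_strings)
print(via_split)
```

Both variants should give the same list for this data. They can differ on values with leading whitespace: the anchored regex finds no match there and yields NaN, whereas a whitespace split skips the leading whitespace.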