regex pandas dataframe for-loop iteritems

Any ideas on iterating over a dataframe and applying regex?


This may be a rudimentary problem but I am new to pandas.

I have a dataframe loaded from a CSV file, and I want to iterate over each row to extract all the string information in a specific column with a regex. (The reason I am using regex is that eventually I want to make a separate dataframe from that column.)

I tried iterating with a for loop but I got a ton of errors. So far, it looks like the for loop reads each input row as a list or Series rather than a string (correct me if I'm wrong). The main functions I am using are iteritems() and findall(), but I have had no good results so far. How can I approach this problem?

My dataframe looks like this:

df = pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]

My approach looks like this:

for Individual_row in df['TEXT'].iteritems():
   parsed = re.findall('(.*?)\:\s*?\[(.*?)\]', Individual_row)
   res = {g[0].strip() : g[1].strip() for g in parsed}

Many thanks in advance


Solution

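  • You can try the following instead of a loop. Note that na_action is a parameter of Series.map (not Series.apply), and the list comprehension inside the lambda needs its own brackets:

    df['new_TEXT'] = df['TEXT'].map(lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?):\s*?\[(.*?)\]', x)], na_action='ignore')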
    

    This creates a new column, new_TEXT, holding the parsed results for each row; missing values are left untouched.
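
    Since the end goal is a separate dataframe built from that column, here is a minimal sketch of one way to get there. The sample rows and the helper name parse_text are made up for illustration, and the "key: [value]" layout of TEXT is only an assumption based on the regex in the question; adjust both to the real data.

```python
import re

import pandas as pd

# Hypothetical sample data; in the question this column comes from foobar.csv.
df = pd.DataFrame({"TEXT": ["name: [Alice] city: [Paris]", None, "name: [Bob]"]})

# Same pattern as in the question, written as a raw string and compiled once.
PATTERN = re.compile(r"(.*?):\s*?\[(.*?)\]")


def parse_text(text):
    """Return a {key: value} dict for every 'key: [value]' pair in text."""
    return {key.strip(): value.strip() for key, value in PATTERN.findall(text)}


# Series.map calls parse_text once per cell, so each call receives a plain string.
# na_action="ignore" skips missing cells instead of passing them to the function.
parsed = df["TEXT"].map(parse_text, na_action="ignore")

# Build the separate dataframe: one column per key, one row per non-missing cell.
valid = parsed.dropna()
result = pd.DataFrame(valid.tolist(), index=valid.index)
print(result)
```

    As for why the original loop failed: iteritems() yields (index, value) tuples rather than bare strings, so re.findall receives a tuple. If you do want an explicit loop, unpack the pair, e.g. `for idx, text in df['TEXT'].items():`. Also note that iteritems() was removed in pandas 2.0 in favour of items().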