[SOLVED] How do I match a column of a dataframe of a particular length with another vector which has certain key-words to match to?

My data frame Expenses is shown below:

date        name           expenditure      type
23MAR2013   KOSH ENTRP     4000             COMPANY
23MAR2013   JOHN DOE       800              INDIVIDUAL
24MAR2013   S KHAN         300              INDIVIDUAL
24MAR2013   JASINT PVT LTD 8000             COMPANY
25MAR2013   KOSH ENTRPRISE 2000             COMPANY
25MAR2013   JOHN S DOE     220              INDIVIDUAL
25MAR2013   S KHAN         300              INDIVIDUAL
26MAR2013   S KHAN         300              INDIVIDUAL

Earlier, I identified the recurring names and patterns in the name column and stored them in a vector, NameVector, shown below.

KOSH    JOHN DOE    KHAN    JASINT

My question is: how do I match each string in Expenses$name against the patterns in NameVector and store the matched pattern as a new category column in the main data frame, like this?

date        name           expenditure      type           category 
23MAR2013   KOSH ENTRP     4000             COMPANY        KOSH
23MAR2013   JOHN DOE       800              INDIVIDUAL     JOHN DOE
24MAR2013   S KHAN         300              INDIVIDUAL     KHAN          
24MAR2013   JASINT PVT LTD 8000             COMPANY        JASINT
25MAR2013   KOSH ENTRPRISE 2000             COMPANY        KOSH
25MAR2013   JOHN S DOE     220              INDIVIDUAL     JOHN DOE
25MAR2013   S KHAN         300              INDIVIDUAL     KHAN
26MAR2013   S KHAN         300              INDIVIDUAL     KHAN

I tried splitting the name column on every possible delimiter (spaces, |, *, commas, etc.) using strsplit() to get the different parts of each name into separate columns, and then matching the patterns using agrep(), but I am not getting the desired output. On closer inspection of the data, I noticed leading whitespace and removed it, but I still don't know why I am not getting the output shown above.

The CSV for the above table:

"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"

and the names vector that was identified is

NameVector <- c("KOSH","JOHN DOE","KHAN","JASINT")

Solution

You could try

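library(stringi)
# Build one alternation from every word in NameVector: "KOSH|JOHN|DOE|KHAN|JASINT"
pat <- paste(unlist(strsplit(NameVector, ' ')), collapse="|")
# Extract all matching words from each name, then rejoin them with a space
# (so "JOHN S DOE" yields "JOHN DOE")
Expenses$category <- vapply(stri_extract_all_regex(Expenses$name, pat),
           paste, collapse=' ', character(1L))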
Expenses
#       date           name expenditure       type category
#1 23MAR2013     KOSH ENTRP        4000    COMPANY     KOSH
#2 23MAR2013       JOHN DOE         800 INDIVIDUAL JOHN DOE
#3 24MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#4 24MAR2013 JASINT PVT LTD        8000    COMPANY   JASINT
#5 25MAR2013 KOSH ENTRPRISE        2000    COMPANY     KOSH
#6 25MAR2013     JOHN S DOE         220 INDIVIDUAL JOHN DOE
#7 25MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#8 26MAR2013         S KHAN         300 INDIVIDUAL     KHAN
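
If you would rather avoid the extra package, here is a minimal base R sketch of the same idea. It is not the answer's method: match_category is a hypothetical helper name, and it assumes a name belongs to the first NameVector entry whose words all occur in it as literal substrings. Names that match no entry get NA.

```r
# Sample data copied from the question (only the name column is needed here).
Expenses <- data.frame(
  name = c("KOSH ENTRP", "JOHN DOE", "S KHAN", "JASINT PVT LTD",
           "KOSH ENTRPRISE", "JOHN S DOE", "S KHAN", "S KHAN"),
  stringsAsFactors = FALSE
)
NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")

# Hypothetical helper (not from the original answer): return the first key
# whose words ALL occur in `name`, or NA if no key matches.
match_category <- function(name, keys) {
  for (key in keys) {
    words <- strsplit(key, " ", fixed = TRUE)[[1]]
    # fixed = TRUE treats each word as a literal substring, not a regex
    found <- vapply(words, function(w) grepl(w, name, fixed = TRUE), logical(1))
    if (all(found)) return(key)
  }
  NA_character_
}

Expenses$category <- vapply(Expenses$name, match_category, character(1),
                            keys = NameVector, USE.NAMES = FALSE)
Expenses
```

Unlike the single alternation pattern, this only assigns "JOHN DOE" when both JOHN and DOE are present, so a name containing just DOE would be left as NA rather than categorised as "DOE". Whether that is the behaviour you want depends on your data.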