Tags: regex, r, pattern-matching, agrep

How do I match a column of a data frame against another vector that holds the keywords to match?


My data frame Expenses is shown below:

date        name           expenditure      type
23MAR2013   KOSH ENTRP     4000             COMPANY
23MAR2013   JOHN DOE       800              INDIVIDUAL
24MAR2013   S KHAN         300              INDIVIDUAL
24MAR2013   JASINT PVT LTD 8000             COMPANY
25MAR2013   KOSH ENTRPRISE 2000             COMPANY
25MAR2013   JOHN S DOE     220              INDIVIDUAL
25MAR2013   S KHAN         300              INDIVIDUAL
26MAR2013   S KHAN         300              INDIVIDUAL

Earlier, I identified the names and patterns that repeat in the name column and stored them in a vector, NameVector, shown below.

KOSH    JOHN DOE    KHAN    JASINT

My question is: how do I match every string in Expenses$name against the patterns in NameVector and add the matched pattern as a category column in the main data frame, as shown below?

date        name           expenditure      type           category 
23MAR2013   KOSH ENTRP     4000             COMPANY        KOSH
23MAR2013   JOHN DOE       800              INDIVIDUAL     JOHN DOE
24MAR2013   S KHAN         300              INDIVIDUAL     KHAN          
24MAR2013   JASINT PVT LTD 8000             COMPANY        JASINT
25MAR2013   KOSH ENTRPRISE 2000             COMPANY        KOSH
25MAR2013   JOHN S DOE     220              INDIVIDUAL     JOHN DOE
25MAR2013   S KHAN         300              INDIVIDUAL     KHAN
26MAR2013   S KHAN         300              INDIVIDUAL     KHAN

I tried splitting the name column on every possible delimiter (spaces, |, *, commas, etc.) using strsplit() to get the different parts of each name into separate columns, and then matching the patterns with agrep(), but I am not getting the desired output. On further inspection of the data, I noticed leading whitespace and removed it, but I still have no clue why I am not getting the output shown above.


The CSV for the above table:

"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"

and the names vector that has been identified:

NameVector <- c("KOSH","JOHN DOE","KHAN","JASINT")

Solution

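  • You could try building a single alternation pattern from the words in NameVector and extracting every match with stringi:

    library(stringi)
    # Split multi-word keys into single words and join them with "|",
    # giving "KOSH|JOHN|DOE|KHAN|JASINT"
    pat <- paste(unlist(strsplit(NameVector, ' ')), collapse="|")
    # Extract every matching word from each name, then paste the
    # matches for a row back together with a space
    Expenses$category <- vapply(stri_extract_all_regex(Expenses$name, pat), 
               paste, collapse=' ', character(1L))
    Expenses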
    #       date           name expenditure       type category
    #1 23MAR2013     KOSH ENTRP        4000    COMPANY     KOSH
    #2 23MAR2013       JOHN DOE         800 INDIVIDUAL JOHN DOE
    #3 24MAR2013         S KHAN         300 INDIVIDUAL     KHAN
    #4 24MAR2013 JASINT PVT LTD        8000    COMPANY   JASINT
    #5 25MAR2013 KOSH ENTRPRISE        2000    COMPANY     KOSH
    #6 25MAR2013     JOHN S DOE         220 INDIVIDUAL JOHN DOE
    #7 25MAR2013         S KHAN         300 INDIVIDUAL     KHAN
    #8 26MAR2013         S KHAN         300 INDIVIDUAL     KHAN
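
  • If you would rather avoid the stringi dependency, the same idea can be sketched in base R with gregexpr() and regmatches(). This is only an illustration under a few assumptions: the data frame is rebuilt by hand from the table in the question, the helper variable hits is a name made up here, and rows with no match are set to NA, which the question does not specify.

```r
# Sample data copied from the question
Expenses <- data.frame(
  date = c("23MAR2013", "23MAR2013", "24MAR2013", "24MAR2013",
           "25MAR2013", "25MAR2013", "25MAR2013", "26MAR2013"),
  name = c("KOSH ENTRP", "JOHN DOE", "S KHAN", "JASINT PVT LTD",
           "KOSH ENTRPRISE", "JOHN S DOE", "S KHAN", "S KHAN"),
  expenditure = c(4000, 800, 300, 8000, 2000, 220, 300, 300),
  type = c("COMPANY", "INDIVIDUAL", "INDIVIDUAL", "COMPANY",
           "COMPANY", "INDIVIDUAL", "INDIVIDUAL", "INDIVIDUAL"),
  stringsAsFactors = FALSE
)
NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")

# Split multi-word keys into single words and join them with "|"
pat <- paste(unlist(strsplit(NameVector, " ", fixed = TRUE)), collapse = "|")

# gregexpr() locates every match; regmatches() returns them as a list
# with one character vector per row
hits <- regmatches(Expenses$name, gregexpr(pat, Expenses$name))

# Paste the matched words of each row together; rows without any
# match get NA (an assumption, not something the question asks for)
Expenses$category <- vapply(
  hits,
  function(x) if (length(x) > 0) paste(x, collapse = " ") else NA_character_,
  character(1L)
)
Expenses
```

    Note that neither version anchors the pattern on word boundaries, so a key such as KHAN would also match inside a longer word. If that matters for your data, wrapping each word in \\b on both sides before collapsing is one option.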