rdictionarydataframeleft-joinno-match

R Data frame column update


I have a dataframe with names I have a second dataframe with a dictionary of names and the sex of those names I want to check if the name is in the dictionary if it is then add the sex from the dictionary table to the names dataframe

My code looks like below

# Sets everything to -1, 1 for male, 0 for female
train$sex <- "-1"
train$sex[toupper(train$fname) == nam_dict$Name]<-nam_dict$Sex

I am getting the following error

Error in train$sex[toupper(train$fname) == nam_dict$Name] <- nam_dict$Sex : 
NAs are not allowed in subscripted assignments
In addition: Warning message:
In toupper(train$fname) == nam_dict$Name :
longer object length is not a multiple of shorter object length

I have a work around - i think - where i can split the dictionary into male and female and simply replace the <-nam_dict$Sex portion of the code with the character 'F' or 'M' depending on the dictionary

I just thought there would be a better way


Solution

  • So your problem can be solved by a quick and simple implementation of match. First, here's a quick reproducible example

    (train <- data.frame(fname = c("Alex", "Jennifer", "David", "Alice")))
    #      fname
    # 1     Alex
    # 2 Jennifer
    # 3    David
    # 4    Alice
    (nam_dict <- data.frame(Name = c("alice", "alex"), Sex = 0:1))
    #    Name Sex
    # 1 alice   0
    # 2  alex   1
    

    A possible solution

    train$sex <- nam_dict$Sex[match(tolower(train$fname), tolower(nam_dict$Name))]
    train
    #      fname sex
    # 1     Alex   1
    # 2 Jennifer  NA
    # 3    David  NA
    # 4    Alice   0
    

    Al bit more advanced solution (if you''ll ever work with huge data sets) is to try data.tables binary join which allows you to update your data by reference, chose which columns to update while everything is done in a lightning speed.

    First we will convert both data sets to data.table class and make both columns lower case, then we will key them by the column we want join by, finally we do a left join to train while creating the sex column by reference and pulling the data from i (Sex column in nam_dict)

    library(data.table)
    setDT(train)[, fname := tolower(fname)]
    setDT(nam_dict)[, Name := tolower(Name)]
    setkey(train, fname) ; setkey(nam_dict, Name)
    train[nam_dict, sex := i.Sex]
    train
    #       fname sex
    # 1:     alex   1
    # 2:    alice   0
    # 3:    david  NA
    # 4: jennifer  NA