I am cleaning demographic data submitted by 10+ school districts, and the submissions are not standardized. I would like to find patterns and recode them so that the data is clean and consistent.
Let's say I have a variable called Race, and one of the categories is "Native Hawaiian - Pacific Islander".
School A submits this category as "Native Hawaiian or Other Pacific Islander". School B submits this category as "Native Hawaiian/Pacific Islander". School C submits this category as "Native Hawaiian or Pacific Islander".
How could I recode this such that if R sees the word "Pacific" anywhere in the variable, it will recode the value to "Native Hawaiian - Pacific Islander"?
Here is the original data:
df_original <- data.frame(Race=c("Native Hawaiian or Other Pacific Islander",
"Native Hawaiian/Pacific Islander", "Native Hawaiian or Pacific Islander",
"Black or African American", "Black", "Black/African American"))
Here is the ideal cleaned data:
df_desired <- data.frame(Race=c("Native Hawaiian - Pacific Islander","Native Hawaiian - Pacific Islander",
"Native Hawaiian - Pacific Islander","Black - African American",
"Black - African American","Black - African American"))
grepl() returns TRUE for strings that contain "Pacific" and FALSE otherwise. Use that logical vector to subset the column and replace the matching values with the string you want:
df_original$Race[grepl("Pacific", df_original$Race)] <- "Native Hawaiian - Pacific Islander"
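The same idea can be extended to several categories at once. The following is a minimal sketch, not the only way to do it: recode_map is a hypothetical lookup vector (its names are the search patterns, its values the clean labels), and ignore.case = TRUE is an assumption that districts may differ in capitalization. It also assumes Race is a character column. In R versions before 4.0.0, data.frame() converts strings to factors by default, and assigning a label that is not an existing factor level would produce NA, so stringsAsFactors = FALSE is set explicitly here.

```r
# Same data as in the question; keep Race as character rather than factor
df <- data.frame(Race = c("Native Hawaiian or Other Pacific Islander",
                          "Native Hawaiian/Pacific Islander",
                          "Native Hawaiian or Pacific Islander",
                          "Black or African American", "Black",
                          "Black/African American"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: names are patterns to search for, values are clean labels
recode_map <- c("Pacific" = "Native Hawaiian - Pacific Islander",
                "Black"   = "Black - African American")

# For each pattern, overwrite every value that contains it
for (pattern in names(recode_map)) {
  hits <- grepl(pattern, df$Race, ignore.case = TRUE)
  df$Race[hits] <- recode_map[[pattern]]
}
```

Patterns are applied in order, so if a value could match more than one pattern, the last matching pattern wins. Choose patterns that are specific enough not to overlap.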