I have written a function that creates a new data.frame with character vectors. I would like to convert those character vectors to factors using the original data.frame object, which has the same variable names and labels.
I have written a `for` loop that works, but I would like to use `apply` since I expect it to be faster. I need to do this programmatically because the user specifies which variables are used to create the new data.frame, and how many there are.
Here is the `for` loop:
for (x in seq_along(group_labs)) {
  # `[[` extracts the column as a vector; `[` would return a one-column data.frame
  new[[group_labs[x]]] <- factor(
    new[[group_labs[x]]],
    levels = levels(old[[group_labs[x]]])
  )
}
Here is some fake data to show that this works:
# create fake data
new <- data.frame(
  var1 = c("a", "c", "b", "c", "b", "a"),
  var2 = c("z", "y", "y", "z", "x", "y"),
  var3 = c(1, 3, 4, 2, 1, 4),
  stringsAsFactors = FALSE
)
# recreate the fake data as the old data
old <- new
# make var1 and var2 in the old one factors manually
old$var1 <- factor(new$var1, levels = c("a", "b", "c"))
old$var2 <- factor(new$var2, levels = c("x", "y", "z"))
# set the reference variables
group_labs <- c("var1", "var2")
# run for loop that automatically converts the variables to factors
for (x in seq_along(group_labs)) {
  # use `[[` so that factor() and levels() receive vectors, not data.frames
  new[[group_labs[x]]] <- factor(
    new[[group_labs[x]]],
    levels = levels(old[[group_labs[x]]])
  )
}
# check it worked
class(new$var1)
# check it worked
class(new$var2)
Is there a way to use `apply` instead of a `for` loop?
Thanks in advance!
The `lapply` approach in your answer can be improved further. First, put the names and levels together in a named list. Next, loop over the names rather than over the elements themselves. Include a check that raises a `warning`, so that columns are not added unnoticed: only if the name exists as a column in `new` is a `factor` created, using the levels stored in the corresponding list element.
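> factorizer <- \(df, lst) {
+   # iterate over the names so that the result is a named list
+   lapply(names(lst) |> setNames(nm=_), \(x) {
+     if (!x %in% names(df)) {
+       # unknown column: warn and return NULL instead of creating it
+       warning(sprintf('%s not found', sQuote(x)), call.=FALSE)
+       NULL
+     } else {
+       factor(df[[x]], lst[[x]])
+     }
+   })
+ }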
>
> factorizer(new, fct_data)
$var1
[1] a c b c b a
Levels: c b a
$var2
[1] z y y z x y
Levels: x y z
$foo
NULL
Warning message:
‘foo’ not found
Here the entry `foo=1:3` in `fct_data` (defined below), which stands in for a typo or an otherwise nonexistent column, does not affect the result. Although the function returns an element `foo`, it is not assigned to the original data because it is `NULL`.
> fct_data <- list(var1=c("c", "b", "a"), var2=c('x', 'y', 'z'), foo=1:3)
>
> new[names(fct_data)] <- factorizer(new, fct_data)
Warning message:
‘foo’ not found
After the assignment, `new` looks like this:
> str(new)
'data.frame': 6 obs. of 3 variables:
$ var1: Factor w/ 3 levels "c","b","a": 3 1 2 1 2 3
$ var2: Factor w/ 3 levels "x","y","z": 3 2 2 3 1 2
$ var3: num 1 3 4 2 1 4
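If the levels are meant to come from the original data.frame, as in the question, the named list does not have to be typed by hand. The following is only a minimal sketch, assuming `old`, `new` and `group_labs` are defined as in the question's fake data; using `Map` for the conversion is my own suggestion here and not part of the `factorizer` approach above.

```r
# Fake data from the question (assumed setup)
new <- data.frame(
  var1 = c("a", "c", "b", "c", "b", "a"),
  var2 = c("z", "y", "y", "z", "x", "y"),
  var3 = c(1, 3, 4, 2, 1, 4),
  stringsAsFactors = FALSE
)
old <- new
old$var1 <- factor(new$var1, levels = c("a", "b", "c"))
old$var2 <- factor(new$var2, levels = c("x", "y", "z"))
group_labs <- c("var1", "var2")

# Named list of levels, one element per grouping variable in `old`
fct_data <- lapply(old[group_labs], levels)

# Pair each column of `new` with its levels; the second positional
# argument of factor() is `levels`
new[group_labs] <- Map(factor, new[group_labs], fct_data)
```

Unlike `factorizer`, this sketch does not check whether every name exists: `old[group_labs]` and `new[group_labs]` raise an error for a missing column rather than a warning.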
Data:
> dput(new)
structure(list(var1 = c("a", "c", "b", "c", "b", "a"), var2 = c("z",
"y", "y", "z", "x", "y"), var3 = c(1, 3, 4, 2, 1, 4)), class = "data.frame", row.names = c(NA,
-6L))