I am trying to write a function that, using tidyverse tools, will load two CSVs, join them on some key column, and write the joined result to a CSV. I am passing in the names of the key columns as character, and using double braces ({{ ... }}
) to make the column name usable in the inner_join
.
I have a function that works on the data frames. But I'm writing another function that takes filenames, loads the CSVs into data frames, and calls function 1. The "outer" function also assumes that the first column in the CSV is the key column, and passes that column name to the inner function.
The problem is that unless the column name is hard-coded, the code breaks. It appears that the outer function is passing in the argument unevaluated. If I read the column name into a variable, I get an error message like this:
Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `df1_key`.
Run `rlang::last_trace()` to see where the error occurred.
If I use colnames(df_loaded_from_csv)[1]
as the function argument, I get an error message like this:
Error in `join_by()`:
! Expressions can't contain computed columns, and can only reference columns by name or by explicitly specifying
a side, like `x$col` or `y$col`.
ℹ Expression 1 contains `colnames(df1)[1]`.
Run `rlang::last_trace()` to see where the error occurred.
In each case, it appears that the variable or the expression is not being evaluated, and is ending up in as an argument to the inner_join
in the inner function. I can't even begin to understand how this is even possible.
Here is a small reproducible example:
library(tidyverse)
# Create example CSVs
df1 = tibble(
key = LETTERS,
value = sample.int(100, 26)
)
write_csv(df1, "df1.csv")
df2 = tibble(
key = LETTERS[1:13],
value = sample.int(100, 13)
)
write_csv(df2, "df2.csv")
# Function to join two data frames and return the joined data frame
join_dfs = function(x, y, x_key = "key", y_key = "key") {
df = x |>
inner_join(y, by = join_by({{ x_key }} == {{ y_key }} ))
}
# This works:
df3 = join_dfs(df1, df2, "key", "key")
# Function to load to CSVs and call join_dfs on the loaded data frames
join_csvs = function(x, y, file) {
df1 = read_csv(x)
df1_key = colnames(df1)[1]
print(df1_key)
df2 = read_csv(y)
df2_key = colnames(df2)[1]
print(df2_key)
df3 = join_dfs(df1, df2, df1_key, df2_key)
write_csv(df3, file)
}
# This fails:
join_csvs("df1.csv", "df2.csv", "df3.csv")
How can I get this to work?
Try !!
(bang bang) instead. I'm not sure why {}
(curly curly) doesn't work using the second method.
join_dfs = function(x, y, x_key = "key", y_key = "key") {
df <- x |>
inner_join(y, by = join_by(!!x_key == !!y_key))
}
As an aside, it's recommended that a function returns something (usually placed on the last line but that's not strict) instead of assigning the final result to an object that will never be used (df
).
join_dfs = function(x, y, x_key = "key", y_key = "key") {
df <- x |>
inner_join(y, by = join_by(!!x_key == !!y_key))
df
}
Without the returning object on the last line, this command
join_dfs(df1, df2, "key", "key")
doesn't print anything. See this post for more details on that.
And I would simplify the function by just doing:
join_dfs = function(x, y, x_key = "key", y_key = "key") {
x |>
inner_join(y, by = join_by(!!x_key == !!y_key))
}