lobstr::obj_addr behaves strangely when it is vectorized over a list, even though putting objects in the list does not change their addresses.
I just started Advanced R by Wickham (2nd ed.) and reached the first exercise of section 2.2.2 (Exercises). I assumed that, given:
a <- 1:10; b <- a; c <- b
all of them would have the same memory address as retrieved by the lobstr::obj_addr function. That is true if we just use a, b or c as inputs, but as I am lazy and wanted to have all the values at once, I did:
list(a, b, c) |> lapply(obj_addr) # lapply or sapply
This returns a different address for each name, and the values change every time the call is run. The same happens if we set x <- list(a, b, c) before calling the function through lapply, even though obj_addr(x[[1]]), obj_addr(x[[2]]), obj_addr(x[[3]]) and obj_addr(a) are all equal, so it's not a matter of creating a new list every time.
Does someone know what is going on here? I understand that, to a certain point, each call generates a new output object that will have its own memory address, but I don't know how lapply can interfere with a function like obj_addr, which is constant for a given object.
How lobstr identifies the environment
This error arises from the way lobstr uses rlang quosure tools to access the object without increasing the reference count. The purpose of this is to allow garbage collection to happen properly later, by ensuring there are no references to the object hanging around. However, in the case of lapply(x, lobstr::obj_addr), it does not correctly access the environment of the elements of x.
What does lobstr::obj_addr() do?
The (slightly simplified) source is as follows:
obj_addr <- function(x) {
  x <- enquo(x)
  obj_addr_(quo_get_expr(x), quo_get_env(x))
}
obj_addr_() is the compiled C++ function that gets the memory address. However, obj_addr() first defuses the expression, so that it can refer to the object without increasing the reference count.
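To make the defusing step concrete, here is a minimal sketch of what enquo() captures in an ordinary, direct call. The helper inspect_arg() is hypothetical (it is not part of lobstr or rlang) and simply returns the two pieces that obj_addr() passes on to obj_addr_(). It assumes rlang is installed.

```r
# Hypothetical helper (not part of lobstr or rlang): returns the two pieces
# that obj_addr() extracts from the quosure.
inspect_arg <- function(x) {
  q <- rlang::enquo(x)  # defuse the argument instead of evaluating it
  list(expr = rlang::quo_get_expr(q), env = rlang::quo_get_env(q))
}

a <- 1:10
here <- environment()  # the environment this snippet runs in
res <- inspect_arg(a)

# The captured expression is the symbol a, not the vector it points to
identical(res$expr, quote(a))

# The captured environment is the one in which the call was written
identical(res$env, here)
```

In a direct call like this the expression and environment are enough to look the object up again later, which is what lets obj_addr() avoid holding its own reference to it.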
What happens with lapply()?
Consider the following function, which returns the environment that rlang associates with the argument it was called with:
f <- function(x) {
  rlang::quo_get_env(rlang::enquo(x))
}
We can call this from lapply() in two ways:
x <- list(a = 1, b = 2, c = 3)
lapply(x, \(y) f(y)) # anonymous function
# $a
# <environment: 0x557c265eb438>
# $b
# <environment: 0x557c265ea910>
# $c
# <environment: 0x557c26e275c8>
lapply(x, f) # function provided directly
# $a
# <environment: R_EmptyEnv>
# $b
# <environment: R_EmptyEnv>
# $c
# <environment: R_EmptyEnv>
The first set of results makes sense. lapply() creates a temporary environment each time it calls the anonymous function (the function closure).
However, it does not make sense that lapply(x, f) is running in the empty environment. We know we can refer to objects in the global environment with lapply(). But the empty environment by definition contains no objects and has no parent:
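f_parent <- function(x) {
  # namespace enquo() so this works without attaching rlang
  e <- rlang::quo_get_env(rlang::enquo(x))
  message("Objects in environment: ", length(ls(e)))
  # format() because message() cannot print an environment directly
  message("Parent environment: ", format(parent.env(e)))
}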
lapply(x, f_parent)
# Objects in environment: 0
# Error in parent.env(e) : the empty environment has no parent
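The same two properties can be checked in base R, independently of lapply() and rlang. This is only a small sketch to confirm that the behavior above belongs to the empty environment itself; the variable names are illustrative.

```r
e <- emptyenv()

# The empty environment has no bindings at all
n_objects <- length(ls(e))

# Asking for its parent raises an error, captured here as a string
parent_msg <- tryCatch(
  {
    parent.env(e)
    "has a parent"
  },
  error = function(err) conditionMessage(err)
)

n_objects
parent_msg
```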
So rlang::quo_get_env(rlang::enquo(x)) clearly returns the wrong environment. Let's try finding the parent environment of the function called by lapply() using base R:
f2 <- function(x) {
  parent.env(environment())
}
lapply(x, f2)
# $a
# <environment: R_GlobalEnv>
# $b
# <environment: R_GlobalEnv>
# $c
# <environment: R_GlobalEnv>
This makes more sense and gives us a clue as to what is going on.
To rule out lapply() as the source of this inconsistency, let's write our own version of lobstr::obj_addr() that doesn't mess around with environments. The relevant line of the C-level obj_addr_() function is where it casts the SEXP to a pointer:
static_cast<void *>(x);
Here is a similar function to get the pointer which skips the rlang stuff:
get_pointer <- inline::cfunction(
  sig = c(x = "integer"),
  body = '
    // cast SEXP to a void pointer like lobstr
    void* ptr = (void*) x;
    // put the pointer in a character array
    char addr_chr[32];
    snprintf(addr_chr, sizeof(addr_chr), "%p", ptr);
    // put address in character vector and return it
    SEXP addr = PROTECT(allocVector(STRSXP, 1));
    SET_STRING_ELT(addr, 0, mkChar(addr_chr));
    UNPROTECT(1);
    return addr;',
  includes = "#include <stdio.h>"
)
Comparing get_pointer() to lobstr::obj_addr()
Let's define x and check the addresses individually:
x <- list(a = 1, b = 2, c = 3)
lobstr::obj_addr(x[[1]]) # [1] "0x557c22e8eb28"
lobstr::obj_addr(x[[2]]) # [1] "0x557c22e8eb60"
lobstr::obj_addr(x[[3]]) # [1] "0x557c22e8eb98"
We can now compare the results using lobstr in three ways. I'll use sapply() instead of lapply() as it prints more nicely. We can see that sapply(x, lobstr::obj_addr) is not correct.
lobstr::obj_addrs(x) # correct
# [1] "0x557c22e8eb28" "0x557c22e8eb60" "0x557c22e8eb98"
sapply(x, \(y) lobstr::obj_addr(y)) # correct
# a b c
# "0x557c22e8eb28" "0x557c22e8eb60" "0x557c22e8eb98"
sapply(x, lobstr::obj_addr) # incorrect
# a b c
# "0x557c24fdd7a0" "0x557c24fdd8f0" "0x557c24fdda78"
The question is whether we can get the correct results if we skip the environment stuff. This is where we can use get_pointer():
sapply(x, \(y) get_pointer(y)) # correct
# a b c
# "0x557c22e8eb28" "0x557c22e8eb60" "0x557c22e8eb98"
sapply(x, get_pointer) # correct
# a b c
# "0x557c22e8eb28" "0x557c22e8eb60" "0x557c22e8eb98"
So get_pointer() gets the correct results both times. This indicates that the issue is with lobstr's use of rlang quosure tools. I am not actually sure whether this is an rlang issue, or whether the problem is how lobstr is using rlang. As both packages are part of r-lib, I imagine that a bug report filed to either would find its way to the right place pretty quickly. That said, it's not clear to me how this issue could be resolved while also not increasing the reference count of objects when they are inspected.
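In the meantime, the two forms that gave correct results above can be applied to the example from the question. This is a sketch rather than a definitive fix: it assumes lobstr is installed, and the variable names addrs_vec and addrs_wrapped are only illustrative.

```r
a <- 1:10
b <- a
c <- b
x <- list(a, b, c)

# Option 1: lobstr's own vectorised helper
addrs_vec <- lobstr::obj_addrs(x)

# Option 2: wrap obj_addr() in a function, so that obj_addr() itself
# is not the function handed directly to lapply()/vapply()
addrs_wrapped <- vapply(x, function(el) lobstr::obj_addr(el), character(1))

# Number of distinct addresses reported for a, b and c by each option
length(unique(addrs_vec))
length(unique(addrs_wrapped))
```

Either option avoids passing lobstr::obj_addr directly as the function argument, which is the only form that produced inconsistent addresses in the comparison above.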