I have some left-censored data for CRP (C-reactive protein) and I am wondering how I could impute the values that are below the detection limit in such a way that the imputed values would be inside a desired range (here: 0<imputed_value<0.2).
I'm trying to achieve this with the package 'imputeLCMD', but as the installation of 'imputeLCMD' with all its dependencies is slightly complex, I'm willing to hear of other approaches, too.
Please consider the following MWE:
# Load libraries
library(dplyr)
library(imputeLCMD)
# Assign the dputted random data to a data frame
df <- structure(list(participant_id = 1:10, CRP = c("2.9", "<0.2",
"<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2"), LDL_cholesterol = c(195.7,
145.3, 167.8, 157.3, 110.3, 190, 124.6, 104.2, 132.8, 195.5),
fasting_glucose = c(114.5, 104.6, 102, 119.7, 102.8, 105.4,
97.2, 99.7, 84.5, 77.4), creatinine = c(1.5, 1.4, 1.2, 1.3,
0.5, 1, 1.3, 0.7, 0.8, 0.7)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The simulated data frame dputted above and showcased here below is similar to my real data.
# View the random data
head(df, n=5)
#> # A tibble: 5 × 5
#> participant_id CRP LDL_cholesterol fasting_glucose creatinine
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 2.9 196. 114. 1.5
#> 2 2 <0.2 145. 105. 1.4
#> 3 3 <0.2 168. 102 1.2
#> 4 4 8.8 157. 120. 1.3
#> 5 5 9.4 110. 103. 0.5
However, I'm a bit lost on how to continue from here on. In order to impute the missing values for these left-censored data with the package imputeLCMD, I assume that the left-cencored values have to be converted to NAs first:
df <- df %>%
mutate(CRP = na_if(CRP, "<0.2")) %>%
mutate(CRP = as.numeric(CRP))
head(df, n=5)
#> # A tibble: 5 × 5
#> participant_id CRP LDL_cholesterol fasting_glucose creatinine
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.9 196. 114. 1.5
#> 2 2 NA 145. 105. 1.4
#> 3 3 NA 168. 102 1.2
#> 4 4 8.8 157. 120. 1.3
#> 5 5 9.4 110. 103. 0.5
If I now run one the wrappers from the package imputeLCMD, I do get some kind of a result:
# Impute the missing data
df_imputed <- impute.wrapper.SVD(df, K = 4) %>% as.data.frame()
# Round the result
df_imputed <- df_imputed %>% mutate_at(vars(CRP), ~round(., digits = 1))
# Place the original CRP next to the imputed one for comparison
df_imputed <- df_imputed %>% mutate(original_CRP = df$CRP)
df_imputed <- df_imputed %>% select(1,2,6,3,4,5)
# Display the result
head(df_imputed, n=5)
#> participant_id CRP original_CRP LDL_cholesterol fasting_glucose creatinine
#> 1 1 2.9 2.9 195.7 114.5 1.5
#> 2 2 5.8 NA 145.3 104.6 1.4
#> 3 3 5.9 NA 167.8 102.0 1.2
#> 4 4 8.8 8.8 157.3 119.7 1.3
#> 5 5 9.4 9.4 110.3 102.8 0.5
Created on 2023-05-27 with reprex v2.0.2
My issues:
I cannot figure how to set an argument for the imputeLCMD package so that the imputed values should be: >0 AND <0.2.
How to make sure that imputeLCMD does not take the participant_id itself as numeric data into the impute calculations?
I've seen some great alternative approaches for left-censored data in SO, but it would be nice to know what I'm doing wrong (or right) with 'imputeLCMD'.
You could use maximum likelihood estimation to estimate the distribution of your data while using the information that some observations are below a lower limit of quantification (LLOQ). To get plausible imputed values, you then sample from the < LLOQ region of the estimated distribution.
Let’s demonstrate with some simulated log-normally distributed data:
set.seed(42)
# Simulate some underlying log-normal CRP values
# Parameters from: https://stackoverflow.com/a/63938717/4550695
crp <- rlnorm(10000, 1.355, 1.45)
summary(crp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.011 1.418 3.842 11.158 10.139 2060.559
# Apply a lower limit of quantification to observations
lloq <- 3
crp_obs <- replace(crp, crp < lloq, NA)
summary(crp_obs)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 3.002 5.019 8.738 18.669 18.107 2060.559 4331
For the estimation step, we need to define an objective function; the log-likelihood of a log-normal distribution with left-censored values:
# Any missing values in x are assumed to be known to be < LLOQ
left_censored_log_normal_log_likelihood <- function(mu, sigma, x, lloq) {
sum(dlnorm(na.omit(x), mu, sigma, log = TRUE)) +
sum(is.na(x)) * plnorm(lloq, mu, sigma, log = TRUE)
}
Then we maximize the log-likelihood, given the observed data:
mean_sd <- function(x, ...) {
c(mean(x, ...), sd(x, ...))
}
# Initial values from observed data
theta0 <- mean_sd(log(crp_obs), na.rm = TRUE)
theta0
#> [1] 2.350642 0.924159
fit <- optim(theta0, function(theta) {
-left_censored_log_normal_log_likelihood(theta[1], theta[2], crp_obs, lloq)
})
We can already see we recovered the underlying parameters:
str(fit)
#> List of 5
#> $ par : num [1:2] 1.34 1.45
#> $ value : num 26786
#> $ counts : Named int [1:2] 69 NA
#> ..- attr(*, "names")= chr [1:2] "function" "gradient"
#> $ convergence: int 0
#> $ message : NULL
Finally, sample from the < LLOQ region of the fitted distribution and create a completed dataset:
n <- sum(is.na(crp_obs))
p <- runif(n, 0, plnorm(lloq, fit$par[1], fit$par[2]))
y <- qlnorm(p, fit$par[1], fit$par[2])
summary(y)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00557 0.62488 1.20758 1.32565 1.97396 2.99494
crp_imp <- replace(crp_obs, which(is.na(crp_obs)), y)
We’ve successfully recovered the underlying distribution:
mean_sd(log(crp_imp))
#> [1] 1.337859 1.461132
summary(crp_imp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0056 1.4200 3.8418 11.1576 10.1394 2060.5585
summary(crp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.011 1.418 3.842 11.158 10.139 2060.559
I’ve now wrapped this up into an experimental censlm package:
library(censlm) # remotes::install_github("mikmart/censlm")
#> Loading required package: survival
set.seed(42)
# Simulate left-censored log-normal CRP observations
crp <- rlnorm(10000, 1.355, 1.450)
lloq <- 3
obs <- replace(crp, crp < lloq, lloq)
crpdf <- data.frame(crp, lloq, obs)
# Fit censored linear model and extract (random) imputations
fit <- clm(log(obs) ~ 1, left = log(lloq))
imp <- exp(imputed(fit))
summary(cbind(crpdf, imp))
#> crp lloq obs imp
#> Min. : 0.011 Min. :3 Min. : 3.000 Min. : 0.0056
#> 1st Qu.: 1.418 1st Qu.:3 1st Qu.: 3.000 1st Qu.: 1.4200
#> Median : 3.842 Median :3 Median : 3.842 Median : 3.8418
#> Mean : 11.158 Mean :3 Mean : 11.883 Mean : 11.1576
#> 3rd Qu.: 10.139 3rd Qu.:3 3rd Qu.: 10.139 3rd Qu.: 10.1394
#> Max. :2060.559 Max. :3 Max. :2060.559 Max. :2060.5585
The MLE model fitting is done with survival::survreg()
:
summary(fit)
#>
#> Call:
#> survreg(formula = Surv(log(obs), log(obs) > log(lloq), type = "left") ~
#> 1, dist = "gaussian")
#> Value Std. Error z p
#> (Intercept) 1.3421 0.0168 79.8 <2e-16
#> Log(scale) 0.3749 0.0103 36.3 <2e-16
#>
#> Scale= 1.45
#>
#> Gaussian distribution
#> Loglik(model)= -13460.2 Loglik(intercept only)= -13460.2
#> Number of Newton-Raphson Iterations: 4
#> n= 10000