rstatisticsmissing-dataimputation

In R, how to impute left-censored missing data to be within a desired range (e.g. 0<imputed_value<0.2)


I have some left-censored data for CRP (C-reactive protein) and I am wondering how I could impute the values that are below the detection limit in such a way that the imputed values would be inside a desired range (here: 0<imputed_value<0.2).

I'm trying to achieve this with the package 'imputeLCMD', but as the installation of 'imputeLCMD' with all its dependencies is slightly complex, I'm willing to hear of other approaches, too.

Please consider the following MWE:

# Load libraries
library(dplyr)
library(imputeLCMD)

# Assign the dputted random data to a data frame
df <- structure(list(participant_id = 1:10, CRP = c("2.9", "<0.2", 
                                                    "<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2"), LDL_cholesterol = c(195.7, 
                                                                                                                                   145.3, 167.8, 157.3, 110.3, 190, 124.6, 104.2, 132.8, 195.5), 
                     fasting_glucose = c(114.5, 104.6, 102, 119.7, 102.8, 105.4, 
                                         97.2, 99.7, 84.5, 77.4), creatinine = c(1.5, 1.4, 1.2, 1.3, 
                                                                                 0.5, 1, 1.3, 0.7, 0.8, 0.7)), row.names = c(NA, -10L), class = c("tbl_df", 
                                                                                                                                                  "tbl", "data.frame"))

The simulated data frame dputted above and showcased here below is similar to my real data.

# View the random data
head(df, n=5)
#> # A tibble: 5 × 5
#>   participant_id CRP   LDL_cholesterol fasting_glucose creatinine
#>            <int> <chr>           <dbl>           <dbl>      <dbl>
#> 1              1 2.9              196.            114.        1.5
#> 2              2 <0.2             145.            105.        1.4
#> 3              3 <0.2             168.            102         1.2
#> 4              4 8.8              157.            120.        1.3
#> 5              5 9.4              110.            103.        0.5

However, I'm a bit lost on how to continue from here on. In order to impute the missing values for these left-censored data with the package imputeLCMD, I assume that the left-cencored values have to be converted to NAs first:

df <- df %>%
  mutate(CRP = na_if(CRP, "<0.2")) %>%
  mutate(CRP = as.numeric(CRP))

head(df, n=5)
#> # A tibble: 5 × 5
#>   participant_id   CRP LDL_cholesterol fasting_glucose creatinine
#>            <int> <dbl>           <dbl>           <dbl>      <dbl>
#> 1              1   2.9            196.            114.        1.5
#> 2              2  NA              145.            105.        1.4
#> 3              3  NA              168.            102         1.2
#> 4              4   8.8            157.            120.        1.3
#> 5              5   9.4            110.            103.        0.5

If I now run one the wrappers from the package imputeLCMD, I do get some kind of a result:

# Impute the missing data
df_imputed <- impute.wrapper.SVD(df, K = 4) %>% as.data.frame()

# Round the result
df_imputed <- df_imputed %>% mutate_at(vars(CRP), ~round(., digits = 1))

# Place the original CRP next to the imputed one for comparison
df_imputed <- df_imputed %>% mutate(original_CRP = df$CRP)
df_imputed <- df_imputed %>% select(1,2,6,3,4,5)

# Display the result
head(df_imputed, n=5)
#>   participant_id CRP original_CRP LDL_cholesterol fasting_glucose creatinine
#> 1              1 2.9          2.9           195.7           114.5        1.5
#> 2              2 5.8           NA           145.3           104.6        1.4
#> 3              3 5.9           NA           167.8           102.0        1.2
#> 4              4 8.8          8.8           157.3           119.7        1.3
#> 5              5 9.4          9.4           110.3           102.8        0.5

Created on 2023-05-27 with reprex v2.0.2

My issues:

  1. I cannot figure how to set an argument for the imputeLCMD package so that the imputed values should be: >0 AND <0.2.

  2. How to make sure that imputeLCMD does not take the participant_id itself as numeric data into the impute calculations?

I've seen some great alternative approaches for left-censored data in SO, but it would be nice to know what I'm doing wrong (or right) with 'imputeLCMD'.


Solution

  • You could use maximum likelihood estimation to estimate the distribution of your data while using the information that some observations are below a lower limit of quantification (LLOQ). To get plausible imputed values, you then sample from the < LLOQ region of the estimated distribution.

    Let’s demonstrate with some simulated log-normally distributed data:

    set.seed(42)
    
    # Simulate some underlying log-normal CRP values
    # Parameters from: https://stackoverflow.com/a/63938717/4550695
    crp <- rlnorm(10000, 1.355, 1.45)
    summary(crp)
    #>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    #>    0.011    1.418    3.842   11.158   10.139 2060.559
    
    # Apply a lower limit of quantification to observations
    lloq <- 3
    crp_obs <- replace(crp, crp < lloq, NA)
    summary(crp_obs)
    #>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
    #>    3.002    5.019    8.738   18.669   18.107 2060.559     4331
    

    For the estimation step, we need to define an objective function; the log-likelihood of a log-normal distribution with left-censored values:

    # Any missing values in x are assumed to be known to be < LLOQ
    left_censored_log_normal_log_likelihood <- function(mu, sigma, x, lloq) {
      sum(dlnorm(na.omit(x), mu, sigma, log = TRUE)) +
        sum(is.na(x)) * plnorm(lloq, mu, sigma, log = TRUE)
    }
    

    Then we maximize the log-likelihood, given the observed data:

    mean_sd <- function(x, ...) {
      c(mean(x, ...), sd(x, ...))
    }
    
    # Initial values from observed data
    theta0 <- mean_sd(log(crp_obs), na.rm = TRUE)
    theta0
    #> [1] 2.350642 0.924159
    
    fit <- optim(theta0, function(theta) {
      -left_censored_log_normal_log_likelihood(theta[1], theta[2], crp_obs, lloq)
    })
    

    We can already see we recovered the underlying parameters:

    str(fit)
    #> List of 5
    #>  $ par        : num [1:2] 1.34 1.45
    #>  $ value      : num 26786
    #>  $ counts     : Named int [1:2] 69 NA
    #>   ..- attr(*, "names")= chr [1:2] "function" "gradient"
    #>  $ convergence: int 0
    #>  $ message    : NULL
    

    Finally, sample from the < LLOQ region of the fitted distribution and create a completed dataset:

    n <- sum(is.na(crp_obs))
    p <- runif(n, 0, plnorm(lloq, fit$par[1], fit$par[2]))
    y <- qlnorm(p, fit$par[1], fit$par[2])
    summary(y)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #> 0.00557 0.62488 1.20758 1.32565 1.97396 2.99494
    
    crp_imp <- replace(crp_obs, which(is.na(crp_obs)), y)
    

    We’ve successfully recovered the underlying distribution:

    mean_sd(log(crp_imp))
    #> [1] 1.337859 1.461132
    
    summary(crp_imp)
    #>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    #>    0.0056    1.4200    3.8418   11.1576   10.1394 2060.5585
    summary(crp)
    #>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    #>    0.011    1.418    3.842   11.158   10.139 2060.559
    

    I’ve now wrapped this up into an experimental censlm package:

    library(censlm) # remotes::install_github("mikmart/censlm")
    #> Loading required package: survival
    set.seed(42)
    
    # Simulate left-censored log-normal CRP observations
    crp <- rlnorm(10000, 1.355, 1.450)
    lloq <- 3
    obs <- replace(crp, crp < lloq, lloq)
    crpdf <- data.frame(crp, lloq, obs)
    
    # Fit censored linear model and extract (random) imputations
    fit <- clm(log(obs) ~ 1, left = log(lloq))
    imp <- exp(imputed(fit))
    
    summary(cbind(crpdf, imp))
    #>       crp                lloq        obs                imp           
    #>  Min.   :   0.011   Min.   :3   Min.   :   3.000   Min.   :   0.0056  
    #>  1st Qu.:   1.418   1st Qu.:3   1st Qu.:   3.000   1st Qu.:   1.4200  
    #>  Median :   3.842   Median :3   Median :   3.842   Median :   3.8418  
    #>  Mean   :  11.158   Mean   :3   Mean   :  11.883   Mean   :  11.1576  
    #>  3rd Qu.:  10.139   3rd Qu.:3   3rd Qu.:  10.139   3rd Qu.:  10.1394  
    #>  Max.   :2060.559   Max.   :3   Max.   :2060.559   Max.   :2060.5585
    

    The MLE model fitting is done with survival::survreg():

    summary(fit)
    #> 
    #> Call:
    #> survreg(formula = Surv(log(obs), log(obs) > log(lloq), type = "left") ~ 
    #>     1, dist = "gaussian")
    #>              Value Std. Error    z      p
    #> (Intercept) 1.3421     0.0168 79.8 <2e-16
    #> Log(scale)  0.3749     0.0103 36.3 <2e-16
    #> 
    #> Scale= 1.45 
    #> 
    #> Gaussian distribution
    #> Loglik(model)= -13460.2   Loglik(intercept only)= -13460.2
    #> Number of Newton-Raphson Iterations: 4 
    #> n= 10000