loops replace stata

How should my loop replace values properly?


I am working on a scientific medical paper using the HADS-Score to assess patients' anxiety and depression. This score consists of 14 items divided into two subscales (HADS-D, HADS-A) of 7 items each, with possible values from 0 to 3 points per item. I have missing data and want to replace them. According to the score manual, I have to drop an observation if more than one item is missing in a subscale. If only one item is missing per subscale, I can replace the missing item with the mean of the six items that are present. I have the HADS-Score items for each observation stored in the variables named in the code below.

I broke the code down into the following steps:

  1. Initialize the subscale scores: creating variables for the subscales HADS-D and HADS-A.

  2. Identify missing values. I created a new variable is_missing_ to identify if it is missing.

  3. Count missing items using egen with rowtotal to count the number of missing items in each subscale.

  4. Drop observations: I dropped any observation where more than one item was missing in either subscale.

  5. Replacing the missing items for each subscale. If an item is missing, it is replaced with the mean of the other six items in the subscale.

  6. Calculate total scores: Sum the scores for each subscale to get the final scores.

PROBLEM: Somehow the loop I created in Step 5 does not replace the missing items in each subscale and leaves missing data (== .).

*STEP 1: Initialize the HADS-A and HADS-D subscales
gen hads_anx_score = .
gen hads_depr_score = .

* STEP 2:Loop over each observation
foreach var in hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec {
    gen is_missing_`var' = missing(`var')
}

* STEP 3: Calculate the number of missing items per subscale
egen missing_hads_anx = rowtotal(is_missing_hads_tense_rec is_missing_hads_glad_rec is_missing_hads_omen_rec is_missing_hads_laugh_rec is_missing_hads_trouble_rec is_missing_hads_happy_rec is_missing_hads_relax_rec)

egen missing_hads_depr = rowtotal(is_missing_hads_limited_rec is_missing_hads_scary_rec is_missing_hads_looks_rec is_missing_hads_restless_rec is_missing_hads_future_rec is_missing_hads_panic_rec is_missing_hads_enjoy_rec)

* STEP 4. Drop observations with more than one missing item in any subscale
drop if missing_hads_anx > 1 | missing_hads_depr > 1

**STEP 5.** Replace single missing items with the mean of the present six items
foreach var in hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec {
    qui replace `var' = (hads_tense_rec + hads_glad_rec + hads_omen_rec + hads_laugh_rec + hads_trouble_rec + hads_happy_rec + hads_relax_rec - `var') / 6 if is_missing_`var' == 1 & missing_hads_anx == 1
}

foreach var in hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec {
    qui replace `var' = (hads_limited_rec + hads_scary_rec + hads_looks_rec + hads_restless_rec + hads_future_rec + hads_panic_rec + hads_enjoy_rec - `var') / 6 if is_missing_`var' == 1 & missing_hads_depr == 1
}

Now, if I run the **fifth step**, there are still missing data (for example hads_limited_rec == . ).


Solution

  • A data example would help mightily. However, it seems possible to identify your bug. On the way, I will suggest simplifications to your code.

    *STEP 1: Initialize the HADS-A and HADS-D subscales
    gen hads_anx_score = .
    gen hads_depr_score = .
    

    Step 1 seems to have no point. You never use or change these variables.

    * STEP 2:Loop over each observation
    foreach var in hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec {
        gen is_missing_`var' = missing(`var')
    }
    
    * STEP 3: Calculate the number of missing items per subscale
    egen missing_hads_anx = rowtotal(is_missing_hads_tense_rec is_missing_hads_glad_rec is_missing_hads_omen_rec is_missing_hads_laugh_rec is_missing_hads_trouble_rec is_missing_hads_happy_rec is_missing_hads_relax_rec)
    
    egen missing_hads_depr = rowtotal(is_missing_hads_limited_rec is_missing_hads_scary_rec is_missing_hads_looks_rec is_missing_hads_restless_rec is_missing_hads_future_rec is_missing_hads_panic_rec is_missing_hads_enjoy_rec)
    
    

    Steps 2 and 3 can be replaced by two statements. You don't need any of those indicator variables for missing.

    egen missing_hads_anx = rowmiss(hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec)
    
    egen missing_hads_depr = rowmiss(hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec)
    
    * STEP 4. Drop observations with more than one missing item in any subscale
    drop if missing_hads_anx > 1 | missing_hads_depr > 1
    

    Step 4 seems fine.
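
    As a quick check of the rowmiss() simplification suggested above, here is a minimal sketch on made-up data. The variables a, b, c, n_missing and n_missing_check are illustrative names only and are not part of the question's dataset.

    ```stata
    * Toy data with a few missing values (hypothetical variables a, b, c)
    clear
    input a b c
    1 . 3
    . . 2
    0 1 2
    end

    * Count missing values per observation in one step
    egen n_missing = rowmiss(a b c)

    * The same count built by hand from missing() indicators
    generate n_missing_check = missing(a) + missing(b) + missing(c)

    * Both approaches should agree in every observation
    assert n_missing == n_missing_check
    list
    ```

    The hand-built version works here because missing() itself returns 0 or 1 and never missing, so adding its results is safe.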

    **STEP 5.** Replace single missing items with the mean of the present six items
    foreach var in hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec {
        qui replace `var' = (hads_tense_rec + hads_glad_rec + hads_omen_rec + hads_laugh_rec + hads_trouble_rec + hads_happy_rec + hads_relax_rec - `var') / 6 if is_missing_`var' == 1 & missing_hads_anx == 1
    }
    
    
    foreach var in hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec {
        qui replace `var' = (hads_limited_rec + hads_scary_rec + hads_looks_rec + hads_restless_rec + hads_future_rec + hads_panic_rec + hads_enjoy_rec - `var') / 6 if is_missing_`var' == 1 & missing_hads_depr == 1
    }
    

    The code in Step 5 is buggy. The RHS will always be missing if any of the original variables is missing, and otherwise, when nothing is missing, the if condition is never true, so nothing is replaced either way. The subtle difference is that generate and replace won't ignore missings in a total, whereas egen functions exist to do that.

    In essence 3 + . is returned as missing, not 3 (and the same holds for any other sum of non-missing and missing values).
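
    A minimal sketch of that difference, using a made-up two-variable dataset (a, b, plus_sum and row_sum are illustrative names, not variables from the question):

    ```stata
    * Toy data: the first observation has a missing value in b
    clear
    input a b
    3 .
    3 4
    end

    * Plain arithmetic propagates missing: the first observation is missing
    generate plus_sum = a + b

    * egen's rowtotal() ignores missing values: the first observation is 3
    egen row_sum = rowtotal(a b)

    assert missing(plus_sum) in 1
    assert row_sum == 3 in 1
    list
    ```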

    You need first the mean of the non-missing values.

    egen mean_hads_anx = rowmean(hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec)
    
    egen mean_hads_depr = rowmean(hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec)
    

    Then you need the totals of the non-missing values.

    egen score_hads_anx = rowtotal(hads_tense_rec hads_glad_rec hads_omen_rec hads_laugh_rec hads_trouble_rec hads_happy_rec hads_relax_rec)
    
    egen score_hads_depr = rowtotal(hads_limited_rec hads_scary_rec hads_looks_rec hads_restless_rec hads_future_rec hads_panic_rec hads_enjoy_rec)
    

    Then the totals should be adjusted if and only if exactly one value is missing in the subscale concerned:

    replace score_hads_depr = score_hads_depr + mean_hads_depr if missing_hads_depr == 1
    
    replace score_hads_anx = score_hads_anx + mean_hads_anx if missing_hads_anx == 1
    

    Alternatively, the fix is just to multiply the score from the 6 non-missing items by 7/6 whenever one item is missing, because the mean of those items is their total divided by 6.
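
    The following sketch on toy data suggests that three routes give the same subscale score: adding the row mean to the total, multiplying the total by 7/6, and filling in the single missing item before totalling. All names (i1 to i7, n_miss, item_mean, score, score_a, score_b, score_c) are hypothetical stand-ins for one subscale. It also assumes that the imputed item value is not rounded, which the question does not specify.

    ```stata
    * Toy data: i1-i7 stand in for the seven items of one subscale
    clear
    input i1 i2 i3 i4 i5 i6 i7
    1 2 3 0 1 2 .
    1 1 1 1 1 1 1
    end

    egen n_miss    = rowmiss(i1-i7)
    egen item_mean = rowmean(i1-i7)
    egen score     = rowtotal(i1-i7)

    * Route 1: total of the present items plus their mean when one is missing
    generate score_a = score + cond(n_miss == 1, item_mean, 0)

    * Route 2: 7/6 times the total of the six present items
    generate score_b = cond(n_miss == 1, score * 7/6, score)

    * Route 3: fill the single missing item with the row mean, then total
    foreach var of varlist i1-i7 {
        replace `var' = item_mean if missing(`var') & n_miss == 1
    }
    egen score_c = rowtotal(i1-i7)

    * All three routes should agree up to floating-point rounding
    assert abs(score_a - score_b) < 1e-6
    assert abs(score_a - score_c) < 1e-6
    list i1-i7 score_a score_b score_c
    ```

    Route 3 overwrites the item variables, so it should be used instead of the adjustment to the totals, not in addition to it. Otherwise the imputed value would be counted twice.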