rdata-cleaningmissing-dataeconomics

Calculating net assets with missing values: discrepancy in results


I'm working with a wealth component dataset that includes variables for housing, business, financial assets, loans, and non-housing loans. These variables have varying levels of randomly allocated missing values (NA).

My goal is to calculate net assets using the formula: housing + business + financial - loan - loan_non_housing

I've implemented the following method to calculate net assets. It handles missing values (NA) carefully. For assets and liabilities, it sums components while ignoring NAs. Then, returns NA if all components are NA (instead of 0). For net assets, the code accounts for all combinations of NA in assets and liabilities. It also gives accurate results even when only one value is available.

df$assets <- apply(df[, c("housing","business", "financial")], 1, function(x) {
  if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})

df$liabilities <- apply(df[, c("loan","loan_non_housing")], 1, function(x) {
  if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})

df$net_assets <- ifelse(is.na(df$assets) & is.na(df$liabilities), NA,
                        ifelse(is.na(df$assets), -df$liabilities,
                               ifelse(is.na(df$liabilities), df$assets,
                                      df$assets - df$liabilities)))

This approach yielded more accurate results compared to simpler calculation methods.

However, I've encountered an issue when calculating the average wealth by age groups and social classes:

fig_net_assets <- df %>%
  group_by(age_groups, social_class) %>%
  summarise(
    mean_net_assets = mean(net_assets, na.rm = TRUE),
    mean_net_assets_manual = mean(housing + business + financial - loan - loan_non_housing, na.rm = TRUE),
    mean_housing = mean(housing, na.rm = TRUE),
    mean_business = mean(business, na.rm = TRUE),
    mean_financial = mean(financial, na.rm = TRUE),
    mean_loan = mean(loan, na.rm = TRUE),
    mean_loan_non_housing = mean(loan_non_housing, na.rm = TRUE)
  )

The results from mean_net_assets = mean(net_assets, na.rm = TRUE) differ significantly from the manually calculated total of mean averages (mean_housing + mean_business + mean_financial - mean_loan - mean_loan_non_housing).

In my actual dataset, the mean_net_assets results are more intuitive. For example, for the working class aged 25-34 in Germany, I get an average wealth of around 50K euros when i calculate it with mean_net_assets. On the other hand, i get 200K euros when i calculate it with mean_net_assets_manual. The latter scenario is very much unlikely compared to other findings in the literature which makes me believe that the way i calculated mean_net_assets for missing values earlier is correct.

My question is: How can I calculate the average wealth for each dimension separately while ensuring that their sum (mean_housing + mean_business + mean_financia - mean_loan - mean_loan_non_housing) matches the overall average net assets calculated using mean(net_assets, na.rm = TRUE)? While keeping in mind that i have to keep the same operationalization for net_assets to handles all missing values correctly.

I've included a reproducible dataset and sample code below for reference:

Any insights or suggestions would be greatly appreciated!

Reproducible code:

library(dplyr)
library(tidyr)

set.seed(123)  # for reproducibility

# Create base dataset
n <- 10000  # number of observations

df <- data.frame(
  age_groups = sample(c("25-34", "35-44", "45-54", "55-64"), n, replace = TRUE, 
                      prob = c(0.3, 0.3, 0.2, 0.2)),
  social_class = sample(c("Self-employed", "Salariat", "Intermediate", "Working class"), n, replace = TRUE, 
                        prob = c(0.1, 0.3, 0.3, 0.3))
)

# Function to generate wealth based on age and social class
generate_wealth <- function(age_group, social_class, base_mean, base_sd) {
  age_factor <- case_when(
    age_group == "25-34" ~ 0.5,
    age_group == "35-44" ~ 0.8,
    age_group == "45-54" ~ 1.2,
    age_group == "55-64" ~ 1.5
  )
  
  class_factor <- case_when(
    social_class == "Working class" ~ 0.6,
    social_class == "Intermediate" ~ 1.0,
    social_class == "Salariat" ~ 1.4,
    social_class == "Self-employed" ~ 1.2
  )
  
  wealth <- rnorm(1, mean = base_mean * age_factor * class_factor, sd = base_sd)
  return(max(wealth, 0))  # Ensure non-negative values
}

# Generate wealth variables
df <- df %>%
  rowwise() %>%
  mutate(
    housing = generate_wealth(age_groups, social_class, 200000, 50000),
    business = generate_wealth(age_groups, social_class, 50000, 20000),
    financial = generate_wealth(age_groups, social_class, 100000, 30000),
    loan = generate_wealth(age_groups, social_class, 20000, 10000),
    loan_non_housing = generate_wealth(age_groups, social_class, 150000, 50000)
  ) %>%
  ungroup()

# Function to introduce missing values
introduce_missing <- function(x, rate) {
  n <- length(x)
  missing_indices <- sample(1:n, size = round(n * rate), replace = FALSE)
  x[missing_indices] <- NA
  return(x)
}

# Introduce missing values
df <- df %>%
  mutate(
    housing = introduce_missing(housing, 0.20),
    business = introduce_missing(business, 0.1),
    financial = introduce_missing(financial, 0.25),
    loan = introduce_missing(loan, 0.20),
    loan_non_housing = introduce_missing(loan_non_housing, 0.15)
  )


# Display missing value percentages
missing_percentages <- df %>%
  summarise(across(everything(), ~mean(is.na(.)) * 100)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Percentage")


# Calculate, assets, liabilities and net assets

df$assets <- apply(df[, c("housing","business", "financial")], 1, function(x) {
  if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})


df$liabilities <- apply(df[, c("loan","loan_non_housing")], 1, function(x) {
  if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})

df$net_assets <- ifelse(is.na(df$assets) & is.na(df$liabilities), NA,
                        ifelse(is.na(df$assets), -df$liabilities,
                               ifelse(is.na(df$liabilities), df$assets,
                                      df$assets - df$liabilities)))



# Create the mean by social class and age groups
fig_net_assets <- df %>%
  group_by(age_groups, social_class) %>%
  summarise(
    mean_net_assets = mean(net_assets, na.rm = TRUE), # this line of code has to be compared with the line below it
    mean_net_assets_manual = mean(housing + business + financial - loan - loan_non_housing, na.rm = TRUE), # this line of code has to be compared with the line above it
    mean_housing = mean(housing, na.rm = TRUE),
    mean_business = mean(business, na.rm = TRUE),
    mean_financial = mean(financial, na.rm = TRUE),
    mean_loan = mean(loan, na.rm = TRUE),
    mean_loan_non_housing = mean(loan_non_housing, na.rm = TRUE)
  )

Solution

  • You have to replace the missing in each of the manual variables (housing, business, etc.) with 0

    library(tidyr)
    
    df %>%
      group_by(age_groups) %>%
      summarise(
        mean_net_assets = mean(net_assets, na.rm = TRUE), 
        mean_net_assets_manual = mean(replace_na(housing, 0) + 
                                        replace_na(df$business, 0) + 
                                        replace_na(df$financial, 0) - 
                                        replace_na(df$loan, 0) - 
                                        replace_na(df$loan_non_housing, 0))
      )
    

    # A tibble: 16 x 4
    # Groups:   age_groups [4]
       age_groups social_class  mean_net_assets mean_net_assets_manual
       <chr>      <chr>                   <dbl>                  <dbl>
     1 25-34      Intermediate           69966.                 69966.
     2 25-34      Salariat               89909.                 89909.
     3 25-34      Self-employed          74832.                 74832.
     4 25-34      Working class          44521.                 44521.
     5 35-44      Intermediate          103401.                103401.
     6 35-44      Salariat              153436.                153436.
     7 35-44      Self-employed         142492.                142492.
     8 35-44      Working class          63927.                 63857.
     9 45-54      Intermediate          158332.                158332.
    10 45-54      Salariat              219914.                219914.
    11 45-54      Self-employed         190177.                190177.
    12 45-54      Working class          96267.                 96267.
    13 55-64      Intermediate          194612.                194612.
    14 55-64      Salariat              282448.                282448.
    15 55-64      Self-employed         241301.                241301.
    16 55-64      Working class         135134.                134903.
    

    The very small differences for rows 8 and 16 are due to the 2 missing values in net_assets.

    df %>%
      group_by(age_groups, social_class) %>%
      summarise(nNA=sum(is.na(net_assets)))
    

    # A tibble: 16 x 3
    # Groups:   age_groups [4]
       age_groups social_class    nNA
       <chr>      <chr>         <int>
     1 25-34      Intermediate      0
    ...
     7 35-44      Self-employed     0
     8 35-44      Working class     1
     9 45-54      Intermediate      0
    ...
    15 55-64      Self-employed     0
    16 55-64      Working class     1