I'm working with a wealth component dataset that includes variables for housing, business, financial assets, loans, and non-housing loans. These variables have varying levels of randomly allocated missing values (NA).
My goal is to calculate net assets using the formula: housing + business + financial - loan - loan_non_housing
I've implemented the following method to calculate net assets. It handles missing values (NA) carefully. For assets and liabilities, it sums components while ignoring NAs. Then, returns NA if all components are NA (instead of 0). For net assets, the code accounts for all combinations of NA in assets and liabilities. It also gives accurate results even when only one value is available.
df$assets <- apply(df[, c("housing","business", "financial")], 1, function(x) {
if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})
df$liabilities <- apply(df[, c("loan","loan_non_housing")], 1, function(x) {
if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})
df$net_assets <- ifelse(is.na(df$assets) & is.na(df$liabilities), NA,
ifelse(is.na(df$assets), -df$liabilities,
ifelse(is.na(df$liabilities), df$assets,
df$assets - df$liabilities)))
This approach yielded more accurate results compared to simpler calculation methods.
However, I've encountered an issue when calculating the average wealth by age groups and social classes:
fig_net_assets <- df %>%
group_by(age_groups, social_class) %>%
summarise(
mean_net_assets = mean(net_assets, na.rm = TRUE),
mean_net_assets_manual = mean(housing + business + financial - loan - loan_non_housing, na.rm = TRUE),
mean_housing = mean(housing, na.rm = TRUE),
mean_business = mean(business, na.rm = TRUE),
mean_financial = mean(financial, na.rm = TRUE),
mean_loan = mean(loan, na.rm = TRUE),
mean_loan_non_housing = mean(loan_non_housing, na.rm = TRUE)
)
The results from mean_net_assets = mean(net_assets, na.rm = TRUE)
differ significantly from the manually calculated total of mean averages (mean_housing + mean_business + mean_financial - mean_loan - mean_loan_non_housing
).
In my actual dataset, the mean_net_assets
results are more intuitive. For example, for the working class aged 25-34 in Germany, I get an average wealth of around 50K euros when i calculate it with mean_net_assets
. On the other hand, i get 200K euros when i calculate it with mean_net_assets_manual
. The latter scenario is very much unlikely compared to other findings in the literature which makes me believe that the way i calculated mean_net_assets
for missing values earlier is correct.
My question is: How can I calculate the average wealth for each dimension separately while ensuring that their sum (mean_housing + mean_business + mean_financia - mean_loan - mean_loan_non_housing
) matches the overall average net assets calculated using mean(net_assets, na.rm = TRUE)
? While keeping in mind that i have to keep the same operationalization for net_assets
to handles all missing values correctly.
I've included a reproducible dataset and sample code below for reference:
Any insights or suggestions would be greatly appreciated!
Reproducible code:
library(dplyr)
library(tidyr)
set.seed(123) # for reproducibility
# Create base dataset
n <- 10000 # number of observations
df <- data.frame(
age_groups = sample(c("25-34", "35-44", "45-54", "55-64"), n, replace = TRUE,
prob = c(0.3, 0.3, 0.2, 0.2)),
social_class = sample(c("Self-employed", "Salariat", "Intermediate", "Working class"), n, replace = TRUE,
prob = c(0.1, 0.3, 0.3, 0.3))
)
# Function to generate wealth based on age and social class
generate_wealth <- function(age_group, social_class, base_mean, base_sd) {
age_factor <- case_when(
age_group == "25-34" ~ 0.5,
age_group == "35-44" ~ 0.8,
age_group == "45-54" ~ 1.2,
age_group == "55-64" ~ 1.5
)
class_factor <- case_when(
social_class == "Working class" ~ 0.6,
social_class == "Intermediate" ~ 1.0,
social_class == "Salariat" ~ 1.4,
social_class == "Self-employed" ~ 1.2
)
wealth <- rnorm(1, mean = base_mean * age_factor * class_factor, sd = base_sd)
return(max(wealth, 0)) # Ensure non-negative values
}
# Generate wealth variables
df <- df %>%
rowwise() %>%
mutate(
housing = generate_wealth(age_groups, social_class, 200000, 50000),
business = generate_wealth(age_groups, social_class, 50000, 20000),
financial = generate_wealth(age_groups, social_class, 100000, 30000),
loan = generate_wealth(age_groups, social_class, 20000, 10000),
loan_non_housing = generate_wealth(age_groups, social_class, 150000, 50000)
) %>%
ungroup()
# Function to introduce missing values
introduce_missing <- function(x, rate) {
n <- length(x)
missing_indices <- sample(1:n, size = round(n * rate), replace = FALSE)
x[missing_indices] <- NA
return(x)
}
# Introduce missing values
df <- df %>%
mutate(
housing = introduce_missing(housing, 0.20),
business = introduce_missing(business, 0.1),
financial = introduce_missing(financial, 0.25),
loan = introduce_missing(loan, 0.20),
loan_non_housing = introduce_missing(loan_non_housing, 0.15)
)
# Display missing value percentages
missing_percentages <- df %>%
summarise(across(everything(), ~mean(is.na(.)) * 100)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Percentage")
# Calculate, assets, liabilities and net assets
df$assets <- apply(df[, c("housing","business", "financial")], 1, function(x) {
if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})
df$liabilities <- apply(df[, c("loan","loan_non_housing")], 1, function(x) {
if (all(is.na(x))) { return(NA)} else {return(sum(x, na.rm = TRUE)) }})
df$net_assets <- ifelse(is.na(df$assets) & is.na(df$liabilities), NA,
ifelse(is.na(df$assets), -df$liabilities,
ifelse(is.na(df$liabilities), df$assets,
df$assets - df$liabilities)))
# Create the mean by social class and age groups
fig_net_assets <- df %>%
group_by(age_groups, social_class) %>%
summarise(
mean_net_assets = mean(net_assets, na.rm = TRUE), # this line of code has to be compared with the line below it
mean_net_assets_manual = mean(housing + business + financial - loan - loan_non_housing, na.rm = TRUE), # this line of code has to be compared with the line above it
mean_housing = mean(housing, na.rm = TRUE),
mean_business = mean(business, na.rm = TRUE),
mean_financial = mean(financial, na.rm = TRUE),
mean_loan = mean(loan, na.rm = TRUE),
mean_loan_non_housing = mean(loan_non_housing, na.rm = TRUE)
)
You have to replace the missing in each of the manual variables (housing
, business
, etc.) with 0
library(tidyr)
df %>%
group_by(age_groups) %>%
summarise(
mean_net_assets = mean(net_assets, na.rm = TRUE),
mean_net_assets_manual = mean(replace_na(housing, 0) +
replace_na(df$business, 0) +
replace_na(df$financial, 0) -
replace_na(df$loan, 0) -
replace_na(df$loan_non_housing, 0))
)
# A tibble: 16 x 4
# Groups: age_groups [4]
age_groups social_class mean_net_assets mean_net_assets_manual
<chr> <chr> <dbl> <dbl>
1 25-34 Intermediate 69966. 69966.
2 25-34 Salariat 89909. 89909.
3 25-34 Self-employed 74832. 74832.
4 25-34 Working class 44521. 44521.
5 35-44 Intermediate 103401. 103401.
6 35-44 Salariat 153436. 153436.
7 35-44 Self-employed 142492. 142492.
8 35-44 Working class 63927. 63857.
9 45-54 Intermediate 158332. 158332.
10 45-54 Salariat 219914. 219914.
11 45-54 Self-employed 190177. 190177.
12 45-54 Working class 96267. 96267.
13 55-64 Intermediate 194612. 194612.
14 55-64 Salariat 282448. 282448.
15 55-64 Self-employed 241301. 241301.
16 55-64 Working class 135134. 134903.
The very small differences for rows 8 and 16 are due to the 2 missing values in net_assets
.
df %>%
group_by(age_groups, social_class) %>%
summarise(nNA=sum(is.na(net_assets)))
# A tibble: 16 x 3
# Groups: age_groups [4]
age_groups social_class nNA
<chr> <chr> <int>
1 25-34 Intermediate 0
...
7 35-44 Self-employed 0
8 35-44 Working class 1
9 45-54 Intermediate 0
...
15 55-64 Self-employed 0
16 55-64 Working class 1