I am a long-time forum lurker and a first-time poster. I'm sorry in advance for any deficiencies in my post. This is a rather complicated one.
Description:
I have a Frankenstein dataset gathered from a variety of sources. It contains an inconsistent set of firm ID variables, 18 in total. Let's say I have 5 observations for a firm; I may have 2, 2, 10, or 16 of the identifiers. There is some overlap; the rest are missing (NA
). I want to back out the missing IDs, so I ideally have 18, 18, 18, 18, 18, or perhaps 16, depending on the coverage I have for each firm.
Problem Description: Now, this is a panel dataset, which adds an additional layer of complexity. Firms merge, move, and change legal state over time. Not all ID variables will treat these the same, so the ID variable's ID code might change depending on the observation time. I want to filter these cases out because I cannot be as confident of the correct ID to use.
Approach: I have attempted to loop through the data for each variable (assuming I may pick a few additional matches while filling in the blanks). I group observations by each ID Variable and identify how many distinct observations each variable has. If there is more than 1 distinct observation for any corresponding ID variable, I assume there is a conflict. If there is only 1, I assume there are no conflicts and seek to fill in the missing values. However, my code does not seem to loop correctly, missing potentially imputable data.
Sample Data:
### Not ID 1 should be imputable while ID 2 should not
data <- data.frame(ID = c(1, 1, NA, 2, NA, 2, 2, 2)
ID_Variable_1 = c(10, 10, NA, 20, 20, 20, 19, 18),
ID_Variable_2 = c(NA, 8, 8, 16, 16, 16, NA, NA),
ID_Variable_3 = c(NA, 8, 8, NA, 10, NA, NA, NA)
ID_Variable_4 = c(A10, NA, NA, NA, NA, B12, B12, B12)
Current attempt:
### Extract vector of variables names
column_vector <- names(data)
### Set Function to impute data
Impute_missing_data <- function(data, column_vector){
#### Back-up data
temp_data <- data
#### Loop through each variable
for (i in column_vector) {
#### Identify unique ID combinations, filter out those which contain conflicts
imputation_candidates <- temp_data %>%
group_by_at(vars(i)) %>%
summarise(across(everything(), ~ n_distinct(., na.rm = TRUE))) %>%
filter(if_all(-i, ~ . <= 1), .preserve = TRUE)
#### Copy vector of IDs with imputable data
imputation_applying <- imputation_candidates\[\[i\]\]
#### Update temporary data with imputable data. As all IDs are the same update based on the first non-NA
temp_data <- temp_data %>%
group_by_at(vars(i)) %>%
mutate(across(everything(), ~ ifelse(.x %in% imputation_applying & is.na(.), first(.x[!is.na(.x)]), .x))) %>% ungroup()
}
#### Return updated data
return(temp_data)
}
Desired Output:
data_temp <- data.frame(ID = c(1, 1, 1, NA, 2, NA, 2, 2, 2)
ID_Variable_1 = c(10, 10, 10, 11, 20, 20, 20, 19, 18),
ID_Variable_2 = c(8, 8, 8, ,12, 16, 16, 16, NA, NA),
ID_Variable_3 = c(8, 8, 8, NA, NA, 10, NA, NA, NA)
ID_Variable_4 = c(A10, A10, A10, NA, NA, NA, B12, B12, B12)
data %>%
rownames_to_column()%>%
rows_patch(filter(fill(., everything(), .direction = 'downup'), ID == 1))%>%
column_to_rownames()
Matching, by = "rowname"
ID ID_Variable_1 ID_Variable_2 ID_Variable_3 ID_Variable_4
1 1 10 8 8 A10
2 1 10 8 8 A10
3 1 10 8 8 A10
4 2 20 16 NA <NA>
5 NA 20 16 10 <NA>
6 2 20 16 NA B12
7 2 19 NA NA B12
8 2 18 NA NA B12