I have a dataset with multiple columns and more than 60k rows. I am using na.approx()
to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.
Dataset - Before removing NAs
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
Dataset - After removing NAs
A A_NA B B_NA C C_NA
1.0 No 2.2 No 3.3 No
3.6 Yes 5.7 No 8.8 No
6.2 No 7.2 No 4.6 Yes
3.0 No 7.2 Yes 1.1 No
Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.
Thanks
UPDATE
Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.
Dataset - Before removing NAs
ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1
You could do this in base R by indexing:
#new columns if it is NA or not
df[paste0(names(df),"_NA")] <- is.na(df)
# use zoo::na.approx
df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])
# reorder (if desired)
df <- df[,sort(names(df))]
Output:
A A_NA B B_NA C C_NA
1 1.0 FALSE 2.2 FALSE 3.30 FALSE
2 3.6 TRUE 5.7 FALSE 8.80 FALSE
3 6.2 FALSE 7.2 FALSE 4.95 TRUE
4 3.0 FALSE NA TRUE 1.10 FALSE
If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df)
with ifelse(is.na(df), "Yes", "No")
Data:
df <- read.table(text = "A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1", header = TRUE)
EDIT Using data table, you can tweak the code by:
library(data.table)
# Identify numeric columns
idxcols <- sapply(dt, is.numeric)
dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]
Data table data:
dt <- data.table::data.table(read.table(text = "ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1", header = TRUE))