rnana.approx

How to mark rows with NAs which were replaced by na.approx()?


I have a dataset with multiple columns and more than 60k rows. I am using na.approx() to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.

Dataset - Before removing NAs
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1

Dataset - After removing NAs
A    A_NA   B     B_NA    C    C_NA
1.0  No     2.2   No      3.3  No
3.6  Yes    5.7   No      8.8  No
6.2  No     7.2   No      4.6  Yes
3.0  No     7.2   Yes     1.1  No 

Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.

Thanks

UPDATE

Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.

Dataset - Before removing NAs
ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1

Solution

  • You could do this in base R by indexing:

    #new columns if it is NA or not
    df[paste0(names(df),"_NA")] <- is.na(df)
    
    # use zoo::na.approx
    df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])
    
    # reorder (if desired)
    df <- df[,sort(names(df))]
    

    Output:

        A  A_NA   B  B_NA    C  C_NA
    1 1.0 FALSE 2.2 FALSE 3.30 FALSE
    2 3.6  TRUE 5.7 FALSE 8.80 FALSE
    3 6.2 FALSE 7.2 FALSE 4.95  TRUE
    4 3.0 FALSE  NA  TRUE 1.10 FALSE
    

    If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df) with ifelse(is.na(df), "Yes", "No")

    Data:

    df <- read.table(text = "A    B    C
    1.0  2.2  3.3
    NA   5.7  8.8
    6.2  7.2  NA
    3.0  NA   1.1", header = TRUE)
    

    EDIT Using data table, you can tweak the code by:

    library(data.table)
    
    # Identify numeric columns
    idxcols <- sapply(dt, is.numeric)
    
    dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]
    

    Data table data:

    dt <- data.table::data.table(read.table(text = "ID        Trial     Weight    Height    Depth     
    1         Wheat     1.0       2.2       3.3
    1         Wheat     NA        5.7       8.8
    2         Maize     6.2       7.2       NA
    4         Maize     3.0       NA        1.1", header = TRUE))