rduplicatesna

convert duplicates into NA


I would like to replace duplicated values per site to NA and to keep the first repeated value that appears from left to right.

For example, on site "Alz-Ettelbruck" the value "7" repeats in columns 4 and 5.That means the column 5 of only that site should return NA. The number 12 repeats in all columns on site "Our-Gemund/Vianden" so I would like to keep the 12 in column 2 but the rest should be replaced to NA.

For that I have used the duplicated function but it returns "NULL".

To reproduce the issue, I have used the following data frame and I have indicated at the end the desired output.

Any help would be truly appreciated. Thank you in advance.


df <- data.frame(stringsAsFactors = FALSE,
                 check.names = FALSE,
                        Site = c("Att-Bissen","Alz-Ettelbruck","Our-Gemund/Vianden",
                                 "Syre Felsmuhle/Mertert","Ernz Blanche-Larochette"),
  `2001-12-01 to 2021-12-01` = c(12, 1, 12, 1, 8),
  `1991-12-01.to 2021-12-01` = c(5, 4, 12, 6, 14),
  `1981-12-01 to 2021-12-01` = c(12, 7, 12, 20, 14),
  `1971-12-01 to 2021-12-01` = c(19, 7, 12, 13, 14))


# Replace repeated values with NA per row
data <- for (i in 1:nrow(df)) {
  df[i, -1][duplicated(df[i, -1])] <- NA
}

The following is what I would like the script to return:

enter image description here


Solution

  • (Edit: the first version of both base-R and dplyr+tidyr code used duplicated, which would falsely-remove the 12 in row 1 column 4. It has been edited to fix that to not use duplicated.)

    base R

    A reduction comparing column to updated-column.

    df[,-1] <- Reduce(
      function(prev, this) replace(this, is.na(prev) | this == prev, this[NA][1]),
      df[,-1], accumulate = TRUE)
    df
    #                      Site 2001-12-01 to 2021-12-01 1991-12-01.to 2021-12-01 1981-12-01 to 2021-12-01 1971-12-01 to 2021-12-01
    # 1              Att-Bissen                       12                        5                       12                       19
    # 2          Alz-Ettelbruck                        1                        4                        7                       NA
    # 3      Our-Gemund/Vianden                       12                       NA                       NA                       NA
    # 4  Syre Felsmuhle/Mertert                        1                        6                       20                       13
    # 5 Ernz Blanche-Larochette                        8                       14                       NA                       NA
    

    I hard-coded df[,-1] in both places, it could easily also be df[,2:5], it just needs to be the same in both places (LHS of <- and within the Reduce).

    dplyr+tidyr

    This loses some efficiency because it double pivots.

    library(dplyr)
    library(tidyr) # pivot_*
    df %>%
      pivot_longer(cols = -Site) %>%
      arrange(Site, desc(name)) %>%
      mutate(.by = "Site", value = if_else(value == lag(value, default=-1L), value[NA], value)) %>%
      pivot_wider(id_cols = Site) %>%
      slice(match(Site, df$Site)) %>%
      select(match(names(.), names(df)))
    # # A tibble: 5 × 5
    #   Site                    `2001-12-01 to 2021-12-01` `1991-12-01.to 2021-12-01` `1981-12-01 to 2021-12-01` `1971-12-01 to 2021-12-01`
    #   <chr>                                        <dbl>                      <dbl>                      <dbl>                      <dbl>
    # 1 Att-Bissen                                      12                          5                         12                         19
    # 2 Alz-Ettelbruck                                   1                          4                          7                         NA
    # 3 Syre Felsmuhle/Mertert                           1                          6                         20                         13
    # 4 Ernz Blanche-Larochette                          8                         14                         NA                         NA
    # 5 Our-Gemund/Vianden                              12                         NA                         NA                         NA
    

    One side-effect of the pivoting is that the order of rows and columns is not guaranteed to be restored, so I added the mostly-aesthetic slice(.) %>% select(.) to the end to marry up with your input data. (It is not at all required.)


    Data

    df <- structure(list(Site = c("Att-Bissen", "Alz-Ettelbruck", "Our-Gemund/Vianden", "Syre Felsmuhle/Mertert", "Ernz Blanche-Larochette"), "2001-12-01 to 2021-12-01" = c(12, 1, 12, 1, 8), "1991-12-01.to 2021-12-01" = c(5, 4, 12, 6, 14), "1981-12-01 to 2021-12-01" = c(12, 7, 12, 20, 14), "1971-12-01 to 2021-12-01" = c(19, 7, 12, 13, 14)), class = "data.frame", row.names = c(NA, -5L))