rdataframedata-wrangling

Using R to clean data, collapsing multiple rows into single row per event, but keeping specific information from other rows


I am working on a dataset which has multiple rows per event, and my boss has asked me to tidy it so that there is one row per event. The complexity is that there is no consistent (in a programmatic sense) way of choosing one row over another - there are three variables which might need to be taken from different rows in any given situation.

Because of this, I have gone through the data manually and have created a new variable next to each of these variables, stating which row should be kept for that variable. I now would like to be able to automate the process of collapsing the data to have one row per event and saving information from the correct row in each of the three variables.

Please see below for a dummy dataset:

Event Area Keep_area Date Keep_dat Outcome Keep_out
1 A 01/01/22 Y Fail
1 B Y 13/04/22 Unknown
1 C 24/07/22 Success Y
2 B Y 18/10/21 Y Unknown
2 G 02/11/22 Success Y
3 C 12/01/22 Fail
3 E 28/03/22 Y Unknown
3 F Y 16/12/22 Unknown
3 G 21/01/23 Unknown
3 A 02/02/23 Partial Y

And this is how I'd want it to look:

Event Area Date Outcome
1 B 01/01/22 Success
2 B 18/10/21 Success
3 F 28/03/22 Partial

I know I will probably need to do this in multiple stages, but I'm just struggling to work out how I can use the "keep" variables to say which to keep. I would ideally like to do this in tidyverse but am open to any and all suggestions!

I can't even begin to work out what to do here (sorry, still relatively new to R!) so all I've been able to do so far is identify which pieces of information need to be kept from each row (as shown above).


Solution

  • One way to do this is to use tidyr::pivot_longer() to convert your data to long format, then tidyr::pivot_wider() to return the data to it's original wide format:

    library(dplyr)
    library(tidyr)
    
    # df is your data
    df %>% pivot_longer(!Event) %>%
      mutate(temp = ifelse(value == "Y", lag(value), NA)) %>%
      select(!value) %>%
      filter(!is.na(temp)) %>%
      pivot_wider(names_from = name,,
                  values_from = temp) %>%
      rename(Date = "Keep_dat")
    
    # A tibble: 3 × 4
      Event       Date  Keep_area  Keep_out
      <int>      <chr>   <chr>      <chr>   
    1     1 1/01/2022      B        Success 
    2     2 18/10/2021     B        Success 
    3     3 28/03/2022     F        Partial 
    

    It works by creating a single column that combines all the data in every column except "Event", so one "event" row for each column.

    Because the columns with the data you want are followed by the columns containing the "Y" values, the default for pivot_longer() is to stack the data in the order they appear.

    The result is that the values that you want will always have a "Y" directly below them.