rdataframefiltergroup-by

Filtering rows between zero values and save as new dataframes or datatables


I have a large csv dataset with more than 45k rows and 19 different variables. I'd like to filter it by a specific variable (V4) so that each filtered group starts with 0 and then the next 0 will mark the start of a new group/dataframe/datatable, while keeping all other variables inside this new table as well. I need those separate groups to further analyse each case of data. I tried:

filtered_data <- my_data %>%       
group_by("V4") %>%       
filter("V4" == 0 & "V4" !=0)      
View(filtered_data)    

The first "V4" == 0 seems to work but I'm struggling how to define the end of each filtered dataframe e.g. how to filter from 0 to 3, then 0 to 5 etc. How can I determine the length of each case? Is there a logical operator that saves each group before V4 turns 0 again? Or would it be better to create a loop?

Example of my_data:

        V1    V2    V3    V4    .    .    .    V19 
    1                     0
    2                     1
    3                     2
    4        `            3
    5                     0
    6                     1
    7                     2
    8                     3
    9                     4
    10                    5
    11                    0
   ...
   45k   

Solution

  • Here is a way to group your rows with basic arithmetic. I create the groups using a cumulative sum of an indicator variable (V4 is 0 or not) and split the data.frame into single dataframes using group_split.

    # example data 12000 rows in total, 4000 groups of 3 rows
    df <- data.frame(V1 = 1:12000, 
                     V2 = sample(LETTERS, 12000, replace = T), 
                     V4 = rep(0:2, 4000))
    
    df <- df %>%
      mutate(Groups = ifelse(V4 == 0, 1, 0),
             Groups = cumsum(Groups)) %>%
      group_split(Groups)
    

    So the first group/dataframe is

    > df[[1]]
    # A tibble: 3 x 4
         V1 V2       V4 Groups
      <int> <chr> <int>  <dbl>
    1     1 L         0      1
    2     2 L         1      1
    3     3 Y         2      1
    

    the second

    > df[[2]]
    # A tibble: 3 x 4
         V1 V2       V4 Groups
      <int> <chr> <int>  <dbl>
    1     4 Z         0      2
    2     5 N         1      2
    3     6 Y         2      2
    

    and so on.

    If you want to save each data.frame seperately you could use something like this:

    # new environment that holds all data.frames
    dfEnv <- new.env()
    
    df %>%
      mutate(Groups = ifelse(V4 == 0, 1, 0),
             Groups = cumsum(Groups)) %>%
      group_by(Groups) %>%
      do({
        # save every group inside the new environment as a single data.frame
        dfEnv[[paste0("Group_", unique(.$Groups))]] <- .
      })
    

    Now you have dfEnv$Group_1, dfEnv$Group_2, ... and so on. Inside do() you could also use saveRDS or write.csv to save the data to disk.