rfunctionsplitlapply

How do I remove more than one section from a split data frame in r that uses lapply()?


I have created a function that I use in lapply() an already split data frame. The problem is that lapply() does not fully run through the function if some parts of the conditions of the split data frame are missing. Or is there a way to remove the sections that contain NAs through lapply()? Another problem is that I cannot remove the sections with NA's before I use the lapply() code because lapply() wants all the conditions in the split() function to be there or else it gives the same error as if there are sections of NAs.

I have tried to remove one split section with the following solution and it works:

DATA structure:

Participant   visit  trial  tl    tr    br    bl    time  CoPx  CoPy  
1   006        FU1   lc     81    7     6     1      1    -0.8   -13      
2   006        FU1   lo     79    7     8     1      2    -0.7   -15      
3   006        FU1   rc     7     9     9     0      3    -0.7   -51  
4   006        FU1   ro     9     8     3     1      4    -0.4   -15  
5   006        FU2   lc     9     4     8     1      5    -0.5   -17  
6   006        FU2   lo     79    7     2     10     6    -0.5   -17  
7   006        FU2   rc     7     4     9     1      7    -0.3   -86  
8   006        FU2   ro     7     4     7     13     8    -0.8   -200  
9   009        FU1   lc     81    7     6     1      1    -0.8   -13  
10  009        FU1   lo     79    7     8     1      2    -0.7   -15  
11  009        FU1   rc     7     9     9     0      3    -0.7   -51  
12  009        FU1   ro     9     8     3     1      4    -0.4   -15  
13  009        FU2   lc     9     4     8     1      5    -0.5   -17  
14  009        FU2   lo     79    7     2     10     6    -0.5   -17  
15  009        FU2   rc     NA    NA    NA    NA     NA    NA     NA  
16  009        FU2   ro     7     4     7     13     8    -0.8   -200

CODE:

`minifunc.area <- function(df){
# This mini function is designed to extract the center of the ellipse and the area of the 95%         confidence interval
# The code in this function comes from https://stackoverflow.com/questions/38782051/how-to-    calculate-the-area-of-ellipse-drawn-by-ggplot2

# remove NA's before running filter
#df <- drop_na(df)

# Plot object
p <- ggplot(df, aes(x = CoPx, y = CoPy))+
  geom_point()+
  stat_ellipse(level = 0.95) # 95% confidence interval

# Get ellipse coordinates from plot
pb <- ggplot_build(p)
pb <- as.data.frame(pb$data[[2]])
el <- pb[,1:2]

# Center of ellipse
ctr <- MASS::cov.trob(el)$center  


  # Calculate distance to center from each point on the ellipse
dist2center <- sqrt(rowSums((t(t(el) - ctr))^2))

# Calculate area of ellipse from semi-major and semi-minor axes. 
# These are, respectively, the largest and smallest values of dist2center. 
answer <- pi * min(dist2center) * max(dist2center)

} # end of mini function
  

splt.df <- split(df, list(df$trial, df$Participant))

`

example of output:

$lo.FU2.006
# A tibble: 6,041 × 12
# Groups:   Participant, visit, trial [1]
  Participant visit trial    tl    tr    br    bl  time   CoPx   CoPy
  <chr>       <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>  <dbl>
1 006         FU2   lo     99.4  75.6  67.1  88.3     1 -0.678  -1.198
2 006         FU2   lo     67.5  98.3  44.8  85.6     2 -0.375  -0.375


$rc.FU2.009
# A tibble: 6,041 × 12
# Groups:   Participant, visit, trial [1]
  Participant visit trial    tl    tr    br    bl  time   CoPx   CoPy
  <chr>       <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>  <dbl>
1 009         FU2   rc       NA    NA    NA    NA    NA    NA    NA
# Figure out which df has missing data and define which row is missing
# removing duplicated trials per participant
temp.df <- df %>%
  distinct(Participant, trial)
# creating a new column with assenting numbers to count row number
temp.df$row.nr <- c(1:nrow(temp.df))

# identify which row has the missing value
save.na <- temp.df %>%
  dplyr::filter(Participant == p.nr & trial == trial.name)

na.row.nr <- unlist(save.na1[3]) # chr with the section number that needs to be removed from the split section (example: '15')

# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr], function(splt.df) minifunc.area(splt.df))

This above works, however, now I would like to do the same thing but with two missing sections.

I have tried the attempts below:

# Attempt 1)  
 na.row.nr <- unlist(save.na1[3])
 # where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second

# apply the mini function to the splt data
splt <- lapply(splt.df[-c(na.row.nr[1], na.row.nr[2])], function(splt.df) minifunc.area(splt.df))

 # Attempt 2) 
 na.row.nr <- unlist(save.na1[3])
 # where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second

# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1], -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))

# Attempt 3) 
 na.row.nr <- unlist(save.na1[3])
 # where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second

 missing.rows <- c(na.row.nr[1], na.row.nr[2])

# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1] & -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))

the output:

Error in `[.data.frame`(pb, , 1:2) : undefined columns selected
In addition: Warning message:
Removed 1 rows containing non-finite values (`stat_ellipse()`). 

from my understanding it means that lapply() cannot run through all the sections and stops as soon as it comes to a section that contains NA or does not exist


Solution

  • You can use if() to catch errors before they occur, or as mentioned in the comments, with safely() from purrr.

    I think an ellipse needs at least 3 points minimum, so you can catch this condition with if():

    minifunc.area <- function(df){
    
      if(nrow(df)<3) {
        cat("Not enough rows in data to calculate an ellipse\n")
        return(NA_real_)
      }
    

    The stat_ellipse() function may also fail if there are not enough unique points, so you can capture that again with if():

      # Get ellipse coordinates from plot
      pb <- ggplot_build(p)
      pb <- as.data.frame(pb$data[[2]])
    
      if(nrow(pb)<3 | ncol(pb)<2) return(NA_real_)
    

    This should be enough to handle most situations. Let's try with the whole dataset.

    minifunc.area(df)
    [1] 1031.753
    Warning message:
    Removed 1 row containing non-finite outside the scale range (`stat_ellipse()`).
    

    Now let's split by some of the variables. I'll use sapply here just so the output fits better on the screen.

    Split by Participant:

    splt.df <- split(df, df$Participant)
    sapply(splt.df, minifunc.area)
           6        9 
    291.9459 175.9905
    

    Split by trial:

    splt.df <- split(df, df$trial)
    sapply(splt.df, minifunc.area)
          lc       lo       rc       ro 
    12.24941       NA       NA       NA
    

    Split by participant and trial:

    splt.df <- split(df, list(df$trial, df$Participant))
    lapply(splt.df, minifunc.area)
    lc.6 lo.6 rc.6 ro.6 lc.9 lo.9 rc.9 ro.9 
      NA   NA   NA   NA   NA   NA   NA   NA 
    

    As seen, your function now returns NA if there is not enough data for some of the splits.