I have created a function that I use in lapply() an already split data frame. The problem is that lapply() does not fully run through the function if some parts of the conditions of the split data frame are missing. Or is there a way to remove the sections that contain NAs through lapply()? Another problem is that I cannot remove the sections with NA's before I use the lapply() code because lapply() wants all the conditions in the split() function to be there or else it gives the same error as if there are sections of NAs.
I have tried to remove one split section with the following solution and it works:
DATA structure:
Participant visit trial tl tr br bl time CoPx CoPy
1 006 FU1 lc 81 7 6 1 1 -0.8 -13
2 006 FU1 lo 79 7 8 1 2 -0.7 -15
3 006 FU1 rc 7 9 9 0 3 -0.7 -51
4 006 FU1 ro 9 8 3 1 4 -0.4 -15
5 006 FU2 lc 9 4 8 1 5 -0.5 -17
6 006 FU2 lo 79 7 2 10 6 -0.5 -17
7 006 FU2 rc 7 4 9 1 7 -0.3 -86
8 006 FU2 ro 7 4 7 13 8 -0.8 -200
9 009 FU1 lc 81 7 6 1 1 -0.8 -13
10 009 FU1 lo 79 7 8 1 2 -0.7 -15
11 009 FU1 rc 7 9 9 0 3 -0.7 -51
12 009 FU1 ro 9 8 3 1 4 -0.4 -15
13 009 FU2 lc 9 4 8 1 5 -0.5 -17
14 009 FU2 lo 79 7 2 10 6 -0.5 -17
15 009 FU2 rc NA NA NA NA NA NA NA
16 009 FU2 ro 7 4 7 13 8 -0.8 -200
CODE:
`minifunc.area <- function(df){
# This mini function is designed to extract the center of the ellipse and the area of the 95% confidence interval
# The code in this function comes from https://stackoverflow.com/questions/38782051/how-to- calculate-the-area-of-ellipse-drawn-by-ggplot2
# remove NA's before running filter
#df <- drop_na(df)
# Plot object
p <- ggplot(df, aes(x = CoPx, y = CoPy))+
geom_point()+
stat_ellipse(level = 0.95) # 95% confidence interval
# Get ellipse coordinates from plot
pb <- ggplot_build(p)
pb <- as.data.frame(pb$data[[2]])
el <- pb[,1:2]
# Center of ellipse
ctr <- MASS::cov.trob(el)$center
# Calculate distance to center from each point on the ellipse
dist2center <- sqrt(rowSums((t(t(el) - ctr))^2))
# Calculate area of ellipse from semi-major and semi-minor axes.
# These are, respectively, the largest and smallest values of dist2center.
answer <- pi * min(dist2center) * max(dist2center)
} # end of mini function
splt.df <- split(df, list(df$trial, df$Participant))
`
example of output:
$lo.FU2.006 # A tibble: 6,041 × 12 # Groups: Participant, visit, trial [1] Participant visit trial tl tr br bl time CoPx CoPy <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> 1 006 FU2 lo 99.4 75.6 67.1 88.3 1 -0.678 -1.198 2 006 FU2 lo 67.5 98.3 44.8 85.6 2 -0.375 -0.375 $rc.FU2.009 # A tibble: 6,041 × 12 # Groups: Participant, visit, trial [1] Participant visit trial tl tr br bl time CoPx CoPy <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> 1 009 FU2 rc NA NA NA NA NA NA NA
# Figure out which df has missing data and define which row is missing
# removing duplicated trials per participant
temp.df <- df %>%
distinct(Participant, trial)
# creating a new column with assenting numbers to count row number
temp.df$row.nr <- c(1:nrow(temp.df))
# identify which row has the missing value
save.na <- temp.df %>%
dplyr::filter(Participant == p.nr & trial == trial.name)
na.row.nr <- unlist(save.na1[3]) # chr with the section number that needs to be removed from the split section (example: '15')
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr], function(splt.df) minifunc.area(splt.df))
This above works, however, now I would like to do the same thing but with two missing sections.
I have tried the attempts below:
# Attempt 1)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
# apply the mini function to the splt data
splt <- lapply(splt.df[-c(na.row.nr[1], na.row.nr[2])], function(splt.df) minifunc.area(splt.df))
# Attempt 2)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1], -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))
# Attempt 3)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
missing.rows <- c(na.row.nr[1], na.row.nr[2])
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1] & -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))
the output:
Error in `[.data.frame`(pb, , 1:2) : undefined columns selected In addition: Warning message: Removed 1 rows containing non-finite values (`stat_ellipse()`).
from my understanding it means that lapply() cannot run through all the sections and stops as soon as it comes to a section that contains NA or does not exist
You can use if()
to catch errors before they occur, or as mentioned in the comments, with safely()
from purrr.
I think an ellipse needs at least 3 points minimum, so you can catch this condition with if()
:
minifunc.area <- function(df){
if(nrow(df)<3) {
cat("Not enough rows in data to calculate an ellipse\n")
return(NA_real_)
}
The stat_ellipse()
function may also fail if there are not enough unique points, so you can capture that again with if()
:
# Get ellipse coordinates from plot
pb <- ggplot_build(p)
pb <- as.data.frame(pb$data[[2]])
if(nrow(pb)<3 | ncol(pb)<2) return(NA_real_)
This should be enough to handle most situations. Let's try with the whole dataset.
minifunc.area(df)
[1] 1031.753
Warning message:
Removed 1 row containing non-finite outside the scale range (`stat_ellipse()`).
Now let's split by some of the variables. I'll use sapply
here just so the output fits better on the screen.
Split by Participant:
splt.df <- split(df, df$Participant)
sapply(splt.df, minifunc.area)
6 9
291.9459 175.9905
Split by trial:
splt.df <- split(df, df$trial)
sapply(splt.df, minifunc.area)
lc lo rc ro
12.24941 NA NA NA
Split by participant and trial:
splt.df <- split(df, list(df$trial, df$Participant))
lapply(splt.df, minifunc.area)
lc.6 lo.6 rc.6 ro.6 lc.9 lo.9 rc.9 ro.9
NA NA NA NA NA NA NA NA
As seen, your function now returns NA if there is not enough data for some of the splits.