I am having difficulties solving the error "there should be the same number of samples in x and y". I notice that others have posted on this site regarding this error, but their solutions have not worked for me. I am attaching an abbreviated version of my dataset here.
x_train
is here:
x_train <- structure(list(laterality = c("Left", "Right", "Right", "Right",
"Left", "Left", "Left", "Left", "Left", "Right"), age = c(66L,
56L, 69L, 49L, 60L, 70L, 58L, 53L, 59L, 64L), insurance = c("MEDICARE",
"UNITED", "MEDICARE", "UNITED", "COMMERCIAL", "MEDICARE", "AETNA",
"AETNA", "OXFORD", "MEDICARE_MANAGED"), employment = c("Retired",
"FullTime", "Retired", "FullTime", "Disabled", "SelfEmployed",
"Retired", "FullTime", "FullTime", "Disabled"), sex = c("Female",
"Male", "Female", "Female", "Female", "Female", "Male", "Male",
"Female", "Male"), race = c("WhiteorCaucasian", "WhiteorCaucasian",
"WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian",
"Other", "BlackorAfricanAmerican", "WhiteorCaucasian", "WhiteorCaucasian"
), ethnicity = c("NotHispanicorLatino", "NotHispanicorLatino",
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino",
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino",
"NotHispanicorLatino", "NotHispanicorLatino"), bmi = c(22.3,
33, 34.3, 36, 30, 20, 29.5, 33.4, 26.5, 34.2), PreferredLanguage = c("English",
"English", "English", "English", "English", "English", "English",
"English", "English", "English"), married = c("Married", "Married",
"Married", "Married", "Married", "Married", "Divorced", "Single",
"Married", "Married"), RadiographSevere = c("No", "No", "No",
"No", "No", "No", "No", "No", "No", "No"), HxAnxietyDepression = c("No",
"No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"), SurgeryYear = c(2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L
), operativetime = c(82L, 79L, 85L, 76L, 84L, 86L, 67L, 75L,
72L, 100L), HipApproach = c("Anterior", "Posterior", "Posterior",
"Posterior", "Posterior", "Anterior", "Posterior", "Posterior",
"Posterior", "Posterior")), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
y_train
is here:
y_train <- structure(list(POD1AverageNrsScoreCut = c("[0,5)", "[0,5)", "[0,5)",
"[0,5)", "[5,10)", "[0,5)", "[0,5)", "[5,10)", "[0,5)", "[0,5)"
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
Code I am using for rfe is here:
library(caret)
control <- rfeControl(functions = rfFuncs, # random forest
method = "repeatedcv", # repeated cv
repeats = 3, # number of repeats
number = 10) # number of folds
result_rfe <- rfe(x = x_train, y = y_train, sizes = c(1:30), rfeControl = control)
I see your output is two classes of limit intervals. Maybe if you try them as factors y = as.factor(unlist(y_train))
? It worked for me
control <- rfeControl(functions = rfFuncs, # random forest
method = "repeatedcv", # repeated cv
repeats = 3, # number of repeats
number = 10) # number of folds
result_rfe <- rfe(x = x_train, y = as.factor(unlist(y_train)), sizes = c(1:30), rfeControl = control)
Output:
>result_rfe
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.06667 0 0.2537 0
2 0.06667 0 0.2537 0
3 0.30000 0 0.4661 0
4 0.20000 0 0.4068 0
5 0.36667 0 0.4901 0
6 0.40000 0 0.4983 0
7 0.43333 0 0.5040 0
8 0.53333 0 0.5074 0 *
9 0.30000 0 0.4661 0
10 0.33333 0 0.4795 0
11 0.20000 0 0.4068 0
12 0.26667 0 0.4498 0
13 0.06667 0 0.2537 0
14 0.13333 0 0.3457 0
15 0.20000 0 0.4068 0
The top 5 variables (out of 8):
insurance, laterality, HipApproach, employment, ethnicity
Note: I don't know if this is what you expected, I don't know the data context and your approach.
Original answer: Subscript out of bounds error in caret's rfe function