I have the classic titanic data. Here is the description of the cleaned data.
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
I first split the data.
set.seed(123)
train_ind <- sample(seq_len(nrow(titanic)), size = smp_size)
train <- titanic[train_ind, ]
test <- titanic[-train_ind, ]
Then I changed the Survived column to 0 and 1.
train$Survived <- as.factor(ifelse(train$Survived == 'Yes', 1, 0))
test$Survived <- as.factor(ifelse(test$Survived == 'Yes', 1, 0))
Finally, I ran the gradient boosting algorithm.
dt_gb <- gbm(Survived ~ ., data = train)
Here are the results.
> print(dt_gb)
gbm(formula = Survived ~ ., data = train)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 0 had non-zero influence.
Since 0 predictors have non-zero influence, the predictions are all NA. Why is this the case? Is anything wrong with my code?
Do not convert Survived to a 0/1 factor in the training and test data. Instead, change the Survived column to a 0/1 vector of numeric type, since gbm's "bernoulli" distribution expects a numeric response coded as 0 and 1.
# e.g. like this
titanic$Survived <- as.numeric(titanic$Survived) - 1
# data should look like this
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
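To make the conversion concrete, here is a minimal base-R sketch on a toy factor. The vectors `survived` and `f01` are hypothetical stand-ins, not the real data. It shows why subtracting 1 works on the original "No"/"Yes" factor, and why a factor that already has levels "0"/"1" needs a detour through `as.character()`.

```r
# Toy stand-in for the original Survived column (hypothetical values)
survived <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# as.numeric() on a factor returns the level codes (1 = "No", 2 = "Yes"),
# so subtracting 1 yields a numeric 0/1 vector
y_codes <- as.numeric(survived) - 1          # 0 1 1 0

# Equivalent and more explicit: compare against the positive level
y_explicit <- as.numeric(survived == "Yes")  # 0 1 1 0

# Pitfall: a factor with levels "0"/"1" still has codes 1/2,
# so convert via character to recover the labels themselves
f01 <- factor(c(0, 1, 1, 0))
y_from_f01 <- as.numeric(as.character(f01))  # 0 1 1 0
```

The explicit comparison does not depend on the order of the factor levels, which makes it the safer choice if the levels might ever be reordered.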
Then fit the model with Bernoulli loss.
dt_gb <- gbm::gbm(formula = Survived ~ ., data = titanic,
distribution = "bernoulli")
> print(dt_gb)
gbm::gbm(formula = Survived ~ ., distribution = "bernoulli",
data = titanic)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 6 had non-zero influence.
Obtain predicted survival probabilities for the first few passengers:
> head(predict(dt_gb, type = "response"))
[1] 0.1200703 0.9024225 0.5875393 0.9271306 0.1200703 0.1200703
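The same fix carries over to the train/test split from the question. The following is only a sketch under stated assumptions. The data frame `toy` is simulated and merely mimics a few of the columns. The 75/25 split size is an assumption, because the question does not show how `smp_size` was defined. The 0.5 cutoff for turning probabilities into classes is likewise an arbitrary choice.

```r
library(gbm)

set.seed(123)

# Simulated stand-in for the cleaned titanic data (hypothetical values)
n <- 400
toy <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = round(runif(n, 1, 70)),
  Fare = round(runif(n, 5, 100), 2)
)
p_yes <- ifelse(toy$Sex == "female", 0.75, 0.20)
toy$Survived <- factor(ifelse(runif(n) < p_yes, "Yes", "No"),
                       levels = c("No", "Yes"))

# Convert the response to numeric 0/1 once, before splitting
toy$Survived <- as.numeric(toy$Survived == "Yes")

# Assumed 75/25 split; the question does not show how smp_size was set
smp_size  <- floor(0.75 * nrow(toy))
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)
train <- toy[train_ind, ]
test  <- toy[-train_ind, ]

# Fit on the training rows only, naming the loss explicitly
fit <- gbm(Survived ~ ., data = train,
           distribution = "bernoulli", n.trees = 100)

# Predicted survival probabilities for the held-out rows
prob <- predict(fit, newdata = test, n.trees = 100, type = "response")

# Classify at an (arbitrary) 0.5 cutoff and compute held-out accuracy
pred     <- as.integer(prob > 0.5)
accuracy <- mean(pred == test$Survived)
```

Converting the response before the split keeps `train` and `test` coded identically, so there is no need to repeat the conversion for each subset.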