In my dataset I have a binary Target variable (0 or 1) and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor while the others are numeric. I built a decision tree classifier, but my question concerns the feature scaling:
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])
The model returns Tmean = -0.057 and ntags = 2 as two splitting points. How can I recover the original values of these two features, i.e. the values the variables had before the rescaling performed by scale()?
If the data were scaled with scale, the following function unscale might help. scale() centers each column by its mean and divides it by its standard deviation, storing both in the attributes "scaled:center" and "scaled:scale", so the transformation can be inverted as x = z * scale + center. The original values and the unscaled ones are all.equal but not identical, due to floating-point precision.
unscale <- function(x){
  # Attributes stored by scale(): column means and standard deviations
  xbar <- attr(x, "scaled:center")
  se <- attr(x, "scaled:scale")
  if(is.null(xbar) && is.null(se)){
    # Nothing to undo if the object was never scaled
    x
  } else {
    # Invert z = (x - center)/scale column by column: x = z*scale + center
    y <- t(se * t(x) + xbar)
    attr(y, "scaled:center") <- NULL
    attr(y, "scaled:scale") <- NULL
    y
  }
}
set.seed(2020)
A <- matrix(rnorm(120, sd = 16), ncol = 5)
s <- scale(A)
identical(A, unscale(s)) #FALSE
zeros <- as.vector(A - unscale(s))
all.equal(zeros, rep(0, 120))
#[1] TRUE
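A single scaled value, such as a split point reported by the model, can also be mapped back directly from the attributes that scale() stores, without unscaling the whole matrix. A minimal sketch using the matrix s from above; the column index j and the value -0.057 are only for illustration:
j <- 3                               # column of interest, chosen for illustration
ctr <- attr(s, "scaled:center")[j]   # mean of column j stored by scale()
sdv <- attr(s, "scaled:scale")[j]    # standard deviation of column j stored by scale()
-0.057 * sdv + ctr                   # the split point on the original scale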
The function also works with data.frames, but the class of its output is "matrix", not the original "data.frame", because scale() itself always returns a matrix.
B <- as.data.frame(matrix(A, ncol = 5))
s2 <- scale(B)
B2 <- as.data.frame(unscale(s2))
all.equal(B, B2)
#[1] TRUE
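The matrix class can be checked directly (on R >= 4.0.0 the class of a matrix prints as "matrix" "array"):
class(s2)           # "matrix" "array": scale() returns a matrix
class(unscale(s2))  # "matrix" "array": unscale() keeps the matrix structure
class(B2)           # "data.frame" after the explicit conversion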
But the right way of scaling/unscaling an object with a dim attribute, such as a data.frame, is column by column, so that each column keeps its own "scaled:center" and "scaled:scale" attributes. This can be done with an lapply loop, for instance.
s3 <- B
s3[] <- lapply(B, scale)
B3 <- s3
B3[] <- lapply(s3, unscale)
all(abs(B - B3) < .Machine$double.eps^0.5)  # compare with a tolerance for floating-point error
#[1] TRUE
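Applied to the setting of the question, this means keeping the object returned by scale() (or at least its two attributes) before writing the scaled features back into the data.frame, so the reported split points can be mapped back. A hedged sketch, reusing the column selection from the question and assuming Tmean and ntags are among the scaled columns:
sc <- scale(training_set[-c(2, 4)])      # keep the matrix together with its attributes
training_set[-c(2, 4)] <- sc             # scaled features go back into the data.frame
ctr <- attr(sc, "scaled:center")
sdv <- attr(sc, "scaled:scale")
-0.057 * sdv["Tmean"] + ctr["Tmean"]     # split on Tmean, back on the original scale
2 * sdv["ntags"] + ctr["ntags"]          # split on ntags, back on the original scale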