I am running into a bit of a roadblock with programming a data-generating function for regression predictions. The normal way to do what I am trying to do (without automating it, as I am seeking to) is the following:
#### Fit Data ####
fit <- lm(Petal.Length ~ Petal.Width + Sepal.Width,iris)
#### Create Test Data ####
newdata <- data.frame(
Petal.Width = mean(iris$Petal.Width),
Sepal.Width = seq(
min(iris$Sepal.Width),
max(iris$Sepal.Width),
length.out = 100
)
)
#### Generate Predictions ####
pred <- predict(fit,newdata=newdata)
pred
The idea is that you select one variable of interest and control the other values by setting them to their mean, then predict the data. This consequently gives you the following predicted values:
1 2 3 4 5 6 7 8
4.133390 4.124783 4.116176 4.107569 4.098962 4.090355 4.081749 4.073142
9 10 11 12 13 14 15 16
4.064535 4.055928 4.047321 4.038714 4.030107 4.021500 4.012893 4.004286
17 18 19 20 21 22 23 24
3.995680 3.987073 3.978466 3.969859 3.961252 3.952645 3.944038 3.935431
25 26 27 28 29 30 31 32
3.926824 3.918217 3.909611 3.901004 3.892397 3.883790 3.875183 3.866576
33 34 35 36 37 38 39 40
3.857969 3.849362 3.840755 3.832148 3.823542 3.814935 3.806328 3.797721
41 42 43 44 45 46 47 48
3.789114 3.780507 3.771900 3.763293 3.754686 3.746079 3.737473 3.728866
49 50 51 52 53 54 55 56
3.720259 3.711652 3.703045 3.694438 3.685831 3.677224 3.668617 3.660010
57 58 59 60 61 62 63 64
3.651404 3.642797 3.634190 3.625583 3.616976 3.608369 3.599762 3.591155
65 66 67 68 69 70 71 72
3.582548 3.573941 3.565335 3.556728 3.548121 3.539514 3.530907 3.522300
73 74 75 76 77 78 79 80
3.513693 3.505086 3.496479 3.487872 3.479266 3.470659 3.462052 3.453445
81 82 83 84 85 86 87 88
3.444838 3.436231 3.427624 3.419017 3.410410 3.401803 3.393197 3.384590
89 90 91 92 93 94 95 96
3.375983 3.367376 3.358769 3.350162 3.341555 3.332948 3.324341 3.315734
97 98 99 100
3.307128 3.298521 3.289914 3.281307
However, I will probably have to do this over and over again and coding all of this by hand every time isn't going to be very efficient, so I am looking to automate it with a custom function.
So far, this is what I have come up with to attempt automating the process, but it is obviously not helpful. The idea is for the function to take all but one of the variables as their mean, and afterwards select one variable as a sequenced number (from its min to its max) like what I have above. The generated data should also retain the names of the predictors plugged in (so they should say "test1" and so on when input into the function):
#### Create Test Data ####
test.data <- data.frame(
test1 = rnorm(100),
test2 = rnorm(100),
test3 = rnorm(100),
test4 = rnorm(100)
)
#### Make Function ####
gen.seq <- function(data,x1,x2,x3,x4){
data <- data
newdata <- data.frame(
x1 = mean(data$x1, na.rm = T),
x2 = mean(data$x2, na.rm = T),
x3 = mean(data$x3, na.rm = T),
x4 = seq(
min(data$x4, na.rm = T),
max(data$x4, na.rm = T),
length.out = 100
)
)
}
#### Generate Mean Controlled Data ####
gen.seq(test.data,
test1,
test2,
test3,
test4)
I would also like it to include the predict function within this function if possible, but without accomplishing the data generation step first, it is futile to do at the moment. How do I accomplish this?
A more general/agnostic answer, which simply creates the data frames:
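reps <- 3 # sequence length
cols <- c("test1", "test2", "test4") # columns to vary
# one-row data frame holding the mean of every column
test.data.mean <- as.data.frame.list(colMeans(test.data))
sapply(
  cols,
  function(x) {
    # names of the columns that are held at their means
    y <- names(test.data.mean)[names(test.data.mean) != x]
    # sequence from the min to the max of the varying column
    z <- setNames(data.frame(seq(min(test.data[x]), max(test.data[x]), length.out = reps)), x)
    z[y] <- test.data.mean[y]
    # restore the original column order
    z[colnames(test.data.mean)]
  },
  simplify = FALSE,
  USE.NAMES = TRUE
)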
resulting in
$test1
test1 test2 test3 test4
1 -1.9394516 -0.03640007 -0.04115825 -0.07265569
2 0.1961531 -0.03640007 -0.04115825 -0.07265569
3 2.3317578 -0.03640007 -0.04115825 -0.07265569
$test2
test1 test2 test3 test4
1 -0.05502075 -2.66943429 -0.04115825 -0.07265569
2 -0.05502075 -0.02634115 -0.04115825 -0.07265569
3 -0.05502075 2.61675199 -0.04115825 -0.07265569
$test4
test1 test2 test3 test4
1 -0.05502075 -0.03640007 -0.04115825 -2.60890222
2 -0.05502075 -0.03640007 -0.04115825 0.01795227
3 -0.05502075 -0.03640007 -0.04115825 2.64480676
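If you want a single function that also feeds the result to predict, one possible sketch is below. The original attempt fails mainly because `data$x1` looks for a column literally named `x1`, so passing column names as strings and indexing with `[[` avoids that. The names `gen_seq`, `vary`, `predictors` and `n` are my own choices for illustration, not anything from an existing package, and the sketch assumes all predictors are numeric.

```r
# Hypothetical helper: `vary` runs from its min to its max,
# every other column named in `predictors` is held at its mean.
gen_seq <- function(data, vary, predictors = names(data), n = 100) {
  stopifnot(vary %in% predictors, all(predictors %in% names(data)))
  # one-row data frame of column means (assumes numeric predictors)
  means <- as.data.frame(as.list(colMeans(data[predictors], na.rm = TRUE)))
  # repeat that row n times, then overwrite the varying column
  newdata <- means[rep(1, n), , drop = FALSE]
  newdata[[vary]] <- seq(
    min(data[[vary]], na.rm = TRUE),
    max(data[[vary]], na.rm = TRUE),
    length.out = n
  )
  rownames(newdata) <- NULL
  newdata
}

# Usage with the model from the question
fit <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)
predictors <- all.vars(formula(fit))[-1] # drop the response variable
newdata <- gen_seq(iris, vary = "Sepal.Width", predictors = predictors)
newdata$pred <- predict(fit, newdata = newdata)
head(newdata)
```

Since this builds the same grid as the hand-written version in the question, the `pred` column should match the predictions listed there. Calling `gen_seq(test.data, vary = "test4")` keeps the `test1` to `test4` column names, as requested.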