I'm fairly sure this is a bug, but I wanted to put it to the community first. The example page for the Reshape
function of the splitstackshape package has the following:
set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"), varA.1 = sample(letters, 6),
varA.2 = sample(letters, 6), varA.3 = sample(letters, 6),
varB.2 = sample(10, 6), varB.3 = sample(10, 6),
varC.3 = rnorm(6))
mydf
id_1 id_2 varA.1 varA.2 varA.3 varB.2 varB.3 varC.3
1 1 A g y r 4 3 -0.04493361
2 2 B j q j 7 4 -0.01619026
3 3 A n p s 8 1 0.94383621
4 4 B u b l 2 10 0.82122120
5 5 A e e p 10 6 0.59390132
6 6 B s d u 1 2 0.91897737
Then,
## Note that these data are unbalanced
## reshape() will not work
## Not run:
reshape(mydf, direction = "long", idvar=1:2, varying=3:ncol(mydf))
## End(Not run)
## The Reshape() function can handle such scenarios
Reshape(mydf, id.vars = c("id_1", "id_2"),
var.stubs = c("varA", "varB", "varC"))
id_1 id_2 time varA varB varC
1: 1 A 1 g 4 -0.04493361
2: 2 B 1 j 7 -0.01619026
3: 3 A 1 n 8 0.94383621
4: 4 B 1 u 2 0.82122120
5: 5 A 1 e 10 0.59390132
6: 6 B 1 s 1 0.91897737
7: 1 A 2 y 3 NA
8: 2 B 2 q 4 NA
9: 3 A 2 p 1 NA
10: 4 B 2 b 10 NA
11: 5 A 2 e 6 NA
12: 6 B 2 d 2 NA
13: 1 A 3 r NA NA
14: 2 B 3 j NA NA
15: 3 A 3 s NA NA
16: 4 B 3 l NA NA
17: 5 A 3 p NA NA
18: 6 B 3 u NA NA
But based on the variable names (the numeric suffixes to be precise) in the wide format, shouldn't the output be:
id_1 id_2 time varA varB varC
1: 1 A 1 g NA NA
2: 2 B 1 j NA NA
3: 3 A 1 n NA NA
4: 4 B 1 u NA NA
5: 5 A 1 e NA NA
6: 6 B 1 s NA NA
7: 1 A 2 y 4 NA
8: 2 B 2 q 7 NA
9: 3 A 2 p 8 NA
10: 4 B 2 b 2 NA
11: 5 A 2 e 10 NA
12: 6 B 2 d 1 NA
13: 1 A 3 r 3 -0.04493361
14: 2 B 3 j 4 -0.01619026
15: 3 A 3 s 1 0.94383621
16: 4 B 3 l 10 0.82122120
17: 5 A 3 p 6 0.59390132
18: 6 B 3 u 2 0.91897737
After all, varA was measured at all three time points (1, 2, and 3), varB at time points 2 and 3, and varC only at time point 3. Am I missing something obvious?
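One way to sanity-check that expectation with base R alone is to pad the wide data with all-NA columns for the missing stub/time pairs, so that reshape() can line the columns up by suffix. This is only a sketch under my own assumptions: the names stubs, times, padded, and long are made up for illustration, and it is not how Reshape() works internally.

```r
# Rebuild the example data from the question.
set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"), varA.1 = sample(letters, 6),
                   varA.2 = sample(letters, 6), varA.3 = sample(letters, 6),
                   varB.2 = sample(10, 6), varB.3 = sample(10, 6),
                   varC.3 = rnorm(6))

stubs <- c("varA", "varB", "varC")  # hypothetical helper: the variable stubs
times <- 1:3                        # hypothetical helper: the numeric suffixes

# Add an all-NA column for every stub/time pair missing from the wide data
# (here varB.1, varC.1 and varC.2), so that the data become balanced.
padded <- mydf
for (s in stubs) {
  for (t in times) {
    nm <- paste(s, t, sep = ".")
    if (!nm %in% names(padded)) padded[[nm]] <- NA
  }
}

# With balanced data, base reshape() can pair each stub with each time point.
long <- reshape(padded, direction = "long",
                idvar = c("id_1", "id_2"),
                varying = lapply(stubs, function(s) paste(s, times, sep = ".")),
                v.names = stubs, timevar = "time", times = times)
rownames(long) <- NULL
long
```

If the suffixes are honoured, varB should be NA only at time 1 and varC should be NA at times 1 and 2, which is the layout I expected above.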
The tidyr version seems to get it right:
> library(tidyr)
> mydf %>% gather(key="variable", value="value", varA.1:varC.3) %>%
+ separate(variable, into=c("variable","time")) %>%
+ spread("variable", "value")
id_1 id_2 time varA varB varC
1 1 A 1 g <NA> <NA>
2 1 A 2 y 4 <NA>
3 1 A 3 r 3 -0.0449336090152309
4 2 B 1 j <NA> <NA>
5 2 B 2 q 7 <NA>
6 2 B 3 j 4 -0.0161902630989461 ...
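As an aside, the gather()/spread() route stacks all values into one column, which is why varB and varC come back as character above. Assuming a tidyr version that provides pivot_longer() (1.0.0 or later, as far as I know), the same reshaping can be sketched in a single call that keeps each column's type; the name long is just for illustration.

```r
library(tidyr)

# Rebuild the example data from the question.
set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"), varA.1 = sample(letters, 6),
                   varA.2 = sample(letters, 6), varA.3 = sample(letters, 6),
                   varB.2 = sample(10, 6), varB.3 = sample(10, 6),
                   varC.3 = rnorm(6))

# Split each column name at the dot: the part before it (".value") becomes
# the output column name, the part after it goes into a "time" column.
long <- pivot_longer(mydf, cols = -c(id_1, id_2),
                     names_to = c(".value", "time"), names_sep = "\\.")

# The suffix is extracted as text, so convert it to integer afterwards.
long$time <- as.integer(long$time)
long
```

Stub/time pairs that do not exist in the wide data (varB.1, varC.1, varC.2) should simply come out as NA here.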
This has been fixed in version 1.4.4, now available on CRAN. Thanks for reporting the bug.
After running update.packages(), you should be able to get the following:
packageVersion("splitstackshape")
## [1] ‘1.4.4’
Reshape(mydf, id.vars = c("id_1", "id_2"), var.stubs = c("varA", "varB", "varC"))
## id_1 id_2 time varA varB varC
## 1: 1 A 1 g NA NA
## 2: 2 B 1 j NA NA
## 3: 3 A 1 n NA NA
## 4: 4 B 1 u NA NA
## 5: 5 A 1 e NA NA
## 6: 6 B 1 s NA NA
## 7: 1 A 2 y 4 NA
## 8: 2 B 2 q 7 NA
## 9: 3 A 2 p 8 NA
## 10: 4 B 2 b 2 NA
## 11: 5 A 2 e 10 NA
## 12: 6 B 2 d 1 NA
## 13: 1 A 3 r 3 -0.04493361
## 14: 2 B 3 j 4 -0.01619026
## 15: 3 A 3 s 1 0.94383621
## 16: 4 B 3 l 10 0.82122120
## 17: 5 A 3 p 6 0.59390132
## 18: 6 B 3 u 2 0.91897737