I have the dataframe
test <- structure(list(
  y2002 = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  y2003 = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  y2004 = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  y2005 = c("senior", "senior", "senior", NA, NA, NA)),
  .Names = c("2002", "2003", "2004", "2005"),
  row.names = 1:6,
  class = "data.frame")
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   <NA>
5 sophomore sophomore    senior   <NA>
6    senior    senior      <NA>   <NA>
and I need to create a vertex/edge list (for use with igraph) with one entry for every time a student's category changes between consecutive years, ignoring years in which there is no change, as in
testvertices <- structure(list(
  vertex = c("freshman", "junior", "freshman", "junior", "sophomore", "freshman",
             "junior", "sophomore", "sophomore", "sophomore"),
  edge = c("junior", "senior", "junior", "sophomore", "senior", "junior",
           "sophomore", "senior", "senior", "senior"),
  id = c("1", "1", "2", "2", "2", "3", "3", "3", "4", "5")),
  .Names = c("vertex", "edge", "id"),
  row.names = c(1:10),
  class = "data.frame")
> testvertices
      vertex      edge id
1   freshman    junior  1
2     junior    senior  1
3   freshman    junior  2
4     junior sophomore  2
5  sophomore    senior  2
6   freshman    junior  3
7     junior sophomore  3
8  sophomore    senior  3
9  sophomore    senior  4
10 sophomore    senior  5
At this point I'm ignoring the ids; my graph should weight edges by count (e.g., freshman -> junior = 3). The idea is to make a tree graph. I know this is beside the main munging point, but that's in case you ask...
If I understand you correctly, you need something like this:
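elist <- lapply(seq_len(nrow(test)), function(i) {
  x <- as.character(test[i,])      # row i as a character vector
  x <- unique(na.omit(x))          # drop NAs and repeated categories
  x <- rep(x, each=2)              # repeat every vertex twice
  x <- x[-1]                       # drop the first element
  x <- x[-length(x)]               # drop the last element
  r <- matrix(x, ncol=2, byrow=TRUE)   # consecutive pairs, one edge per row
  # add the row id as a third column; handle the zero-row case separately
  if (nrow(r) > 0) { r <- cbind(r, i) } else { r <- cbind(r, numeric()) }
  r
})
do.call(rbind, elist)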
#                              i
# [1,] "freshman"  "junior"    "1"
# [2,] "junior"    "senior"    "1"
# [3,] "freshman"  "junior"    "2"
# [4,] "junior"    "sophomore" "2"
# [5,] "sophomore" "senior"    "2"
# [6,] "freshman"  "junior"    "3"
# [7,] "junior"    "sophomore" "3"
# [8,] "sophomore" "senior"    "3"
# [9,] "sophomore" "senior"    "4"
#[10,] "sophomore" "senior"    "5"
It is not the most efficient solution, but I think it is fairly didactic. We create edges separately for each row of your input data frame, hence the lapply. To create the edges from a row, we first remove NAs and duplicates, then include each vertex twice, and finally remove the first and the last element. What remains is a sequence of consecutive pairs, which we format as a two-column edge list matrix (it would actually be more efficient to leave it as a vector, but never mind).
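To make the steps easier to follow, here is a rough trace on a single row. It is only a sketch: x is row 1 of your data written out by hand, and y is a hypothetical row that does not occur in your data. The second part illustrates an assumption worth knowing about: unique() removes every repeated category, not only consecutive repeats, so if a student could return to an earlier category, something like rle() might be a better fit.

```r
# Row 1 of the question's data, written out as a plain character vector.
x <- c("freshman", "freshman", "junior", "senior")

v <- unique(na.omit(x))   # "freshman" "junior" "senior"
v <- rep(v, each = 2)     # every category twice
v <- v[-1]                # drop the leading copy
v <- v[-length(v)]        # drop the trailing copy
edges <- matrix(v, ncol = 2, byrow = TRUE)
edges                     # two edges: freshman -> junior, junior -> senior

# Hypothetical row (not in the question's data) where a category recurs.
y <- c("freshman", "junior", "freshman", NA)
unique(na.omit(y))                 # "freshman" "junior": the return is lost
runs <- rle(y[!is.na(y)])$values   # collapses consecutive repeats only
runs                               # "freshman" "junior" "freshman"
```

For the data in the question both approaches give the same edges, since no category recurs after a change.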
When adding the extra column, we must be careful to check whether our edge list matrix has zero rows.
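The zero-row case comes up for row 6 of your data, where the student never changes category. A small sketch of what happens there, with the row written out by hand; the point is that the empty edge matrix still ends up with three columns, so it can be stacked with the others:

```r
# Row 6 of the question's data: a single category, hence no edges.
x <- c("senior", "senior", NA, NA)

v <- rep(unique(na.omit(x)), each = 2)   # "senior" "senior"
v <- v[-1]
v <- v[-length(v)]                       # nothing left: character(0)
r <- matrix(v, ncol = 2, byrow = TRUE)
dim(r)                                   # 0 rows, 2 columns

# Binding a zero-length vector adds an (empty) third column.
r3 <- cbind(r, numeric())
dim(r3)                                  # 0 rows, 3 columns
```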
The do.call function just glues everything together with rbind. The result is a character matrix, which you can convert to a data frame via as.data.frame() if you like; you can then also convert the third column to numeric and change the column names.
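Since you mentioned weighting edges by count, here is one possible way to do the conversion and the counting in base R. Treat it as a sketch under a few assumptions: the matrix m is the result shown above, retyped by hand so the snippet runs on its own; the column names from, to, id and weight are my own choice; and the igraph step at the end is optional and only runs if the package is installed.

```r
# The edge matrix from above, rebuilt by hand for a self-contained example.
m <- cbind(
  c("freshman", "junior", "freshman", "junior", "sophomore",
    "freshman", "junior", "sophomore", "sophomore", "sophomore"),
  c("junior", "senior", "junior", "sophomore", "senior",
    "junior", "sophomore", "senior", "senior", "senior"),
  c("1", "1", "2", "2", "2", "3", "3", "3", "4", "5")
)

edges <- as.data.frame(m, stringsAsFactors = FALSE)
names(edges) <- c("from", "to", "id")   # hypothetical column names
edges$id <- as.numeric(edges$id)

# Count how often each transition occurs; the count serves as the weight.
edges$weight <- 1
weighted <- aggregate(weight ~ from + to, data = edges, FUN = sum)
weighted

# Optional: the first two columns become the edge list in igraph,
# any further columns (here weight) become edge attributes.
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(weighted, directed = TRUE)
}
```

With your data this gives four distinct edges, and freshman -> junior gets weight 3, as you expected.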