
From dataframe to vertex/edge array


I have the dataframe

test <- structure(list(
     y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
     y2003 = c("freshman","junior","junior","sophomore","sophomore","senior"),
     y2004 = c("junior","sophomore","sophomore","senior","senior",NA),
     y2005 = c("senior","senior","senior",NA, NA, NA)), 
              .Names = c("2002","2003","2004","2005"),
              row.names = c(c(1:6)),
              class = "data.frame")
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   <NA>
5 sophomore sophomore    senior   <NA>
6    senior    senior      <NA>   <NA>

and I need to create a vertex/edge list (for use with igraph) with one row for every change in a student's category between consecutive years, ignoring years with no change, as in

testvertices <- structure(list(
 vertex = 
  c("freshman","junior", "freshman","junior","sophomore","freshman",
    "junior","sophomore","sophomore","sophomore"),
 edge = 
  c("junior","senior","junior","sophomore","senior","junior",
    "sophomore","senior","senior","senior"),
 id =
  c("1","1","2","2","2","3","3","3","4","5")),
                       .Names = c("vertex","edge", "id"),
                       row.names = c(1:10),
                       class = "data.frame")
> testvertices
      vertex      edge id
1   freshman    junior  1
2     junior    senior  1
3   freshman    junior  2
4     junior sophomore  2
5  sophomore    senior  2
6   freshman    junior  3
7     junior sophomore  3
8  sophomore    senior  3
9  sophomore    senior  4
10 sophomore    senior  5

At this point I'm ignoring the ids; my graph should weight edges by count (e.g., freshman -> junior = 3). The idea is to make a tree graph. I know this is beside the main data-munging point, but I mention it in case you ask.


Solution

  • If I understand you correctly, you need something like this:

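    elist <- lapply(seq_len(nrow(test)), function(i) {
      x <- as.character(test[i,])      # row i as a character vector
      x <- unique(na.omit(x))          # drop NAs and repeated categories
      x <- rep(x, each=2)              # a a b b c c
      x <- x[-1]                       # drop the first element
      x <- x[-length(x)]               # drop the last one: a b b c
      r <- matrix(x, ncol=2, byrow=TRUE)   # one (from, to) pair per row
      # add the row number as id; a zero-row matrix gets an empty column
      if (nrow(r) > 0) { r <- cbind(r, i) } else { r <- cbind(r, numeric()) }
      r
    })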
    
    do.call(rbind, elist)
    
    #                              i  
    # [1,] "freshman"  "junior"    "1"
    # [2,] "junior"    "senior"    "1"
    # [3,] "freshman"  "junior"    "2"
    # [4,] "junior"    "sophomore" "2"
    # [5,] "sophomore" "senior"    "2"
    # [6,] "freshman"  "junior"    "3"
    # [7,] "junior"    "sophomore" "3"
    # [8,] "sophomore" "senior"    "3"
    # [9,] "sophomore" "senior"    "4"
    #[10,] "sophomore" "senior"    "5"
    

    It is not the most efficient solution, but I think it is fairly didactic. We create the edges separately for each row of your input data frame, hence the lapply. To build the edges for a row, we first remove NAs and duplicates, then repeat each remaining vertex twice, and finally drop the first and last elements. Read two at a time, the resulting vector is a sequence of (from, to) pairs, so all that is left is to format it as a two-column matrix (it would actually be more efficient to leave it as a vector, but never mind).
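
    To make the repeat-and-trim trick concrete, here is a small trace for a single student, using row 1 of the question's data. The variable names (doubled, trimmed, pairs) are made up for this illustration and do not appear in the code above.

```r
# Row 1 of the question's data: categories for 2002-2005
x <- c("freshman", "freshman", "junior", "senior")

# Drop NAs and repeated categories
x <- unique(na.omit(x))
# "freshman" "junior" "senior"

# Repeat every vertex twice
doubled <- rep(x, each = 2)
# "freshman" "freshman" "junior" "junior" "senior" "senior"

# Drop the first and the last element
trimmed <- doubled[-c(1, length(doubled))]
# "freshman" "junior" "junior" "senior"

# Read two at a time: each row is one (from, to) edge
pairs <- matrix(trimmed, ncol = 2, byrow = TRUE)
pairs
```

    One caveat: unique() also drops a category that reappears later in the row. If a student could return to an earlier category, rle(x)$values might be a better fit, since it collapses only consecutive repeats.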

    When adding the extra column, we must be careful to check whether our edge list matrix has zero rows, which happens when a student never changes category (student 6 here).
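
    As a quick check of the zero-row case, the sketch below runs the same steps on student 6's row. The name r3 is just for this illustration. As far as I know, cbind() keeps a zero-length vector as a column only when the result has zero rows, which is exactly the situation here.

```r
# Student 6 never changes category
x <- unique(na.omit(c("senior", "senior", NA, NA)))   # just "senior"
x <- rep(x, each = 2)
x <- x[-1]
x <- x[-length(x)]                                    # nothing left

# A matrix with zero rows and two columns
r <- matrix(x, ncol = 2, byrow = TRUE)
dim(r)

# Add an empty id column so the piece has three columns like the others
r3 <- cbind(r, numeric())
dim(r3)

# rbind() can now combine it with a non-empty three-column piece
rbind(r3, cbind("sophomore", "senior", 4))
```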

    The do.call(rbind, elist) call then just glues everything together. The result is a character matrix, which you can convert to a data frame via as.data.frame() if you like; you can then convert the third column to numeric and change the column names.
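
    A minimal sketch of that last step, plus the edge counts the question asks for. The edge list is typed in by hand so the snippet runs on its own, and the names edge_df, weights, from, to and weight are my own choices, not something the question or igraph requires.

```r
# The matrix that do.call(rbind, elist) returns for the question's data
edges <- cbind(
  c("freshman", "junior", "freshman", "junior", "sophomore",
    "freshman", "junior", "sophomore", "sophomore", "sophomore"),
  c("junior", "senior", "junior", "sophomore", "senior",
    "junior", "sophomore", "senior", "senior", "senior"),
  c("1", "1", "2", "2", "2", "3", "3", "3", "4", "5")
)

# Matrix -> data frame, with readable names and a numeric id
edge_df <- as.data.frame(edges, stringsAsFactors = FALSE)
names(edge_df) <- c("from", "to", "id")
edge_df$id <- as.numeric(edge_df$id)

# Count how often each from -> to pair occurs (ids are ignored here)
edge_df$weight <- 1
weights <- aggregate(weight ~ from + to, data = edge_df, FUN = sum)
weights

# If igraph is installed, the counted edge list can be turned into a graph;
# columns after the first two become edge attributes:
# g <- igraph::graph_from_data_frame(weights, directed = TRUE)
```

    With this data, freshman -> junior gets weight 3, as in the question.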