I would like to build a graph out of the training dataset only. Here's my code:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.Row
// Each vertex struct is read as (id, test flag, validation flag).
val vertices = df.rdd.flatMap(row => row.getAs[Seq[Row]](3)
  .map(element => (element.getLong(0), element.getBoolean(1), element.getBoolean(2))))
// Keep only the ids flagged as neither test nor validation.
val verticesTrain = vertices.filter { case (id, test, validation) => !test && !validation }.map(_._1)
val edges = df.rdd.flatMap(row => row.getAs[Seq[Row]](1)
  .map(element => (element.getLong(0), element.getLong(1))))
val graph = Graph.apply(verticesTrain.map(vertex => (vertex, 1.0)), edges.map { case (s, d) => Edge(s, d, 1.0) })
However, when I count the graph's vertices, it seems to contain all of the vertices, not only the ones from verticesTrain:
graph.vertices.count()
Out: Long
56944
verticesTrain.count()
Out: Long
44906
How can I build the graph, with only verticesTrain as vertices?
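Using subgraph worked. Graph.apply also creates a vertex for every id that appears in an edge but is missing from the vertex RDD, and gives it the default attribute (null.asInstanceOf[VD] unless you pass defaultVertexAttr), which is why the count included vertices outside verticesTrain.
subgraph is the function to use when you want to filter vertices or edges out of a graph. Edges are kept only if both of their endpoints pass the vertex predicate.
Here is the code I used for this particular problem: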
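val graph = Graph.apply(verticesTrain.map(vertex => (vertex, 1.0)), edges.map { case (s, d) => Edge(s, d, 1.0) })
// null.asInstanceOf[Double] is 0.0, the default attribute of vertices created from edges; train vertices carry 1.0.
val filtered = graph.subgraph(vpred = (vid, vd) => vd != null.asInstanceOf[Double])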
filtered.vertices.count()
Out: Long
44906
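A variant that may be easier to read is to pass an explicit sentinel as defaultVertexAttr instead of relying on the null default. The following is only a minimal sketch on toy data, not the code from my project: the object name TrainSubgraphSketch, the helper trainOnly, the sentinel value -1.0 and the sample ids are all made up for illustration, and it assumes no real vertex ever carries the sentinel as its attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object TrainSubgraphSketch {
  // Hypothetical sentinel: the attribute given to vertices that GraphX
  // creates because they appear in an edge but not in the vertex RDD.
  val Missing: Double = -1.0

  // Builds the graph, then keeps only the vertices that were passed in explicitly.
  def trainOnly(verticesTrain: RDD[Long], edges: RDD[(Long, Long)]): Graph[Double, Double] = {
    val full = Graph(
      verticesTrain.map(id => (id, 1.0)),
      edges.map { case (s, d) => Edge(s, d, 1.0) },
      defaultVertexAttr = Missing)
    // subgraph also drops every edge that touches a removed vertex.
    full.subgraph(vpred = (_, attr) => attr != Missing)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("train-subgraph-sketch"))
    // Toy data: 1 and 2 are train vertices, 3 only shows up as an edge endpoint.
    val verticesTrain = sc.parallelize(Seq(1L, 2L))
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L)))

    val filtered = trainOnly(verticesTrain, edges)
    println(filtered.vertices.count()) // 2: vertex 3 was removed
    println(filtered.edges.count())    // 1: edge (2, 3) lost an endpoint
    sc.stop()
  }
}
```

With a sentinel the predicate no longer depends on null.asInstanceOf[Double] happening to be 0.0, so it keeps working if the train vertices are later given an attribute of 0.0.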