scala, apache-spark, spark-graphx

Create a graph from specific vertices with GraphX (Spark)


I would like to build a graph out of a training dataset. Here's my code:

val vertices = df.rdd.flatMap(row => row.getAs[Seq[Row]](3)
        .map(element => (element.getLong(0),element.getBoolean(1),element.getBoolean(2))))

val verticesTrain = vertices.filter{case(id,test,validation) => (test==false)&&(validation==false)}.map(_._1)

val edges = df.rdd.flatMap(row => row.getAs[Seq[Row]](1)
        .map(element => (element.getLong(0),element.getLong(1))))

val graph = Graph.apply(verticesTrain.map(vertex => (vertex,1.0)),edges.map{case(s,d)=>Edge(s,d,1.0)})

However, when I count the graph's vertices, it seems I have all of the vertices, not only the ones from verticesTrain:

graph.vertices.count()
Out: Long
56944
verticesTrain.count()
Out: Long
44906

How can I build the graph, with only verticesTrain as vertices?


Solution

  • Using subgraph worked:

    This function should be used when you want to filter either vertices or edges out of a graph.

    Here is the code I used for this particular problem:

    val graph = Graph.apply(verticesTrain.map(vertex => (vertex, 1.0)), edges.map { case (s, d) => Edge(s, d, 1.0) })
    
    // Graph.apply gives every vertex that appears only in `edges` the default
    // attribute null.asInstanceOf[Double] (i.e. 0.0), so this predicate keeps
    // only the vertices that were explicitly supplied with attribute 1.0.
    val filtered = graph.subgraph(vpred = (vid, vd) => vd != null.asInstanceOf[Double])
    
    filtered.vertices.count()
    Out: Long
    44906
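Why this predicate works: `Graph.apply` assigns its `defaultVertexAttr` parameter (which defaults to `null.asInstanceOf[VD]`) to any vertex ID that appears in the edge RDD but not in the supplied vertex RDD. With `VD = Double`, casting `null` to the primitive `Double` yields `0.0`, while every vertex in `verticesTrain` was given `1.0`. A minimal plain-Scala sketch of that cast behaviour, runnable without Spark:

```scala
object NullDoubleDemo {
  def main(args: Array[String]): Unit = {
    // Casting null to a primitive Double unboxes to the type's default, 0.0.
    val defaultAttr: Double = null.asInstanceOf[Double]
    println(defaultAttr)              // prints 0.0

    // So `vd != null.asInstanceOf[Double]` is true exactly for the
    // vertices we supplied ourselves with attribute 1.0.
    val supplied = 1.0
    println(supplied != defaultAttr)  // prints true
  }
}
```

Note that this also filters out any supplied vertex whose attribute happens to equal `0.0`; passing an explicit sentinel as `defaultVertexAttr` to `Graph.apply` avoids that ambiguity.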