Converting a pairRDD to a Dataset in Spark using Java


How can I create a Spark Dataset from a pairRDD using Java? Could you please help?


Solution

  • To go from a Dataset to a JavaPairRDD in Java, first convert the Dataset to a JavaRDD with javaRDD(), then map each row to a Tuple2 with mapToPair.

    Here is an example:

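    import static org.apache.spark.sql.functions.col;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import scala.Tuple2;

    // Create a Dataset of rows with columns x and y = x * x (assumes `spark` is an existing SparkSession)
    Dataset<Row> ds = spark
        .range(5)
        .select(col("id").alias("x"),
                col("id").multiply(col("id")).alias("y"));
    // Convert to a JavaRDD<Row>, then map each row to an (x, y) pair
    JavaPairRDD<Long, Long> pairRDD = ds
        .javaRDD()
        .mapToPair(row -> new Tuple2<>(row.getLong(0), row.getLong(1)));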
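
For the opposite direction, which is what the question asks about, one possible approach is to pass the underlying RDD of tuples to createDataset together with a tuple encoder. The following is a minimal sketch, not a definitive implementation: the class name PairRddToDataset, the helper toDataset, the sample data, the local master setting, and the column names x and y are all assumptions chosen for illustration.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class PairRddToDataset {

    // Hypothetical helper: exposes a JavaPairRDD<Long, Long> as a typed Dataset of tuples.
    static Dataset<Tuple2<Long, Long>> toDataset(SparkSession spark,
                                                 JavaPairRDD<Long, Long> pairRDD) {
        return spark.createDataset(
            pairRDD.rdd(), // underlying RDD of Tuple2 elements
            Encoders.tuple(Encoders.LONG(), Encoders.LONG()));
    }

    public static void main(String[] args) {
        // Local session for illustration only; a real job would configure this differently.
        SparkSession spark = SparkSession.builder()
            .appName("pair-rdd-to-dataset")
            .master("local[*]")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Sample pair RDD (made-up data).
        JavaPairRDD<Long, Long> pairRDD = jsc.parallelizePairs(Arrays.asList(
            new Tuple2<Long, Long>(1L, 1L),
            new Tuple2<Long, Long>(2L, 4L),
            new Tuple2<Long, Long>(3L, 9L)));

        Dataset<Tuple2<Long, Long>> ds = toDataset(spark, pairRDD);

        // Optionally rename the default tuple columns (_1, _2) to get a Dataset<Row>.
        Dataset<Row> named = ds.toDF("x", "y");
        named.show();

        spark.stop();
    }
}
```

For key or value types other than Long, the encoders passed to Encoders.tuple would need to change accordingly, for example Encoders.STRING() for String keys.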