apache-spark, apache-spark-sql, apache-spark-dataset, apache-spark-2.0

Joining two big files in a one-to-many relationship in Java Spark


I have two big files:

  1. email file
  2. attachment file

For simplicity, say the email file contains:

eId   emailContent
e1    xxxxxxxx
e2    yyyyyyyy
e3    zzzzzzzz

and the attachment file contains:

aId   attachmentContent   eId
a1    att1                e1
a2    att2                e1
a3    att3                e2
a4    att4                e3
a5    att5                e3
a6    att6                e3

NOTE: A broadcast join has already been performed between the email file and another small file. Both of the files here are too big for a broadcast variable to be used again.

I want to join these two files using JavaPairRDD with eId as the join key, but I can't make a simple pair RDD keyed by eId because multiple attachments are linked to the same eId.

I also tried converting the JavaRDD<Email> and JavaRDD<Attachment> to Datasets and performing the join there, but Email is a complex class (it contains several other classes as List fields), so the conversion to a Dataset returns no records.
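
For reference, a simplified sketch of the attempted conversion (the bean shape, class name, and accessors below are illustrative; the real Email class nests other classes as List fields, which is where the conversion broke down):

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public final class DatasetAttempt {

        // Simplified stand-in for the real Email bean; the actual class nests
        // other classes as List fields, and the bean-encoder conversion of
        // that class reportedly produced an empty Dataset.
        public static class Email implements Serializable {
            private String eId;
            private String emailContent;
            public String getEId() { return eId; }
            public void setEId(String eId) { this.eId = eId; }
            public String getEmailContent() { return emailContent; }
            public void setEmailContent(String emailContent) { this.emailContent = emailContent; }
        }

        // The attempted conversion: wrap the JavaRDD in a Dataset via a bean encoder.
        public static Dataset<Email> toDataset(SparkSession spark, JavaRDD<Email> emailRdd) {
            return spark.createDataset(emailRdd.rdd(), Encoders.bean(Email.class));
        }
    }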

Neither of these approaches solves my problem, so I'm looking for a solution I haven't considered, or for whatever I'm missing in the scenarios above.


Solution

  • The above problem was solved using JavaPairRDD.

    For the email file, create a JavaPairRDD<String, Email> keyed by eId, since eId is unique for each email. For the attachment file, create a JavaPairRDD<String, Iterable<Attachment>> keyed by eId, since one eId can have multiple attachments.

    Then build the email pair RDD as JavaPairRDD<String, Email> rddEmail = emailRdd.mapToPair(email -> new Tuple2<>(email.eId, email)); and the attachment pair RDD as JavaPairRDD<String, Iterable<Attachment>> rddAttachment = attachmentRdd.mapToPair(att -> new Tuple2<>(att.eId, att)).groupByKey(); (note that groupByKey yields an Iterable per key, not an Iterator).
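
    A minimal, self-contained sketch of this step (the Email and Attachment classes, field names, and method names below are simplified assumptions based on the sample data; the real classes are more complex):

        import java.io.Serializable;
        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import scala.Tuple2;

        public final class EmailAttachmentPairRdds {

            // Simplified record shapes; field names follow the sample data above.
            public static class Email implements Serializable {
                public String eId;
                public String emailContent;
            }

            public static class Attachment implements Serializable {
                public String aId;
                public String attachmentContent;
                public String eId;
            }

            // eId is unique per email, so a plain pair RDD keyed by eId is enough.
            public static JavaPairRDD<String, Email> keyEmailsById(JavaRDD<Email> emailRdd) {
                return emailRdd.mapToPair(email -> new Tuple2<>(email.eId, email));
            }

            // One eId can have many attachments, so key by eId and group,
            // which yields an Iterable<Attachment> per eId.
            public static JavaPairRDD<String, Iterable<Attachment>> groupAttachmentsByEmailId(
                    JavaRDD<Attachment> attachmentRdd) {
                return attachmentRdd
                        .mapToPair(att -> new Tuple2<>(att.eId, att))
                        .groupByKey();
            }
        }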

    Finally, perform rddEmail.join(rddAttachment) and apply the rest of the logic as required.
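
    The join step can be captured as a small generic helper (again a sketch; the class and method names are illustrative):

        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import scala.Tuple2;

        public final class OneToManyJoin {

            // Inner-join a one-record-per-key RDD (emails) with a grouped
            // many-records-per-key RDD (attachments).
            // K = join key (eId), E = email type, A = attachment type.
            public static <K, E, A> JavaRDD<Tuple2<E, Iterable<A>>> join(
                    JavaPairRDD<K, E> emails,
                    JavaPairRDD<K, Iterable<A>> groupedAttachments) {
                return emails
                        .join(groupedAttachments) // JavaPairRDD<K, Tuple2<E, Iterable<A>>>
                        .values();                // keep each (email, its attachments) pair
            }
        }

    Note that an inner join drops emails that have no attachments; leftOuterJoin keeps them if that is needed.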