I have two big files.
For simplicity, say the email file contains:
eId emailcontent
e1 xxxxxxxx
e2 yyyyyyyy
e3 zzzzzzzz
The attachment file contains:
aid attachmentcontent eid
a1 att1 e1
a2 att2 e1
a3 att3 e2
a4 att4 e3
a5 att5 e3
a6 att6 e3
NOTE: A broadcast-variable join has already been performed between the email file and another, small file. Both of the files here are too big for a broadcast variable to be used again.
I want to join these two files using JavaPairRDD with eid as the join column, but I can't make a pair RDD keyed by eid because multiple attachments are linked to the same eid key.
I tried converting the JavaRDD<Email> and JavaRDD<Attachment> to Datasets and performing the join there, but Email is a complex class (it contains lists of other classes as fields), so the converted Dataset does not contain any records.
Neither of these two approaches solves my problem, so I am looking for a solution I have not considered here, or for anything I am missing in the approaches above.
The above problem was solved using JavaPairRDD.
For the email file I created a JavaPairRDD<eId, Email>, as eId is unique for each email, and for the attachment file a JavaPairRDD<eId, Iterable<Attachment>>, as one eId can have multiple attachments. (groupByKey returns an Iterable of values per key, not an Iterator.)
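Then I created the JavaPairRDD for email: JavaPairRDD<eId, Email> rddEmail = emailRdd.mapToPair(record -> new Tuple2<>(record.getEId(), record));
and the JavaPairRDD for attachment: JavaPairRDD<eId, Iterable<Attachment>> rddAttachment = attachmentRdd.mapToPair(record -> new Tuple2<>(record.getEId(), record)).groupByKey();
(Here eId in the type parameters stands for the key type, and getEId() stands for whatever accessor returns the email id on each record.)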
Finally, I performed rddEmail.join(rddAttachment), which yields a JavaPairRDD<eId, Tuple2<Email, Iterable<Attachment>>>, and applied the remaining logic as required.
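To make the shape of the result concrete, here is a minimal plain-Java sketch (no Spark involved) that mimics groupByKey followed by an inner join on the sample data from the question. The class name GroupThenJoinSketch and its helper methods are hypothetical, introduced only for illustration, and strings stand in for the real Email and Attachment classes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: no Spark classes are used here.
final class GroupThenJoinSketch {

    // Mimics groupByKey(): collect the ids of all attachments sharing an eId.
    // Each attachment row is {aid, attachmentcontent, eid}.
    static Map<String, List<String>> groupByEmailId(List<String[]> attachments) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : attachments) {
            grouped.computeIfAbsent(row[2], k -> new ArrayList<>()).add(row[0]);
        }
        return grouped;
    }

    // Mimics an inner join(): keep only the eIds present on both sides.
    static Map<String, Map.Entry<String, List<String>>> join(
            Map<String, String> emails, Map<String, List<String>> grouped) {
        Map<String, Map.Entry<String, List<String>>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, String> email : emails.entrySet()) {
            List<String> attachmentIds = grouped.get(email.getKey());
            if (attachmentIds != null) {
                joined.put(email.getKey(),
                        new SimpleEntry<>(email.getValue(), attachmentIds));
            }
        }
        return joined;
    }

    // Sample email rows from the question: eId -> emailcontent.
    static Map<String, String> sampleEmails() {
        Map<String, String> emails = new LinkedHashMap<>();
        emails.put("e1", "xxxxxxxx");
        emails.put("e2", "yyyyyyyy");
        emails.put("e3", "zzzzzzzz");
        return emails;
    }

    // Sample attachment rows from the question.
    static List<String[]> sampleAttachments() {
        return Arrays.asList(
                new String[] {"a1", "att1", "e1"},
                new String[] {"a2", "att2", "e1"},
                new String[] {"a3", "att3", "e2"},
                new String[] {"a4", "att4", "e3"},
                new String[] {"a5", "att5", "e3"},
                new String[] {"a6", "att6", "e3"});
    }

    public static void main(String[] args) {
        Map<String, Map.Entry<String, List<String>>> joined =
                join(sampleEmails(), groupByEmailId(sampleAttachments()));
        // Print one line per email: its content and its attachment ids.
        joined.forEach((eId, pair) ->
                System.out.println(eId + " -> " + pair.getKey() + " " + pair.getValue()));
    }
}
```

Each email ends up paired with the full list of its attachments: e1 with a1 and a2, e2 with a3, and e3 with a4, a5 and a6. Two points are worth keeping in mind for the real Spark job. A pair RDD may hold duplicate keys, so joining the ungrouped attachment pairs directly would also work and would give one row per (email, attachment) pair. And join is an inner join, so emails without attachments are dropped; if those must be kept, leftOuterJoin on JavaPairRDD is probably the better fit.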