apache-spark graph cluster-analysis spark-graphx

Grouping people by hobbies


I have been trying to solve this problem but can't map it onto any solution I know. I have the following data set:

[
  {"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
  {"name": "Steve", "hobbies": ["Books", "Swimming"]},
  {"name": "Alex", "hobbies": ["Gym", "Music"]}
]

I am trying to generate an output dataset that groups people by the hobbies they share. The output should look something like this:

[
  {"names": ["sam", "Steve"], "hobbies": ["Books"]},
  {"names": ["sam", "Alex"], "hobbies": ["Music", "Gym"]},
  {"names": ["Steve"], "hobbies": ["Swimming"]}
]

It's a large dataset, so I was trying to use Spark.

Things I have tried:

Let me know if I am missing something obvious here. Thanks.


Solution

  • One way to do this is shown below, starting from this DataFrame:

    scala> df.show(false)
    +-------------------+-----+
    |hobbies            |name |
    +-------------------+-----+
    |[Books, Music, Gym]|sam  |
    |[Books, Swimming]  |Steve|
    |[Gym, Music]       |Alex |
    +-------------------+-----+
    

    Use groupBy and collect_list in two passes:

    1. Explode hobbies, group by hobby, and collect the list of names.
    2. Group by that list of names and collect the list of hobbies.

    scala> :paste
    // Entering paste mode (ctrl-D to finish)
    
    df
      .withColumn("hobbies", explode($"hobbies"))                             // one row per (name, hobby)
      .groupBy($"hobbies").agg(collect_list($"name").as("names"))             // names per hobby
      .groupBy($"names").agg(collect_list($"hobbies").as("hobbies"))          // hobbies per identical list of names
      .select(collect_list(to_json(struct($"hobbies", $"names"))).as("data")) // final JSON output
      .show(false)
    
    
    // Exiting paste mode, now interpreting.
    
    +--------------------------------------------------------------------------------------------------------------------------------------------+
    |data                                                                                                                                        |
    +--------------------------------------------------------------------------------------------------------------------------------------------+
    |[{"hobbies":["Swimming"],"names":["Steve"]}, {"hobbies":["Books"],"names":["sam","Steve"]}, {"hobbies":["Music","Gym"],"names":["sam","Alex"]}]|
    +--------------------------------------------------------------------------------------------------------------------------------------------+
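
    One caveat, offered as a hedged sketch and not as part of the original answer: Spark does not guarantee the order of the elements that collect_list returns after a shuffle, so two hobbies shared by the same people could yield name lists in different orders and end up in separate groups. The variant below makes each list canonical with sort_array(collect_set(...)) before grouping on it. It assumes duplicate names or hobbies are not wanted; the object name GroupByHobbies, the helper groupPeople, the intermediate column hobby, and the local master setting are all illustrative assumptions.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, collect_set, explode, sort_array}

    // Hypothetical wrapper object; only the Spark functions are real API.
    object GroupByHobbies {

      // Expects columns: name (string), hobbies (array<string>).
      def groupPeople(df: DataFrame): DataFrame =
        df
          // One row per (name, hobby) pair.
          .select(col("name"), explode(col("hobbies")).as("hobby"))
          // Sorted, de-duplicated names per hobby, so equal groups compare equal.
          .groupBy(col("hobby"))
          .agg(sort_array(collect_set(col("name"))).as("names"))
          // Hobbies that share exactly the same array of names.
          .groupBy(col("names"))
          .agg(sort_array(collect_set(col("hobby"))).as("hobbies"))

      def main(args: Array[String]): Unit = {
        // Assumption: a local session, for trying the sketch on sample data.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("group-by-hobbies")
          .getOrCreate()
        import spark.implicits._

        val df = Seq(
          ("sam", Seq("Books", "Music", "Gym")),
          ("Steve", Seq("Books", "Swimming")),
          ("Alex", Seq("Gym", "Music"))
        ).toDF("name", "hobbies")

        groupPeople(df).show(false)
        spark.stop()
      }
    }
    ```

    If duplicates must be kept, sort_array(collect_list(...)) preserves them while still fixing the order.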
    
    

    Formatted Output

    [
      {"hobbies": ["Swimming"], "names": ["Steve"]},
      {"hobbies": ["Books"], "names": ["sam", "Steve"]},
      {"hobbies": ["Music", "Gym"], "names": ["sam", "Alex"]}
    ]
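
    To check the two-pass grouping on a small sample without a cluster, the same logic can be sketched with plain Scala collections. This is only an illustration of the idea, not a replacement for the Spark job; Person, Group, HobbyGroups and groupByHobbies are hypothetical names, and names and hobbies are sorted so that the result is deterministic.

    ```scala
    // Hypothetical names throughout; plain Scala collections, no Spark required.
    object HobbyGroups {
      final case class Person(name: String, hobbies: List[String])
      final case class Group(names: List[String], hobbies: List[String])

      def groupByHobbies(people: List[Person]): List[Group] = {
        // Pass 1: explode to (hobby, name) pairs and collect the names per hobby.
        val namesByHobby: Map[String, List[String]] =
          people
            .flatMap(p => p.hobbies.map(h => (h, p.name)))
            .groupBy(_._1)
            .map { case (hobby, pairs) => hobby -> pairs.map(_._2).distinct.sorted }

        // Pass 2: merge hobbies that are shared by exactly the same names.
        namesByHobby.toList
          .groupBy(_._2)
          .map { case (names, entries) => Group(names, entries.map(_._1).sorted) }
          .toList
          .sortBy(_.hobbies.head) // stable output order for inspection
      }

      val people: List[Person] = List(
        Person("sam", List("Books", "Music", "Gym")),
        Person("Steve", List("Books", "Swimming")),
        Person("Alex", List("Gym", "Music"))
      )

      def main(args: Array[String]): Unit =
        groupByHobbies(people).foreach(println)
    }
    ```

    For the sample data this yields three groups: Books for Steve and sam, Gym and Music for Alex and sam, and Swimming for Steve alone. The default String ordering is case-sensitive, so "Steve" sorts before "sam".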