Tags: dataframe, join, pyspark, databricks

PySpark join dataframes with unique ids


I have two DataFrames in Databricks that I would like to join. The DataFrames are as follows:

df1

name count_1
a    3
b    4
c    3

df2

name count_2
a    3
b    4
d    8

I would like to join them and retain all the information (even though name d is absent from df1). So the desired output would be as follows:

name count_1 count_2
a    3       3
b    4       4
c    3       null
d    null    8

So I wrote the following query:

joined_df = df1.join(df2, df1['name'] == df2['name'], "outer")

It performs the outer join I want, but because the join condition is an equality expression rather than a column name, Spark keeps both name columns (one from each DataFrame). So currently I get this:

name    count_1  name  count_2 
a       3        a     3
b       4        b     4
c       3        null  null
null    null     d     8

So the question is: how can I perform a join to achieve my desired result instead of my current result?


Solution

  • Turns out I was just being silly. If I had passed the column name as the on argument of join, as intended, I would not have had this problem. The solution is:

    joined_df = df1.join(df2, 'name', "outer")
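
    A minimal self-contained sketch of the fix (assumes a local PySpark installation; the DataFrames mirror df1 and df2 from the question):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("join-demo").getOrCreate()

    df1 = spark.createDataFrame([("a", 3), ("b", 4), ("c", 3)], ["name", "count_1"])
    df2 = spark.createDataFrame([("a", 3), ("b", 4), ("d", 8)], ["name", "count_2"])

    # Passing the column name (a string, or a list of strings) as the `on`
    # argument tells Spark the key is shared, so the output has a single
    # `name` column instead of one from each side.
    joined_df = df1.join(df2, "name", "outer")
    joined_df.orderBy("name").show()
    ```

    The missing entries come back as null: row c has count_2 = null and row d has count_1 = null, matching the desired output above.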