I have three PySpark dataframes in Databricks: raw_old, raw_new, and master_df. These are placeholders to work out the logic on a smaller scale (the actual tables contain billions of rows). All three have a column called label. I want to calculate the number of labels that appear in:

- raw_old and raw_new (the answer is 3: A789, B456, D123)
- raw_new and master_df (the answer is 2: C456, D123)
- raw_old and master_df (the answer is 4: A654, B987, C987, D123)
- raw_old, raw_new, and master_df (the answer is 1: D123)

The three tables are below. How do I calculate the above bullet points?
raw_old
+---+-----+
| id|label|
+---+-----+
| 1| A987|
| 2| A654|
| 3| A789|
| 4| B321|
| 5| B456|
| 6| B987|
| 7| C321|
| 8| C654|
| 9| C987|
| 10| D123|
+---+-----+
raw_new
+---+-----+
| id|label|
+---+-----+
| 1| A123|
| 2| A456|
| 3| A789|
| 4| B123|
| 5| B456|
| 6| B789|
| 7| C123|
| 8| C456|
| 9| C789|
| 10| D123|
+---+-----+
master_df
+---+-----+
| id|label|
+---+-----+
| 1| A999|
| 2| A654|
| 3| A000|
| 4| B111|
| 5| B000|
| 6| B987|
| 7| C999|
| 8| C456|
| 9| C987|
| 10| D123|
+---+-----+
You should use an inner join to get the elements in common between the two datasets. Select only the label column and deduplicate it first: this keeps the join input small, avoids ambiguous duplicate column names, and ensures repeated labels don't inflate the count
joined_data = raw_old.select("label").distinct().join(
    raw_new.select("label").distinct(),
    on="label",
    how="inner"
)
and then you can collect just the count back to Python, keeping all the heavy work in Spark
print(joined_data.count())
When joining 3 dataframes, you can join the first 2 and then join the resulting dataframe with the third one.