pysparkapache-spark-sqldata-quality

Pyspark how can identify unmatched row value from two data frame


I have below two data frame from which i am trying to identify the unmatched row value from data frame two. This is the part of migration where i want to see the difference after source data being migrated/moved to different destination.

source_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  1.1|  1.2|
|def|  3.0|  3.4|
+---+-----+-----+

dest_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  2.1|  2.2|
|def|  3.0|  3.4|
+---+-----+-----+

i want to see the output something like below

key: abc,

col:          val11                  val12

difference:  [src-1.1,dst:2.1]       [src:1.2,dst:2.2]

Any solution for this?


Solution

  • source_df  = spark.createDataFrame(
      [
    ('abc','1.1','1.2'),
    ('def','3.0','3.4'),
      ], ['key','val11','val12']
    )
    
    dest_df  = spark.createDataFrame(
      [
    ('abc','2.1','2.2'),
    ('def','3.0','3.4'),
      ], ['key','val11','val12']
    )
    
    report = source_df\
        .join(dest_df, 'key', 'full')\
        .filter((source_df.val11 != dest_df.val11) | (source_df.val12 != dest_df.val12))\
        .withColumn('difference_val11', F.concat(F.lit('[src:'), source_df.val11, F.lit(',dst:'),dest_df.val11,F.lit(']')))\
        .withColumn('difference_val12', F.concat(F.lit('[src:'), source_df.val12, F.lit(',dst:'),dest_df.val12,F.lit(']')))\
        .select('key', 'difference_val11', 'difference_val12')
    
    report.show()
    
    +---+-----------------+-----------------+
    |key| difference_val11| difference_val12|
    +---+-----------------+-----------------+
    |abc|[src:1.1,dst:2.1]|[src:1.1,dst:2.1]|
    +---+-----------------+-----------------+
    

    Or, if you want exactally in that format:

    for x in report.select('key', 'difference_val11', 'difference_val12').collect():
        print("key: " + str(x[0]) + ",\n\n" +\
              "col:          val11                 val12\n\n" +\
             "difference:   " + str(x[1]) + "     " + str(x[2]))
    

    Output:

    key: abc,
    
    col:          val11                 val12
    
    difference:   [src:1.1,dst:2.1]     [src:1.2,dst:2.2]