mysqlsqlpython-dedupe

How is having < comparison different than != in following case?


I'm trying to understand this example at https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html. How is having < comparison different than != in following case?

 read_cur.execute("""
           select a.donor_id,
                  json_object('city', a.city,
                              'name', a.name,
                              'zip', a.zip,
                              'state', a.state,
                              'address', a.address),
                  b.donor_id,
                  json_object('city', b.city,
                              'name', b.name,
                              'zip', b.zip,
                              'state', b.state,
                              'address', b.address)
           from (select DISTINCT l.donor_id as east, r.donor_id as west
                 from blocking_map as l
                 INNER JOIN blocking_map as r
                 using (block_key)
                 **where l.donor_id < r.donor_id) ids**
           INNER JOIN processed_donors a on ids.east=a.donor_id
           INNER JOIN processed_donors b on ids.west=b.donor_id
           """)

VS

    read_cur.execute("""
           select a.donor_id,
                  json_object('city', a.city,
                              'name', a.name,
                              'zip', a.zip,
                              'state', a.state,
                              'address', a.address),
                  b.donor_id,
                  json_object('city', b.city,
                              'name', b.name,
                              'zip', b.zip,
                              'state', b.state,
                              'address', b.address)
           from (select DISTINCT l.donor_id as east, r.donor_id as west
                 from blocking_map as l
                 INNER JOIN blocking_map as r
                 using (block_key)
                 **where l.donor_id != r.donor_id) ids**
           INNER JOIN processed_donors a on ids.east=a.donor_id
           INNER JOIN processed_donors b on ids.west=b.donor_id
           """)

Solution

  • When you use < you are assured that every pair will be included only once.

    l.donor_id  r.donor_id
        1           2
        3           4
    

    Whereas if you use != every pair will be included twice.

    l.donor_id  r.donor_id
        1           2
        2           1
        3           4
        4           3