sqloracle-databaseduplicatesoracle18c

Select duplicates (including near-duplicate numbers)


I have a ROAD_PROJECTS table in Oracle 18c:

with road_projects (proj_id, road_id, year_, status, from_measure, to_measure) as (

select 100, 1, 2022, 'APPROVED',  null, 100.1 from dual union all 
select 101, 1, 2022, 'APPROVED',     0, 100.1 from dual union all

select 102, 1, 2022, 'APPROVED',     0, 200.6 from dual union all
select 103, 1, 2022, 'APPROVED',     0, 199.3 from dual union all
select 104, 1, 2022, 'APPROVED',     0,   201 from dual union all

select 105, 2, 2023, 'PROPOSED',     0,    50 from dual union all
select 106, 2, 2023, 'PROPOSED',    75,   100 from dual union all

select 107, 3, 2024, 'DEFERRED',     0,   100 from dual union all
select 108, 3, 2025, 'DEFERRED',     0,   110 from dual union all

select 109, 4, 2026, 'PROPOSED',     0,  null from dual union all
select 110, 4, 2026, 'DEFERRED',     0,  null from dual)

select * from road_projects
   PROJ_ID    ROAD_ID      YEAR_ STATUS   FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
       100          1       2022 APPROVED         null      100.1 --duplicates (other than PROJ_ID); null is to be treated as zero
       101          1       2022 APPROVED            0      100.1

       102          1       2022 APPROVED            0      200.6 --duplicates: TO_MEASURES are approximately the same (within a 5 metre tolerance)
       103          1       2022 APPROVED            0      199.3
       104          1       2022 APPROVED            0        201

       105          2       2023 PROPOSED            0         50 --not duplicates: FROM_MEASURES are different and TO_MEASURES are different
       106          2       2023 PROPOSED           75        100

       107          3       2024 DEFERRED            0        100 --not duplicates: YEARS are different and TO_MEASURES are different
       108          3       2025 DEFERRED            0        110

       109          4       2026 PROPOSED            0       null  --not duplicates: STATUSES are different
       110          4       2026 DEFERRED            0       null

DB<>fiddle


I want to select rows where the ROAD_ID, YEAR_, STATUS, FROM_MEASURES, and TO_MEASURES are duplicated.

The result would look like this:

   PROJ_ID    ROAD_ID      YEAR_ STATUS   FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
       100          1       2022 APPROVED         null      100.1 --duplicates
       101          1       2022 APPROVED            0      100.1

       102          1       2022 APPROVED            0      200.6 --duplicates
       103          1       2022 APPROVED            0      199.3
       104          1       2022 APPROVED            0        201

How can I select duplicates -- including near-duplicate numbers?


Solution

  • You could use an exists clause to look for a close duplicate:

    select * from road_projects rp
    where exists (
      select null
      from road_projects rp2
      where rp2.proj_id != rp.proj_id
      and rp2.road_id = rp.road_id
      and rp2.year_ = rp.year_
      and rp2.status = rp.status
      and abs(coalesce(rp2.from_measure, 0) - coalesce(rp.from_measure, 0)) < 5
      and abs(coalesce(rp2.to_measure, 0) - coalesce(rp.to_measure, 0)) < 5
    )
    order by proj_id
    
    PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE
    100 1 2022 APPROVED null 100.1
    101 1 2022 APPROVED 0 100.1
    102 1 2022 APPROVED 0 200.6
    103 1 2022 APPROVED 0 199.3
    104 1 2022 APPROVED 0 201

    fiddle

    The tolerance is applied by subtracting the two from (or to) measures, getting the absolute (unsigned) value for that difference, and seeing if that is less than 5.


    Partly for fun, partly because of @Jorge's comment... you could also do this with match_recognize pattern matching, which would allow you to keep track of which duplicates go together, and the min/max from/to across a group of duplicates, if that sort of thing might be useful:

    select proj_id, road_id, year_, status, from_measure, to_measure,
      from_tolerance, to_tolerance,
      dupe_group, first_proj_id, first_from_measure, last_to_measure
    from road_projects
    match_recognize (
      partition by road_id, year_, status
      order by from_measure nulls first, to_measure
      measures
        match_number() as dupe_group,
        first(proj_id) as first_proj_id,
        first(from_measure) as first_from_measure,
        final last(to_measure) as last_to_measure,
        abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) as from_tolerance,
        abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) as to_tolerance
      all rows per match
      after match skip past last row
      pattern (orig dupe+)
      define
        dupe as abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) < 5
          and abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) < 5
    )
    order by proj_id
    
    PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE FROM_TOLERANCE TO_TOLERANCE DUPE_GROUP FIRST_PROJ_ID FIRST_FROM_MEASURE LAST_TO_MEASURE
    100 1 2022 APPROVED null 100.1 0 0 1 100 null 100.1
    101 1 2022 APPROVED 0 100.1 0 0 1 100 null 100.1
    102 1 2022 APPROVED 0 200.6 0 1.3 2 103 0 201
    103 1 2022 APPROVED 0 199.3 0 0 2 103 0 201
    104 1 2022 APPROVED 0 201 0 .4 2 103 0 201

    fiddle