sql · oracle-database · performance · query-optimization · insert-select

Optimization needed for INSERT INTO ... SELECT query with a self join


A table EMPLOYEE has the structure below, with 5 million rows (5 × 10^6).

Name   
------  
EMPNAME
EMPID
MANAGERID (foreign key to same table)
STATUS  

We have a different table EmpAct into which we insert as below:

INSERT INTO empact (empname, empid, status)
SELECT empName, empid, status
FROM  employee e
WHERE e.status in (1, 2, 3) 
      OR
      EXISTS (SELECT 1 
              FROM employee  m 
              WHERE m.empid = e.managerid AND
                 m.status IN (1,2,3) 
            )

This becomes a costly operation: for each inactive employee (status not in 1, 2, 3), the EXISTS subquery runs against the same 5-million-row table, which looks like an O(N^2) operation.

Is there a way to make it a linear O(N) operation?
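Roughly, the behavior I'm worried about looks like this nested loop (an illustrative Python model of the plan I suspect, not actual Oracle internals):

```python
# Feared O(N^2) shape: each inactive employee re-scans the whole table.
ACTIVE = {1, 2, 3}

def naive_select(employees):
    """employees: list of (empname, empid, managerid, status) tuples."""
    out = []
    for (name, empid, managerid, status) in employees:    # O(N) outer scan
        if status in ACTIVE:
            out.append((name, empid, status))
            continue
        # EXISTS as an inner scan of the same table: O(N) per outer row.
        for (_, m_empid, _, m_status) in employees:
            if m_empid == managerid and m_status in ACTIVE:
                out.append((name, empid, status))
                break
    return out
```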

Also, is the INSERT INTO ... SELECT query OK to use, or should we use some other PL/SQL construct for the inserts?


Solution

  • This query can be converted to O(N) by rewriting it into a form that supports hash joins, although the algorithm analysis gets tricky very quickly and I'm not sure the new form will actually be faster.

    Sample Schema

    create table employee
    (
        empname varchar2(100),
        empid number primary key,
        managerid number references employee(empid),
        status number
    );
    create table empact as select empName, empid, status from employee where 1=0;
    
    insert into employee
    select level, level, level, 1 from dual connect by level <= 100000;
    begin
        dbms_stats.gather_table_stats(user, 'EMPLOYEE');
    end;
    /
    

    Original Query - O(N*LOG(N))

    explain plan for
    INSERT INTO empact
    SELECT empName, empid, status
    FROM  employee e
    WHERE e.status in (1, 2, 3) 
          OR
          EXISTS (SELECT 1 
                  FROM employee  m 
                  WHERE m.empid = e.managerid AND
                     m.status IN (1,2,3)
                );
    
    select * from table(dbms_xplan.display(format => 'basic'));
    
    
    Plan hash value: 961581243
    
    ------------------------------------------------------
    | Id  | Operation                     | Name         |
    ------------------------------------------------------
    |   0 | INSERT STATEMENT              |              |
    |   1 |  LOAD TABLE CONVENTIONAL      | EMPACT       |
    |   2 |   FILTER                      |              |
    |   3 |    TABLE ACCESS FULL          | EMPLOYEE     |
    |   4 |    TABLE ACCESS BY INDEX ROWID| EMPLOYEE     |
    |   5 |     INDEX UNIQUE SCAN         | SYS_C0027564 |
    ------------------------------------------------------
    

    The FILTER operation is a bit weird, but in this case I believe it's acting like a loop, so we can multiply the costs of operation 3 and operations 4/5 together to find the total run time. The TABLE ACCESS FULL is O(N) and the INDEX UNIQUE SCAN is O(LOG(N)), so you should be seeing O(N*LOG(N)) instead of O(N^2).

    If you're seeing two full table scans, that would be O(N^2), but then you should try to investigate why Oracle isn't using the index.
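    To make the O(N*LOG(N)) shape concrete, here's a toy Python model (illustrative only, not Oracle internals): a full scan of the driving table, plus one binary-search probe per row standing in for the INDEX UNIQUE SCAN.

    ```python
    import bisect

    ACTIVE = {1, 2, 3}

    def filter_with_index(employees):
        """Model of the FILTER plan: O(N) scan, O(log N) probe per row."""
        # Stand-in for the primary-key index: empids sorted with status.
        index = sorted((empid, status) for (_, empid, _, status) in employees)
        keys = [empid for (empid, _) in index]

        def manager_active(managerid):   # INDEX UNIQUE SCAN analogue
            if managerid is None:
                return False
            i = bisect.bisect_left(keys, managerid)
            return (i < len(keys) and keys[i] == managerid
                    and index[i][1] in ACTIVE)

        return [(name, empid, status)
                for (name, empid, managerid, status) in employees  # O(N)
                if status in ACTIVE or manager_active(managerid)]  # O(log N)
    ```

    N rows times an O(log N) probe each gives the O(N*LOG(N)) total above.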

    Striving for O(N)

    If you want to compare data in O(N) I believe a hash join is the only option. Hash joins only work with equality conditions, and in this case I think Oracle isn't smart enough to understand how to rewrite your query into regular equality conditions. We can do it ourselves by splitting the query into two pieces and UNIONing them together:

    explain plan for
    INSERT INTO empact
    SELECT empName, empid, status
    FROM  employee e
    WHERE e.status in (1, 2, 3) 
    UNION
    SELECT empName, empid, status
    FROM  employee e
    WHERE EXISTS (SELECT 1 
                  FROM employee  m 
                  WHERE m.empid = e.managerid AND
                     m.status IN (1,2,3)
                );
    
    select * from table(dbms_xplan.display(format => 'basic'));
    
    Plan hash value: 3147379352
    
    ---------------------------------------------
    | Id  | Operation                | Name     |
    ---------------------------------------------
    |   0 | INSERT STATEMENT         |          |
    |   1 |  LOAD TABLE CONVENTIONAL | EMPACT   |
    |   2 |   SORT UNIQUE            |          |
    |   3 |    UNION-ALL             |          |
    |   4 |     TABLE ACCESS FULL    | EMPLOYEE |
    |   5 |     HASH JOIN            |          |
    |   6 |      TABLE ACCESS FULL   | EMPLOYEE |
    |   7 |      TABLE ACCESS FULL   | EMPLOYEE |
    ---------------------------------------------
    

    The new plan is more complicated, but it uses a hash join in operation 5, which in theory is O(N) (two scans plus the hash build). There's another O(N) TABLE ACCESS FULL at operation 4, and an O(N*LOG(N)) SORT UNIQUE at operation 2, although hopefully the N is much smaller by that step.
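    The hash-join idea itself is easy to model outside the database. Here's an illustrative Python sketch (again not Oracle internals; it assumes the set of active employee ids fits in memory, just as a hash join's build side must):

    ```python
    # Build once, probe once: two O(N) passes instead of a probe per row.
    ACTIVE = {1, 2, 3}

    def hash_semi_join(employees):
        """employees: list of (empname, empid, managerid, status) tuples."""
        # Build pass, O(N): hash set of the ids of active employees.
        active_ids = {empid for (_, empid, _, status) in employees
                      if status in ACTIVE}
        # Probe pass, O(N): one constant-time set lookup per row.
        return [(name, empid, status)
                for (name, empid, managerid, status) in employees
                if status in ACTIVE or managerid in active_ids]

    rows = [
        ("alice", 1, None, 1),   # active
        ("bob",   2, 1,    9),   # inactive, but manager 1 is active
        ("carol", 3, 2,    9),   # inactive, manager 2 also inactive
    ]
    print(hash_semi_join(rows))  # [('alice', 1, 1), ('bob', 2, 9)]
    ```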

    What's Best?

    You're thinking about this question the right way - more people should consider the algorithm complexity of their programs. And usually converting a join condition to something that supports hash joins is a good idea. But with this weird join condition I'm not sure if you'll see an improvement. This might be one of the many cases where the constants and implementation details outweigh the complexity.

    If you want to learn more, here's a chapter I wrote about using algorithm analysis to understand SQL performance.