mysqldatetimeindexingquery-optimizationin-clause

Optimise query (Indexing, EXPLAIN) Mysql


Based on an advise from another developer here on stackoverflow, I have updated my query as below, but I still need to optimise it further. Can someone guide my on how best I can apply indexing to the query.

See Query Below:

SELECT a.id, a.user_unique_id, a.loan_location, 
          a.ippis, a.tel_no,
          a.organisation, a.branch, a.loan_agree, 
          a.loan_type, a.appr, a.sold, 
          a.loan_status, a.top_up, a.current_loan, 
          a.date_created, a.date_updated, c.loan_id, c.user_unique_id AS tu_user_unique_id, 
          c.ippis AS tu_ippis, c.top_up_approved, 
           c.loan_type AS tu_loan_type, c.dse, c.status, c.current_loan AS tu_current_loan, 
          c.record_category, c.date_created AS tu_date_created, 
          c.date_updated AS tu_date_updated 
FROM loan_applications_tbl a
LEFT JOIN topup_or_reapplication_tbl c
    ON a.ippis=c.ippis   
WHERE ((c.status IN ('pending', 'corrected', 'Rejected', 'Processing', 'Captured', 'Reviewed', 'top up') 
       AND MONTH(CURRENT_DATE) IN (MONTH(c.date_created), MONTH(c.date_updated) 
       AND YEAR(CURRENT_DATE) IN (YEAR(c.date_created), YEAR(c.date_updated)) 
       AND   c.current_loan='1' )) 
OR ( a.loan_status IN ('pending', 'corrected', 'Rejected', 'Processing', 'Captured', 'Reviewed', 'top up')
     AND MONTH(CURRENT_DATE) IN (MONTH(a.date_created), MONTH(a.date_updated)) )
     AND YEAR(CURRENT_DATE) IN (YEAR(a.date_created), YEAR(a.date_updated)) 
     AND (a.current_loan='1' 
          OR (a.current_loan='0' 
              AND a.loan_status IN('Approved','Closed')))))

Execution time: 53s

No of records: 11000

using mysql EXPLAIN gives the screenshot below: (How do I maximise the information in the possible_keys column

EXPLAIN gives the screenshot below: (How do I maximise the information in the possible_keys column)

I HAVE UPDATED ADDITIONAL INFORMATION BELOW:

I am using the OR between c and a for the below reasons:

  1. a is the parent table with 66 columns which gets populated with loan entries, if a new entry on a has a matching/existing ippis (Unique field on a) some columns in a are updated/overwriten with data from the new entry, while the remaining data in the entry are inserted as new rows in c (ippis is not unique in table c). This is to keep a history of all subsequent loan requests while not giving room for redundancy

  2. While retrieving records I needed the big OR clause to enable me check both a and c tables for all instances of each loan records where the status, date and current_loan columns match the parameters in my WHERE clause.

  3. a will ALWAYS have a complete record in it but c will NOT ALWAYS have a record in it, except there are more loan requests for the same unique ID. a contains the "who is the account person such as by unique ID", and the additional / supplemental status detail FOR THE FIRST LOAN, subsequently, after the first loan "c" will be additional / supplemental status detail of the actual loan applications with the same Unique ID.

  4. if "A" is created Mar 12, and a new "c" record is created on Mar 16. The "A" record also gets the last updated stamped with Mar 16 since it has a child attachment that has some impact against it, while the new c record has it own created and updated time stamps. The Updated field will be blank/null for the a record until changes are made or there is a c record, The Updated field will be blank/null for c record until there's some changes made to c record

I hope this is understandable


Solution

  • I keep forgetting the term as it comes up very rarely to me, but anyhow, your indexes can not be optimized by using MONTH() and YEAR() as they are functions on the underlying data. By applying a date RANGE, they can. So you can keep your month/year such as if something was created in Jan 2021 and updated in Mar 2021, but in addition, adding an "and c.date_created >= current_date AND current_date <= c.date_updated", you CAN utilize the index if it has the created date in it (less important in this case for the date updated. Similarly for your other table.

    In addition, when you have your left-join from the "a" to the "c" table, then applying where, its almost like you are trying to force the join but remains left-join due to the OR.

    I would move the "c" based condition to the left-join, then just test for the record found there as NULL or not.

    Although not clear (was not clarified when I asked), I THINK that when a new "A" record is created, the system may actually put the creation date into both the date created and date updated. IF THIS IS THE CASE, then we only need to query/concern the last updated date field with the current month/year of activity. That is now the PRIMARY requirement for the where clause -- REGARDLESS of the underlying OR condition to the "C" table.

    Additionally, since the month() and year() are not sargeable (Thanks Ollie), I am doing a prequery to get the beginning of the current month and next month so I can build out a

    WHERE > beginning of this month and LESS than beginning of next month
    

    As for indexes, I would start update to

    loan_applications_tbl ( date_created, date_updated, loan_status, current_loan, ippis )
    topup_or_reapplication_tbl ( ippis, status, current_loan, date_created, date_updated )
    

    Final query to try with.

    SELECT 
            a.id, 
            a.user_unique_id, 
            a.loan_location, 
            a.ippis, 
            a.tel_no,
            a.organisation, 
            a.branch, 
            a.loan_agree, 
            a.loan_type, 
            a.appr, 
            a.sold, 
            a.loan_status, 
            a.top_up, 
            a.current_loan, 
            a.date_created, 
            a.date_updated, 
            c.loan_id, 
            c.user_unique_id tu_user_unique_id, 
            c.ippis tu_ippis, 
            c.top_up_approved,
            c.loan_type tu_loan_type, 
            c.dse, 
            c.status, 
            c.current_loan tu_current_loan,
            c.record_category, 
            c.date_created tu_date_created,
            c.date_updated tu_date_updated 
        FROM 
            -- this creates inline mySQL variables I can use for the WHERE condition
            -- by doing comma after with no explicit join, it is a single row
            -- and thus no Cartesian result, just @variables available now
            ( select 
                    -- first truncating any TIME portion by casting to DATE()
                    @myToday := date(curdate()),
                    @howFarBack := date_sub( @myToday, interval 6 month ),
                    -- now subtract day of month -1 to get first of THIS month
                    @beginOfMonth := date_sub( @myToday, interval dayOfMonth( @myToday ) -1 day ),
                    -- and now, add 1 month for beginning of next
                    @beginNextMonth := date_add( @beginOfMonth, interval 1 month ) ) SqlVars,
    
            loan_applications_tbl a
        
                LEFT JOIN topup_or_reapplication_tbl c
                    ON  a.ippis = c.ippis   
                    AND c.current_loan='1'
                    AND c.status IN ('pending', 'corrected', 'Rejected', 
                                    'Processing', 'Captured', 'Reviewed', 'top up') 
                    AND 
                    (
                            (@beginOfMonth <= c.date_created 
                        AND c.date_created < @beginNextMonth)
            
                    OR
                            (@beginOfMonth <= a.date_updated 
                        AND a.date_updated < @beginNextMonth )
                    )
    
        WHERE
                -- forces only activity for the single month in question
                -- since the "a" table knows of any "updates" to the "C",
                -- its updated basis will keep overall restriction to any accounts
    
                -- updated within this month in question only
                -- testing specifically for created OR updated within the
                -- current month in question
    
            a.date_created >= @howFarBack
            AND
                (
                        (@beginOfMonth <= a.date_created 
                    AND a.date_created < @beginNextMonth)
            
                OR
                        (@beginOfMonth <= a.date_updated 
                    AND a.date_updated < @beginNextMonth )
                )
            
            -- and NOW we can easily apply the OR without requiring
            -- to run against the ENTIRE set of BOTH tables.
            AND (
                        c.ippis IS NOT NULL
                    OR 
                        ( a.loan_status IN (  'pending', 'corrected', 'Rejected', 'Processing', 
                                'Captured', 'Reviewed', 'top up')
                        AND (   
                                a.current_loan = '1' 
                            OR  (   a.current_loan = '0' 
                                AND a.loan_status IN ('Approved', 'Closed')
                                )
                            )
                        )
                )
    

    CLOSING COMMENTS FOR QUERY

    I modified the query and also the primary index on the first table to INCLUDE (first position) the date created of the record. I also added an additional variable @howFarBack to be the maximum going back time to consider for a loan. I defaulted to 6 months back. Would you ever need to consider a given account older than 6 months for a loan? Or is the "a" account records something that could go back 10 years and want to include? My impression is that it is a new LOAN APPLICATION add date. If so, allowing to go 6 months back before it is approved, finalized, cancelled would still prevent going through as many months of data historically.

    In the WHERE clause, I added explicit add for the CREATED_DATE >= @howFarBack. It would never be possible for a child record to be created, let alone updated any time prior to the original add date. This will force only current month activity OR FORWARD to qualify.

    Ex: Create a loan on April 28th. So running the query, the beginning of the month is April 1st but LESS than May 1st (this allows inclusion of April 30 at 11:59:59pm)

    Now, we get into May and a change on the loan is done on May 4th. We are in a new month and the @howFarBack still allows older applications as far as December 2020 to POSSIBLY qualify vs the entire table of applications that could go back as far as 2005 for all we know. You are always staying with the most current data and you can change the @howFarBack easily enough as the maximum going back time. This should help your performance needs.