mysqlasp.netsortingfilterpagination

SQL sorting , paging, filtering best practices in ASP.NET


I am wondering how Google does it. I have a lot of slow queries when it comes to page count and total number of results. Google returns a count value of 250,000,00 in a fraction of a second.

I am dealing with grid views. I have built a custom pager for a gridview that requires an SQL query to return a page count based on the filters set by the user. The filters are at least 5 which includes a keyword, a category and subcategory, a date range filter, and a sort expression filter for sorting. The query contains about 10 massive table left joins.

This query is executed every time a search is performed and a query execution last an average of 30 seconds - be it count or a select. I believe what's making it slow is my query string of inclusive and exclusive date range filters. I have replaced (<=,>=) to BETWEEN and AND but still I experience the same problem.

See the query here: http://friendpaste.com/4G2uZexRfhd3sSVROqjZEc

I have problems with a long date range parameter.

Check my table that contains the dates: http://friendpaste.com/1HrC0L62hFR4DghE6ypIRp

UPDATE [9/17/2010] I minimized my date query and removed the time. I tried reducing the joins for my count query (I am actually having a problem with my filter count which takes to long to return a result of 60k rows).

      SELECT COUNT(DISTINCT esched.course_id)
        FROM courses c
           LEFT JOIN events_schedule esched
              ON c.course_id = esched.course_id
           LEFT JOIN course_categories cc
              ON cc.course_id = c.course_id
           LEFT JOIN categories cat
              ON cat.category_id = cc.category_id
     WHERE     1 = 1
           AND c.course_type = 1
           AND active = 1
           AND c.country_id = 52
           AND c.course_title LIKE '%cook%'
           AND cat.main_category_id = 40
           AND cat.category_id = 360
 AND (

    (2010-09-01' <= esched.date_start OR 2010-09-01' <= esched.date_end) 
    AND

    ('2010-09-25' >= esched.date_start OR '2010-09-25' >= esched.date_end)     
   )

I just noticed that my query is quite fast when I have a filter on my main or sub category fields. However when I only have a date filter and the range is a month or a week it needs to count a lot of rows and is done in 30seconds in average.

These are the static fields:

AND c.course_type = 1
AND active = 1
AND c.country_id = 52

UPDATE [9/17/2010] If a create a hash for these three fields and store it on one field will it do a change in speed?

These are my dynamic fields:

AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
// ?DateStart and ?DateEnd

UPDATE [9/17/2010]. Now my problem is the leading % in LIKE query

Will post an updated explain


Solution

  • Search engines like Google use very complex behind-the-scenes algorythyms to index searches. Essentially, they have already determined which words occur on each page as well as the relative importance of those words and the relative importance of the pages (relative to other pages). These indexes are very quick because they are based on Bitwise Indexing.

    Consider the following google searches:

    custom : 542 million google hits
    pager : 10.8 m
    custom pager 1.26 m
    

    Essentially what they have done is created a record for the word custom and in that record they have placed a 1 for every page that contains it and a 0 for every page that doesn't contain it. Then they zip it up because there are a lot more 0s than 1s. They do the same for pager.

    When the search custom pager comes in, they unzip both records, perform a bitwise AND on them and this results in an array of bits where length is the total number of pages that they have indexed and the number of 1s represents the hit count for the search. The position of each bit corresponds to a particular result which is known in advance and they only have to look up the full details of the first 10 to display on the first page.

    This is oversimplified, but that is the general principle.

    Oh yes, they also have huge banks of servers performing the indexing and huge banks of servers responding to search requests. HUGE banks of servers!

    This makes them a lot quicker than anything that could be done in a relational database.

    Now, to your question: Could you paste some sample SQL for us to look at?

    One thing you could try is changing the order that the tables and joins appear in your SQl statement. I know that it seems that it shouldn't make a difference but it certainly can. If you put the most restrictive joins earlier in the statement then you could well end up with fewer overall joins performed within the database.

    A real world example. Say you wanted to find all of the entries in the phonebook under the name 'Johnson', with the number beginning with '7'. One way would be to look for all the numbers beginning with 7 and then join that with the numbers belonging to people called 'Johnson'. In fact it would be far quicker to perform the filtering the other way around even if you had indexing on both names and numbers. This is because the name 'Johnson' is more restrictive than the number 7.

    So order does count, and datbase software is not always good at determining in advance which joins to perform first. I'm not sure about MySQL as my experience is mostly with SQL Server which uses index statistics to calculate which order to perform joins. These stats get out of date after a number of inserts, updates and deletes, so they have to be re-computed periodically. If MySQL has something similar, you could try this.

    UPDATE I have looked at the query that you posted. Ten left joins is not unusual and should perform fine as long as you have the right indexes in place. Yours is not a complicated query.

    What you need to do is break this query down to its fundamentals. Comment out the lookup joins such as those to currency, course_stats, countries, states and cities along with the corresponding fields in the select statement. Does it still run as slowly? Probably not. But it is probably still not ideal.

    So comment out all of the rest until you just have the courses and the group by course id and order by courseid. Then, experiment with adding in the left joins to see which one has the greatest impact. Then, focusing on the ones with the greatest impact on performance, change the order of the queries. This is the trial - and - error approach,. It would be a lot better for you to take a look at the indexes on the columns that you are joining on.

    For example, the line cm.method_id = c.method_id would require a primary key on course_methodologies.method_id and a foreign key index on courses.method_id and so on. Also, all of the fields in the where, group by and order by clauses need indexes.

    Good luck

    UPDATE 2 You seriously need to look at the date filtering on this query. What are you trying to do?

       AND ((('2010-09-01 00:00:00' <= esched.date_start
              AND esched.date_start <= '2010-09-25 00:00:00')
             OR ('2010-09-01 00:00:00' <= esched.date_end
                 AND esched.date_end <= '2010-09-25 00:00:00'))
            OR ((esched.date_start <= '2010-09-01 00:00:00'
                 AND '2010-09-01 00:00:00' <= esched.date_end)
                OR (esched.date_start <= '2010-09-25 00:00:00'
                    AND '2010-09-25 00:00:00' <= esched.date_end)))
    

    Can be re-written as:

    AND (
    
        //date_start is between range - fine
        (esched.date_start BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00') 
    
        //date_end is between range - fine
        OR (esched.date_end BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')       
    
        OR (esched.date_start <= '2010-09-01 00:00:00' AND esched.date_end >= '2010-09-01 00:00:00' ) 
    
        OR (esched.date_start <= '2010-09-25 00:00:00' AND esched.date_end > = '2010-09-25 00:00:00')
      )