optimizationquery-optimizationdatabase-performance

Custom search query for many-to-many related tables optimization


I have two tables that are related by a many-to-many relation. Now, I want to retrieve results from table 1, where the rows in table 1 are related to specific criteria in a column in table 2. For example, if we have two tables: "pizza" and "toppings", I want to retrieve all the pizzas that have either toppings "a", "b", "c", or "d".

I'm wondering if it's faster to loop through the film titles once and compare the title (first approach), or loop through them multiple times and intersect the resulting IDs (second approach).

Also, I'm looking for a larger database that has this relation, which I can use to test the performance.

Please note that I will be using SQLite, and the performance matters mostly on this database.

To test the performance, I used the "sqlite-salika.db" database (from Kaggle) and ran the following code. However, the time difference doesn't provide much insight into which approach performs better. Both queries are run on SQLite, and the table is not expected to have more than 10,000 rows. But I want to know which approach will have better performance in general, especially in larger databases.


public static void CheckStringWithEachID(SQLiteConnection conn) {
    Stopwatch sw = new Stopwatch();
    int count = 0;

    sw.Start();
    // AND happens before OR
    using (var cmd = new SQLiteCommand(conn)) {
        cmd.CommandText = @"
            SELECT act.first_name,act.last_name, fi.title  FROM actor act
            JOIN film_actor fa ON act.actor_id = fa.actor_id
            JOIN film fi ON fa.film_id = fi.film_id
            WHERE LOWER(fi.title) LIKE LOWER('t%')
                AND LOWER(fi.title) LIKE LOWER('%r')
                OR LOWER(fi.title) LIKE LOWER('%bird%')
                OR LOWER(fi.title) LIKE LOWER('%house%')
            ";
        var reader = cmd.ExecuteReader();
        while (reader.Read()) {
            //Console.WriteLine(string.Format("{0,-12} - {1,-13} {2,-20}", reader.GetValue(0),reader.GetValue(1),reader.GetValue(2)));
            count++;
        }
    }
    sw.Stop();
    Console.WriteLine("Result query count = " + count);
    Console.WriteLine("Elapsed={0}", sw.Elapsed); //00:00:00.0539724
}
public static void CheckIfInIDs(SQLiteConnection conn) {
    Stopwatch sw = new Stopwatch();
    int count = 0;
    sw.Start();
    using (var cmd = new SQLiteCommand(conn)) {
        cmd.CommandText = @"
            SELECT act.* FROM actor act
            JOIN film_actor fa ON act.actor_id = fa.actor_id
            JOIN film fi ON fa.film_id = fi.film_id
            WHERE LOWER(fi.film_id)  IN (SELECT film_id FROM film WHERE LOWER(title) LIKE LOWER('t%'))
                AND LOWER(fi.film_id)  IN (SELECT film_id FROM film WHERE LOWER(title) LIKE LOWER('%r'))
                OR LOWER(fi.film_id)  IN (SELECT film_id FROM film WHERE LOWER(title) LIKE LOWER('%bird%'))
                OR LOWER(fi.film_id)  IN (SELECT film_id FROM film WHERE LOWER(title) LIKE LOWER('%house%'))
            ";
        var reader = cmd.ExecuteReader();
        while (reader.Read()) {
            //Console.WriteLine(string.Format("{0,-12} - {1,-13} {2,-20}", reader.GetValue(0), reader.GetValue(1), reader.GetValue(2)));
            count++;
        }
    }
    sw.Stop();
    Console.WriteLine("Result query count = " + count);
    Console.WriteLine("Elapsed={0}", sw.Elapsed);
}
static void Main(string[] args) {
    using (var conn = new SQLiteConnection("Data Source=" + "sqlite-sakila.db")){
        conn.Open();
        CheckStringWithEachID(conn); // connection first time overhead? slower for some reason
        CheckStringWithEachID(conn);
        CheckIfInIDs(conn);
        conn.Close();

    }

}

I expected a larger and consistent time variance to provide a clear indication of which approach is better overall.

Results of the code (while adding/changing/removing some of the OR and AND conditions):

Result query count = 31
Elapsed=00:00:00.0058455
Result query count = 31
Elapsed=00:00:00.0059143

Result query count = 87
Elapsed=00:00:00.0080970
Result query count = 87
Elapsed=00:00:00.0065998

Result query count = 77
Elapsed=00:00:00.0049120
Result query count = 77
Elapsed=00:00:00.004717

Solution

  • I have found that the first answer is does not actually work like I expected. Because it will compare each film_title (in my example) with all those conditions, which is not the intended. I want the actors that have participated in film_titles where at least the from a film_title he participated in satisfying one of the conditions that are ANDed (Hence the second way is the one of two ways I have found that will work. The other way not written here was to do multiple the same query multiple times with different namings and have the evaluated value changed, then concatenate them together).

    Unfortunately I can't link the other query as I've lost where I found it. But the string to form it would be a lot bigger.