How to correctly index fields/groups of fields in a mysql database?

I really need advice on how to properly optimize mysql database queries. I'm afraid to spoil everything at the very beginning, which will turn out to be a big headache for me later.

There is a catalog with goods that have their own attributes - category, gender, age, brand, color, material, season, manufacturer. All these fields will participate in the selection/search. But not always all at once. These can be compound queries of the type:

Category, gender, presence

Category, gender, brand, color, season, availability

Category, manufacturer, availability

etc.

Exactly 2 fields will always be used - category and availability (Products that are not available in the catalog do not need to be displayed, but are not deleted from the database).

The migration is the following:

Schema::create('products', function (Blueprint $table) {
$table->id();
$table->string('name', 100);
$table->string('title', 100);
$table->string('category_title', 50)->index()->nullable();
$table->integer('old_price')->nullable();
$table->integer('price')->nullable();
$table->integer('sale')->nullable();
$table->string('description', 500)->nullable();
$table->string('short_description', 150)->nullable();
$table->string('img_small', 115)->nullable();
$table->string('age', 15)->index()->nullable();
$table->string('gender', 10)->index()->nullable();
$table->string('brand', 50)->index()->nullable();
$table->string('color', 50)->index()->nullable();
$table->string('material', 50)->index()->nullable();
$table->string('seazon', 50)->index()->nullable();
$table->string('country', 50)->index()->nullable();
$table->integer('stock')->default('0')->index();
$table->float('rating')->nullable()->index();
$table->integer('bought')->default('0')->index();

$table->timestamps();
}

At the moment, all fields participating in the index are marked with indexes. It is clear that when requesting multiple where conditions - this is not so useful. I tried using a composite index:

$table->index(['category_title', 'stock', 'gender', 'brand', 'color']);

And it works for this kind of requests. But I strongly doubt if I do a Cartesian intersection on all possible query conditions and add indexes there as the best solution.

A question - how in such situation it is correct to implement index fields? I thank you in advance for your help in resolving this issue.

Solution

You're right that you can't create as many indexes as it would take to cover every permutation of attributes. That would take N-factorial indexes, but in MySQL you can't make more than 64 indexes per table.

If you need to support arbitrary permutations of these attributes, you can't fully index all the possibilities.

What you can do is index the columns that you said are present in the conditions of every query: category and availability. This will at least narrow down the set of rows examined, then the other conditions will be applied to the examined rows, even if it isn't optimized with an index.

You could also make an index over three columns: category and availability and one more, so the search would filter examined rows by one attribute in addition to the common ones. Then any other conditions would be applied to the examined rows.

So you could make these indexes:

category, availability, gender
category, availability, brand
category, availability, color
category, availability, season
category, availability, manufacturer

And so on. The optimizer will choose one of these indexes, whichever it thinks is the most effective at narrowing down the number of examined rows.

Re your questions in comments.

When using the index "category, availability, gender" - will there be a quick search for "category, availability"? Or does this index only work for ternary comparisons?

And index on (category, availability, gender) will help a query that searches a subset of those columns, starting from left to right.

Think of a telephone book. It's like an index of (last name, first name, phone number) columns in that order. If you search by last name alone, it still helps. If you search by two columns — last name and first name — it also helps.

So you do NOT have to search using all the columns of the index to use the index.

But if you search for first name alone, those names are separated, because they could belong to anyone with different last names in the whole book, so the index is of no help.

So you must search using consecutive columns from left to right in the index. You don't have to use all of the columns, but you cannot skip columns.

If I make one long index "category, availability, gender, brand, color, season, manufacturer" - will it help speed when querying

Because of the rule above, a single index with every column is probably not helpful. It will help only for a query that searches the leading consecutive columns of that index.

In your example, you have an index on (category, availability, gender, brand, color, season, manufacturer) and a search by category, availability, gender, manufacturer, brand. It will narrow down the search by the leftmost columns of the index: category, availability, gender, brand.

Index:  (category, availability, gender, brand, color, season, manufacturer)

Search: (category, availability, gender, brand, ... manufacturer)

The order of AND search terms in your query is commutative, so MySQL knows how to match them to the columns of the index. Your WHERE clause does not need to be in the same order as the index definition.

Your search does not have terms for color or season, so that's a gap in the set of leftmost columns. Therefore subsequent columns of the index like manufacturer are not used for optimizing that query.

A lot of people tell me what I'm doing wrong. I need to link tables to each other and store attributes in other tables. If I make tables for each attribute - will the query run faster?

It sounds like those people are describing database normalization.

Performance is not the purpose of database normalization. The purpose is to put attributes in the right tables, which helps to avoid data anomalies. If every "fact" is stored only once, there is no possibility that a fact can be inconsistent with the same information stored in another table.

This doesn't necessarily help performance, but it helps because your data is less likely to become illogical and corrupt. If you tell your customer that their bank account balance is corrupt, but the query is very fast, that won't be much consolation to them.

Database normalization is not about separating attributes randomly. There are logical guides for designing normalized databases, and you should study them. You can find free resources like wikipedia: https://en.wikipedia.org/wiki/Database_normalization

I also wrote an appendix describing normal forms in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.