[SOLVED] Can we have performance issues when using mixed table charsets in MySQL or in Postgres?

Can we have performance issues when using mixed table charsets in MySQL or in Postgres?

I'm working on a database that has tables with different charsets. Since it's a big database, I was wondering if it could lead to a performance issue. Yes, the usual value comparison a DB is usually doing is the JOIN and is done comparing integers, but are there any other performance problems we could experience having tables with different charsets, other than the bigger space taken by some charsets?

Solution

If you do string comparisons with incompatible collations, those comparisons cannot use an index on the string column. I've seen this happen when doing a JOIN on a string column, and the tables joined had different collations (naturally if they also have different character sets, they are also different collations).

But you said your joins are on integer columns, not string columns. So joins shouldn't be a problem in your case.

You can also have performance problems when doing lookups against string columns if your table character set doesn't match your session character set.

Example: My table is defined with utf8mb4, but I set my session to utf8, so string literals will be utf8. Seems like a harmless change, right?

mysql> set names utf8;

mysql> explain select * from mytable where text = 'abc123';
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | mytable | NULL       | ref  | text          | text | 83      | const |    1 |   100.00 | NULL  |
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+

I guess the utf8 string 'abc123' has a clear way to be promoted to utf8mb4 to match the column it compares to.

But if I force a specific collation that is not supported by utf8mb4, I see it has to do a table-scan and compare to rows one by one, instead of an indexed lookup:

mysql> explain select * from mytable where text = 'abc123' collate utf8_general_ci;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | mytable | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |   100.00 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+

There's a difference between implicit collations and explicit collations. Suppose I set my session to use something that doesn't have a clear path to utf8mb4:

mysql> set names latin1;

mysql> explain select * from mytable where text = 'abc123';
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | mytable | NULL       | ref  | text          | text | 83      | const |    1 |   100.00 | NULL  |
+----+-------------+---------+------------+------+---------------+------+---------+-------+------+----------+-------+

So far so good, but if I am explicit about the collation:

mysql> explain select * from mytable where text = 'abc123' collate latin1_general_ci;
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (latin1_general_ci,EXPLICIT) for operation '='

The bottom line is that you should use the same character set and collation to make your life easier. Use it for all tables and for the session too.

In these modern times, it's hard to think of a reason to use anything other than utf8mb4.

P.S. Space shouldn't be a problem. UTF-8 character sets allow multibyte characters, but they don't expand the size of characters that fit in a single byte. UTF-8 is a variable-width character encoding. So characters in the ASCII range (0-127) are stored in one byte anyway. Read UTF-8 on wikipedia for details, it has a nice explanation.