mysqlsqlencodingprimary-keydiacritics

Use accent sensitive primary key in MySQL


Desired result :

Have an accent sensitive primary key in MySQL.

I have a table of unique words, so I use the word itself as a primary key (by the way if someone can give me an advice about it, I have no idea if it's a good design/practice or not).

I need that field to be accent (and why not case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflexions of the French verb "demander". I do not have any problem to store accented words in the database. I just can't insert two accented characters strings that are identical when unaccented.

Error :

When trying to create the 'demandé' row with the following query:

INSERT INTO `corpus`.`token` (`name_token`) VALUES ('demandé');

I got this error :

ERROR 1062: 1062: Duplicate entry 'demandé' for key 'PRIMARY'

Questions :

SOLUTION using 'collate utf8_general_ci' in table declaration

SOLUTION using 'collate utf8_bin' with WHERE statement

Step by step :

Database creation :

CREATE DATABASE corpus DEFAULT CHARACTER SET utf8;

Table of unique words :

CREATE TABLE token (name_token VARCHAR(50), freq INTEGER, CONSTRAINT pk_token PRIMARY KEY (name_token))

Queries

SELECT * FROM corpus.token WHERE name_token = 'demande';
SELECT * FROM corpus.token WHERE name_token = 'demandé';

both returns the same row:

demande

Solution

  • Collations. You have two choices, not three:

    utf8_bin treats all of these as different: demandé and demande and Demandé.

    utf8_..._ci (typically utf8_general_ci or utf8_unicode_ci) treats all of these as the same: demandé and demande and Demandé.

    If you want only case sensitivity (demandé = demande, but neither match Demandé), you are out of luck.

    If you want only accent sensitivity (demandé = Demandé, but neither match demande), you are out of luck.

    Declaration. The best way to do whatever you pick:

    CREATE TABLE (
        name VARCHAR(...)  CHARACTER SET utf8  COLLATE utf8_...  NOT NULL,
        ...
        PRIMARY KEY(name)
    )
    

    Don't change collation on the fly. This won't use the index (that is, will be slow) if the collation is different in name:

    WHERE name = ... COLLATE ...
    

    BINARY. The datatypes BINARY, VARBINARY and BLOB are very much like CHAR, VARCHAR, and TEXT with COLLATE ..._bin. Perhaps the only difference is that text will be checked for valid utf8 storing in a VARCHAR ... COLLATE ..._bin, but it will not be checked when storing into VARBINARY.... Comparisons (WHERE, ORDER BY, etc) will be the same; that is, simply compare the bits, don't do case folding or accent stripping, etc.