php mysql utf-8 unicode-normalization cleditor

With PHP and MySQL, how do I properly write smart quotes to the database?


I have a PHP website with the CLEditor richtext control on it. When I try to write Euros and British Pounds to the database, the character goes through just fine because I have the charset set to UTF-8 in the containing page HTML, in the richtext control IFRAME HTML, and in the MySQL table collation. All is well on that front. However, when I try to write smart quotes, I end up seeing this output in the database:

This is a â€œtestâ€.

(If that doesn't show up properly above in your browser, the test word has something like a Latin a, a Euro symbol, and the small AE symbol in front of the word, and a Latin a and a Euro symbol after it.)

When I use PHP to read that value back out of the database to display it on the page, it ends up as black diamonds with question marks on them as well as some other Latin characters.

What should I be doing to fix this?


Solution

  • First, make sure your MySQL table is using UTF-8 as its encoding. If it is, it will look like this:

    mysql> SHOW CREATE TABLE Users;
    ...
    CREATE TABLE `Users` (
    ...
    ) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8 |
    

    Next, make sure your HTML page is set to display UTF-8:

    <html>
        <head>
            <meta http-equiv="content-type" content="text/html;charset=UTF-8" />
        </head>
        ....
    </html>
    

    Then it should work.
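
One link the two checks above don't cover is the connection between PHP and MySQL, which has its own charset. Here is a minimal sketch of pinning it, assuming you connect through PDO (the question doesn't say which API is in use). `buildUtf8Dsn` and the host and database names are made up for illustration:

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that asks for a utf8 connection,
// so the bytes PHP sends are not treated as latin1 on the way in.
function buildUtf8Dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8";
}

$dsn = buildUtf8Dsn('localhost', 'mydb'); // placeholder host and database
echo $dsn, "\n";

// With real credentials you would then connect like this:
// $pdo = new PDO($dsn, $user, $password);
```

If you're on mysqli instead, calling `$mysqli->set_charset('utf8')` right after connecting does the same job.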


    EDIT: I purposefully did not talk about collation, because I thought it was already considered, but for the benefit of everyone, let me add some more to this answer.

    You state,

    I have the charset set to UTF-8 … in the MySQL table collation.

    Table collation is not the same thing as charset.

    A collation is the set of rules MySQL uses to compare and sort characters FOR THE PURPOSES OF QUERYING; the charset is what decides how those characters are stored as bytes. E.g., if your table has a charset of latin1 and you do something like SELECT * FROM foo WHERE bar LIKE '%—%'; (an em dash, U+2014, sent as UTF-8), MySQL has to convert between the two charsets before it can compare anything, and the query can match the em dash stored as the single latin1 byte 151 (0x97).

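    Not so coincidentally... if the UTF-8 bytes of a smart quote are read back as latin1 (which MySQL and browsers both treat as Windows-1252), you will get the following:

    This is a â€œtestâ€.

    That seems to be the output of your problem, exactly. Here's the HTML to duplicate it:

    <?php
    // Declare latin1 in the HTTP header too, so PHP's default_charset doesn't override the meta tag.
    header('Content-Type: text/html; charset=ISO-8859-1');
    $string = "This is a “test”.";
    ?>
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
        </head>
        <body>
            <p><?php echo $string; ?></p>
        </body>
    </html>
    

    Make sure you save this file as UTF-8, so that the declared charset and the actual bytes disagree.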
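
The garbling above can also be reproduced without a browser or a database. This is a small sketch, assuming the mbstring extension is available. `misreadAsCp1252` is a hypothetical helper name, and only the opening quote is used because byte 0x9D from the closing quote has no printable Windows-1252 character:

```php
<?php
// Hypothetical helper: treat the given bytes as Windows-1252 text
// and re-encode the result as UTF-8, which is what a charset mismatch does.
function misreadAsCp1252(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$open = "\u{201C}";                 // left smart quote, stored as UTF-8
echo bin2hex($open), "\n";          // e2809c (three bytes)
echo misreadAsCp1252($open), "\n";  // â€œ (one character per byte)
```

The three bytes E2, 80 and 9C are the Windows-1252 codes for â, € and œ, which is exactly the sequence described in the question.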

    To see what charset your table is set to, run this query:

    SELECT CCSA.character_set_name, TABLE_COLLATION FROM information_schema.`TABLES` T,
           information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA
    WHERE CCSA.collation_name = T.table_collation
      AND T.table_schema = "database"
      AND T.table_name = "table";
    

    The only proper result for your uses (unless you're using multiple non-English languages) is:

    +--------------------+-----------------+
    | character_set_name | TABLE_COLLATION |
    +--------------------+-----------------+
    | utf8               | utf8_general_ci |
    +--------------------+-----------------+
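
If the query reports something else, the table can be converted. This is a rough sketch only, assuming the row has been fetched as an associative array with the two column names shown above. `utf8ConversionSql` is a hypothetical helper, not part of any library:

```php
<?php
/**
 * Hypothetical helper: given one row from the information_schema query,
 * return an ALTER TABLE statement that converts the table to utf8,
 * or null when the table is already utf8 / utf8_general_ci.
 */
function utf8ConversionSql(string $table, array $row): ?string
{
    if ($row['character_set_name'] === 'utf8'
        && $row['TABLE_COLLATION'] === 'utf8_general_ci') {
        return null; // nothing to do
    }
    // Double any backticks so the table name stays a single quoted identifier.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE {$quoted} CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci";
}

$row = ['character_set_name' => 'latin1', 'TABLE_COLLATION' => 'latin1_swedish_ci'];
echo utf8ConversionSql('Users', $row), "\n";
```

Back up the table before running the generated statement. Converting the charset will not by itself repair rows that were already stored garbled.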
    

    Thanks for the upvotes ;-)