I recently migrated a website and encountered an issue related to character encoding. The database is configured with Latin-1, while Apache defaults to UTF-8.
This mismatch causes serialized functions related to the tagging system of posts in the website to break (unserialize(): Error at offset 22 of 27 bytes in), as they rely on direct database connections to retrieve tag associations.
Which Environment Variable Could Be Responsible? I ask this myself, because in the previous EC2 environment, something—possibly an environment variable—seemed to manage this communication, ensuring proper interaction between PHP and the database. However, after migration, this mechanism is missing, and I haven't been able to determine where it was configured.
I tried the following approaches without success:
Checking and changing php.ini, httpd.conf and charset.conf settings to latin1 (default was set to UTF-8).
Converting the database to UTF-8.
Adding a condition in a function in config.php to force the connection to use latin1 (this temporarily resolved the issue but doesn’t explain the root cause).
Here are the server details of the previous EC2 instance, the new instance is based on apache7.3 docker:
MySQL: Ver 15.1 Distrib 5.5.68-MariaDB, for Linux (x86_64) using readline 5.1
Apache: 2.4.6 (CentOS)
PHP: 7.3.33 with Zend OPcache v7.3.33
Just to add more details:
The database is stored on the same EC2 server, and I’m connecting to it via localhost. Regarding PHP execution, I believe it’s running via mod_php on the server. I interact with the database using mysql functions.
Does anyone know which environment variable or configuration file might have been responsible for ensuring charset compatibility in the previous setup?
Any insights would be greatly appreciated!
Does anyone know which [parameter] might have been responsible for ensuring charset compatibility in the previous setup?
Any insights would be greatly appreciated!
Short answer: Many parameters can affect this.
But you perhaps want to understand the root of it (and how to fix that.)
Virtually, you may have to redo the migration. And before redoing the migration (mentally) you have to migrate the serialized data first and fix the column type of the affected columns.
(OR: If your setup currently works with the workaround you may be able to do this without redoing the migration. After applying the fix (see below), it must work without the workaround and the workaround should be removed.)
As outlined in Storing a serialized object in MySql database (Q&A), serialized data must not be stored in text fields with a character encoding other than BINARY unless you manage to preserve it's binary value on the connection.
To remove the double negation: PHP serialized data should be stored in BINARY columns. If you do otherwise problems like the ones you describe might arise.
Is UTF-8 now wrong? No, you did good to use UTF-8 towards the browser, this greatly enhances portability and interoperability for written text, JSON Text and HTML/XML data.
You may want to consider after you managed to solve the problem for the serialize-columns to migrate from Latin-1 to UTF-8 for the other text-columns, too.
This will help you to fully migrate your database and learn about the database charsets/connection encoding and all these configuration details. But I'd recommend to practice with the BINARY columns first (PHP serialized data).
References: