php csv character-encoding byte-order-mark utf-16le

PHP cannot parse CSV correctly (file is in UTF-16LE)


I am trying to parse a CSV file using PHP.
The file uses commas as delimiters and double quotes around fields that contain commas, like this:

foo,"bar, baz",foo2

The issue I am facing is that fields containing commas get split at the internal comma. I get:

Instead of: 2, rue du ....


Encoding:
The file doesn't seem to be in UTF-8. It has weird characters at the beginning (apparently not a BOM; they look like this when converted from ASCII to UTF-8: ÿþ) and doesn't display accents.

But it fails to convert:


Parsing:
I tried to parse with this one-liner (see those 2 comments) using str_getcsv():

$csv = array_map('str_getcsv', file($file['tmp_name']));

I then tried with fgetcsv():

$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
    $arr[] = $l;
}
fclose($f);

Either way, my address field ends up in 2 parts. But when I try this code sample, the fields are parsed correctly:

$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';
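
A likely reason the literal string parses while the uploaded file does not: in UTF-16LE every ASCII character is followed by a NUL byte, so each opening quote is preceded by the NUL left over from the comma and no longer sits at the start of its field. The sketch below re-encodes the sample string to imitate the file. The names `$utf16`, `$raw`, and `$fixed` are illustrative, and it assumes the mbstring extension is loaded.

```php
<?php
// Illustrative only: imitate the uploaded file by encoding the sample as UTF-16LE.
$sample = 'foo,"bar, baz",foo2';
$utf16  = mb_convert_encoding($sample, 'UTF-16LE', 'UTF-8');

// Parse the raw UTF-16LE bytes: each quote is preceded by a NUL byte,
// so it is no longer the first byte of its field.
$raw = str_getcsv($utf16, ',', '"', '\\');

// Convert back to UTF-8 first, then parse: the quoted field stays in one piece.
$fixed = str_getcsv(mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE'), ',', '"', '\\');

echo count($raw), ' field(s) from raw bytes, ', count($fixed), " after conversion\n";
```

Comparing `print_r($raw)` with `print_r($fixed)` should show where the extra split happens.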

To sum up with questions:


Solution

  • I finally solved it myself:

    I submitted the file to online encoding-detection websites, which reported UTF-16LE. Reading up on UTF-16LE, I found that it comes with a BOM (Byte Order Mark).
    My previous attempts used file(), which returns an array of the file's lines, and fopen(), which returns a resource; in both cases the data was still read and parsed line by line.

    What finally worked was converting the whole file at once instead of converting each line separately. Here is a working solution:

    $f = file_get_contents($file['tmp_name']);          // Get the whole file as string
    $f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');  // Convert the file to UTF-8
    $f = preg_split('/\R/u', $f);                       // Split it by line breaks (UTF-8 aware)
    $f = array_map('str_getcsv', $f);                   // Parse lines as CSV data
    

    The address fields are no longer split at internal commas.
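
One caveat with the approach above: the byte order mark (bytes FF FE, which show up as ÿþ when read as Latin-1) may survive the conversion and end up as an invisible character at the start of the first field. A minimal sketch that strips it first and skips blank lines follows. `parseUtf16LeCsv` is a hypothetical helper name, not part of PHP, and the sketch assumes the mbstring extension is loaded and that no quoted field contains a line break.

```php
<?php
// Hypothetical helper: decode a UTF-16LE CSV string and parse it into rows.
function parseUtf16LeCsv(string $raw): array
{
    // Drop the UTF-16LE byte order mark (bytes FF FE) if present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }

    // Convert the whole content at once, as in the solution above.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    $rows = [];
    // The /u modifier makes \R match line breaks in UTF-8 mode.
    foreach (preg_split('/\R/u', $utf8) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after the final line break
        }
        // Explicit separator, enclosure, and escape characters.
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}
```

With the upload from the question, this could be called as `$rows = parseUtf16LeCsv(file_get_contents($file['tmp_name']));`.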