I am trying to parse a CSV file using PHP.
The file uses commas as the delimiter and double quotes around fields that contain commas, for example:
foo,"bar, baz",foo2
The issue I am facing is that fields containing commas get split. I get:
"2
rue du ..."
Instead of: 2, rue du ...
Encoding:
The file doesn't seem to be in UTF-8. It has weird characters at the beginning (apparently not a BOM; they look like this when converted from ASCII to UTF-8: ÿþ) and doesn't display accents.
Using mb_detect_encoding() on the CSV lines returns ASCII. But converting fails:
- mb_convert_encoding() converts from ASCII, but returns Asian characters when converting from UTF-16LE
- iconv() returns "Notice: iconv(): Wrong charset, conversion from UTF-16LE/ASCII to UTF8 is not allowed."

Parsing:
I tried to parse with this one-liner (see those 2 comments) using str_getcsv():
$csv = array_map('str_getcsv', file($file['tmp_name']));
I then tried with fgetcsv():
$arr = [];
$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
    $arr[] = $l; // Collect each parsed row
}
fclose($f);
Either way, my address field ends up in 2 parts. But when I try this code sample, the fields are parsed correctly:
$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';
To sum up with questions:
UTF-16 LE and doesn't display weird characters at the beginning)

I finally solved it myself:
I submitted the file to online encoding detection websites, which reported UTF-16LE. Reading up on UTF-16LE, I found that such files can start with a BOM (Byte Order Mark).
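The ÿþ seen at the start of the file matches the bytes FF FE as displayed in Latin-1, which is the UTF-16LE byte order mark. As a rough sketch of how this could be checked in PHP rather than with an online tool, the snippet below inspects the first bytes of a string. The function name detectBom() is hypothetical and not part of PHP or of the original code, and it only covers the three most common BOMs.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a PHP built-in):
// guesses the encoding from a leading byte order mark, or returns null.
// Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and is not
// distinguished here.
function detectBom(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    return null; // No recognised BOM
}
```

It could be called on the first few bytes of the upload, for example detectBom(file_get_contents($file['tmp_name'], false, null, 0, 4)).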
My previous attempts used file(), which returns an array of the file's lines, and fopen(), which returns a resource; in both cases the data was still parsed line by line.
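A likely reason the line-by-line attempts split the quoted fields: in UTF-16LE every ASCII character is followed by a NUL byte, so the byte right after a comma is NUL rather than the opening quote, and the byte-oriented CSV parser no longer treats the field as enclosed. The sketch below reproduces this with the sample line from the question; the variable names are only illustrative.

```php
<?php
$line = 'foo,"bar, baz",foo2';

// The same text as it would be stored in UTF-16LE: each ASCII character
// is followed by a NUL byte.
$utf16 = mb_convert_encoding($line, 'UTF-16LE', 'UTF-8');

// Parsing the raw UTF-16LE bytes: the byte after each comma is NUL, not a
// quote, so the quoted field is not recognised and its inner comma splits it.
$broken = str_getcsv($utf16, ',', '"', '\\');

// Converting to UTF-8 first lets the parser see the quotes again.
$fixed = str_getcsv(mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE'), ',', '"', '\\');

echo count($broken), ' fields before conversion, ', count($fixed), " after\n";
```

The first call returns more fields than the second, which mirrors the address field being cut at its internal comma.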
The idea that finally worked was to convert the whole file at once instead of converting each line separately. Here is a working solution:
$f = file_get_contents($file['tmp_name']); // Get the whole file as a string
$f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE'); // Convert the file to UTF-8
$f = preg_split('/\R/u', $f); // Split on line breaks (u flag: treat the input as UTF-8, so byte 0x85 inside a character is not taken for a newline)
$f = array_map('str_getcsv', $f); // Parse each line as CSV data
The address fields are no longer split at their internal commas.
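Two details may still need handling with the approach above: the BOM is usually carried over into the converted string (so it can end up glued to the first field), and a trailing newline produces an extra empty line. The following is a minimal sketch of a wrapper that deals with both. The function name parseUtf16LeCsv() is hypothetical, and it assumes the input really is UTF-16LE and that no quoted field contains a line break.

```php
<?php
// Hypothetical helper (an assumption for illustration): converts a UTF-16LE
// CSV string to UTF-8, drops a leading BOM, and parses each non-empty line.
function parseUtf16LeCsv(string $raw): array
{
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // If the BOM survived the conversion it now appears as EF BB BF.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    $rows = [];
    foreach (preg_split('/\R/u', $utf8) as $line) {
        if ($line === '') {
            continue; // Skip blank lines, e.g. after a trailing newline
        }
        // Explicit delimiter, enclosure and escape arguments
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}
```

With the upload from the question it would be used as $rows = parseUtf16LeCsv(file_get_contents($file['tmp_name']));. If fields can contain line breaks, splitting on line breaks first is not safe; writing the converted text to a temporary stream and reading it with fgetcsv() would be one alternative.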