php utf-8 file-upload extended-ascii smart-quotes

Handling Extended ASCII in File Uploads


A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text format (to ease development) is .txt, and uploads normally go off without a hitch (or not...).

The problem I've encountered is the same as any developer's: Microsoft's Extended ASCII.

Before outputting the text from the file, I go over several different layers to try to clean it up:

$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);

// BOM Fun: each entry is array(length in bytes, BOM bytes).
// utf32le comes before utf16le because its BOM begins with the utf16le BOM.
    $boms = array
    (
        "utf8"          => array(3,pack("CCC",0xEF,0xBB,0xBF)),
        "utf32be"       => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
        "utf32le"       => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
        "utf16be"       => array(2,pack("CC",0xFE,0xFF)),
        "utf16le"       => array(2,pack("CC",0xFF,0xFE)),
        "gb18030"       => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
    );
    foreach($boms as $bom)
    {
        // substr, not mb_substr: the BOM lengths above are counted in bytes
        if(substr($txtfile,0,$bom[0]) === $bom[1])
        {
            $txtfile = substr($txtfile,$bom[0]);
            break;
        }
    }
$txtfile_o = $txtfile;
// Single-byte codes for the smart single quotes, smart double quotes, em dash and ellipsis
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
// No source encoding is given, so mbstring's internal encoding is assumed
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");

The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.

This code works perfectly fine as long as the uploaded file is ANSI / us-ascii.

This code does not work (for no apparent reason) when the uploaded file is UTF-8.

When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In this event, the smart quotes become some sort of accented a character.

This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.

Any and all help on this would be greatly appreciated.


Solution

  • If I understand correctly, your problem is that the code that replaces "extended ASCII" characters with their ASCII counterparts fails when the user submits a file in UTF-8.

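    This was to be expected. str_replace and the like operate at the byte level, and in UTF-8 only characters in the ASCII range are encoded as a single byte. A smart quote is a three-byte sequence there (for example, the left single quote U+2018 is 0xE2 0x80 0x98), so searching for chr(145) never matches a whole character and can instead hit a byte in the middle of some other character. The accented a you see is probably the lead byte 0xE2 being treated as Latin-1 (â) by the final mb_convert_encoding call, which is given no source encoding.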

    What I'd recommend is using some heuristic to determine whether the file is encoded in UTF-8 (the BOM is a good indicator if you're sure it'll be present), Windows-1252, or something else, and then converting it to UTF-8 if it isn't already. In that case you wouldn't need to replace any characters; you could preserve the smart quotes.
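
A minimal sketch of that heuristic, under a few assumptions: the function name to_utf8 is made up for illustration, the mbstring extension is available, and any upload that is not valid UTF-8 is assumed to be Windows-1252 (other legacy encodings would need their own detection).

```php
<?php
// Hypothetical helper, not part of the original code: normalise uploaded text to UTF-8.
function to_utf8(string $raw): string
{
    // Strip a UTF-8 BOM if present. substr() is used because the BOM is
    // three bytes, not three characters.
    if (substr($raw, 0, 3) === "\xEF\xBB\xBF") {
        $raw = substr($raw, 3);
    }

    // Heuristic: if the bytes already form valid UTF-8, leave them alone.
    // Pure ASCII also passes this check and needs no conversion.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Assumption: everything else is Windows-1252 ("ANSI" on Western Windows).
    // Naming the source encoding explicitly avoids relying on mbstring's
    // internal encoding setting.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}
```

With something like this, the file contents would go through to_utf8() once, and the chr()-based str_replace step could be dropped. If plain ASCII quotes are still wanted, str_replace should be safe as long as the search strings are the complete UTF-8 sequences (for example "\u{2018}" in PHP 7+) rather than single bytes. The validity check is only a heuristic: a short Windows-1252 file could, in rare cases, also happen to be valid UTF-8.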