php mime email-parsing

PHP Mailparse chokes on non-ascii characters


I'm using Mailparse to parse email and store it in a MySQL database. Emails are piped directly to a PHP script. More than 99% of emails to the system are parsed correctly, but I noticed that some were being truncated. The problem seems to be non-ASCII characters between the headers and the body of the message:

Delivered-To: nkafq123@gmail.com
Received: by 10.152.1.193 with SMTP id 1csp311490lao;
        Mon, 20 Oct 2014 05:33:31 -0700 (PDT)
Return-Path: <lunalono@telia.com>
Received: from vps4596.inmotionhosting.com (vps4596.inmotionhosting.com. [74.124.217.238])
        by mx.google.com with ESMTPS id fb7si7786786pab.30.2014.10.20.05.33.30
        for <nkafq123@gmail.com>
        (version=TLSv1 cipher=RC4-SHA bits=128/128);
        Mon, 20 Oct 2014 05:33:30 -0700 (PDT)
Message-ID: <14FBD481E1074C79A706F0C071746F3D@acerDator>
From: =?utf-8?Q?Annelen_geretschl=C3=A4ger?= <lunalono@telia.com>
To: "neokio" <nkafq123@gmail.com>
References: <CAEMnOreG=99=qx-ONib=g+3mCQnUHC2kgdu2uBdSav5WP303BA@mail.gmail.com>
In-Reply-To: <CAEMnOreG=99=qx-ONib=g+3mCQnUHC2kgdu2uBdSav5WP303BA@mail.gmail.com>
Subject: This message will be broken
Date: Mon, 20 Oct 2014 14:33:24 +0200
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="----=_NextPart_000_0018_01CFEC72.CE424470"
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 14.0.8117.416
X-MimeOLE: Produced By Microsoft MimeOLE V14.0.8117.416
X-Source: 
X-Source-Args: 
X-Source-Dir: 

Det här är ett flerdelat meddelande i MIME-format.

------=_NextPart_000_0018_01CFEC72.CE424470
Content-Type: text/plain;
    charset="utf-8"
Content-Transfer-Encoding: quoted-printable

This is a test ... the above "Det här är" chunk will be cut off at "Det h", and nothing else will arrive.

------=_NextPart_000_0018_01CFEC72.CE424470

The above gets cropped just after the headers, and all that arrives is "Det h". Somehow, non-ASCII characters (here, ä) cause mailparse to choke when they appear outside the headers or the multipart wrappers. The cause may be the 5-year-old Swedish version of Microsoft Windows Live Mail the client is using, which could be messing up the headers, but that's no excuse: I need to be able to receive it.

I'm running PHP 5.4.30, with default_charset = "utf-8" in php.ini. However, phpinfo() showed mailparse.def_charset = "us-ascii" by default, even though php.ini had no setting for it. After adding the line and setting it to "utf8", phpinfo() showed utf-8 correctly, but the error persists. I'm out of ideas.

Any suggestions on how to deal with this error?


Solution

  • Just an idea that I mentioned in the comments. The code below handles a single section (part) of the message. If decoding fails for some reason, the content is returned as is. You can then try to decode it yourself based on $headers['transfer-encoding'], or leave it untouched. $email is the full message source, including headers. $section is the part obtained from mailparse_msg_get_part (see the manual and its examples).

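    // Metadata for this part, including the byte offsets of its body
    $headers = mailparse_msg_get_part_data($section);
    $content = '';
    
    // If extraction raises an error, fall back to the raw, undecoded body
    set_error_handler(function() use(&$content, $headers, $email){
         $start   = $headers['starting-pos-body'];
         $end     = $headers['ending-pos-body'];
         $content = substr($email, $start, $end - $start);
    });
    
    // Without a callback, the decoded part is printed, so capture the output
    ob_start();
    mailparse_msg_extract_part($section, $email);
    $body = ob_get_clean();
    
    restore_error_handler();
    
    // Prefer the raw fallback if the error handler filled it in
    if ($content !== '') $body = $content;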
    

    Result (after some manipulation, since I keep only the headers I actually need):

    ["charset"]=>
    string(5) "utf-8"
    ["content-charset"]=>
    string(5) "utf-8"
    ["content-type"]=>
    string(10) "text/plain"
    ["content"]=>
    string(108) "This is a test ... the above "Det här är" chunk will be cut off at "Det h", and nothing else will arrive. "
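
The suggestion above to decode the raw fallback content based on $headers['transfer-encoding'] could be sketched roughly as follows. This is only an illustration, not something mailparse provides: decode_part_body is a hypothetical helper name, the sample string is made up, and the code assumes PHP 7 or later (for the scalar type declarations) and that the transfer-encoding value is one of the standard MIME values.

```php
<?php
// Hypothetical helper (not part of mailparse): decode a raw MIME part body
// according to its Content-Transfer-Encoding value.
function decode_part_body(string $raw, string $transferEncoding): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'quoted-printable':
            return quoted_printable_decode($raw);

        case 'base64':
            // Strip line breaks and other whitespace before strict decoding
            $compact = preg_replace('/\s+/', '', $raw);
            $decoded = base64_decode($compact, true);
            // Malformed base64: keep the raw content rather than lose it
            return $decoded === false ? $raw : $decoded;

        default:
            // 7bit, 8bit, binary, or anything unknown: leave untouched
            return $raw;
    }
}

// Made-up sample: a quoted-printable body containing UTF-8 "ä"
$raw = "Det h=C3=A4r =C3=A4r ett test";
echo decode_part_body($raw, 'quoted-printable'), "\n";
```

In the fallback branch of the answer's code, this would be applied as something like $body = decode_part_body($content, $headers['transfer-encoding']); whether that key is always present for a given part is an assumption worth checking first.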