java regex email-parsing rfc2822

How to properly parse email received headers?


What is the right way to parse the Received headers of an .eml file in order to extract the information for every hop? In particular I need to extract the following information:

I found the following specs, but it appears that there is no standard convention for the format of the received headers, and it may vary depending on the server:

  1. RFC 2821 - Received Lines in Gatewaying
  2. RFC 2822 - Trace Fields
  3. RFC 822 - 4.1 Syntax

The clearest explanation for me was the one in the RFC 822 spec:

 received    =  "Received"    ":"           ; one per relay
                  ["from" domain]           ; sending host
                  ["by"   domain]           ; receiving host
                  ["via"  atom]             ; physical path
                 *("with" atom)             ; link/mail protocol
                  ["id"   msg-id]           ; receiver msg id
                  ["for"  addr-spec]        ; initial form
                   ";"    date-time         ; time received

Considering the following Received headers:

Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with
 HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13
 +0000

Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
 (2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan
 2020 16:34:13 +0000

Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
 (2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com
 (2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend
 Transport; Thu, 9 Jan 2020 16:34:13 +0000

Received: from relay-out.ohc.cu (200.55.138.44) by
 DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000

Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
    by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:43 -0500 (CST)

Received: from relay-out.ohc.cu ([127.0.0.1])
    by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;
    Thu,  9 Jan 2020 11:29:38 -0500 (CST)

Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
    by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:36 -0500 (CST)

Received: from localhost (localhost.localdomain [127.0.0.1])
    by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:40:05 -0500 (CST)

Received: from correo.patrimonio.ohc.cu ([127.0.0.1])
    by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;
    Thu,  9 Jan 2020 11:40:05 -0500 (CST)

Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
    by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;
    Thu,  9 Jan 2020 11:39:53 -0500 (CST)

the fields that vary the most seem to be:

  1. host domain

    e.g.

    • from VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:7:7c::43)
    • by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com
    • from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com (2603:10a6:802::42)
    • by VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:803:11f::30)
    • from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com (2a01:111:f400:7e02::203)
    • by VI1PR0102CA0029.outlook.office365.com (2603:10a6:802::42)
    • from relay-out.ohc.cu (200.55.138.44)
    • by DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246)
    • from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
    • by relay-out.ohc.cu (Postfix)
    • from relay-out.ohc.cu ([127.0.0.1])
    • by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    • from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
    • by relay-out.ohc.cu (Postfix)
    • from localhost (localhost.localdomain [127.0.0.1])
    • by correo.patrimonio.ohc.cu (Postfix)
    • from correo.patrimonio.ohc.cu ([127.0.0.1])
    • by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    • from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
    • by correo.patrimonio.ohc.cu (Postfix)
  2. mail protocol e.g.

    • with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
    • with ESMTP

What is the established approach for extracting this information, given how much the format varies? Other answers on SO discourage using regular expressions for this task, but then how should the parsing be done? A tested regex, or a Java library or code snippet that parses the Received headers and extracts the information above, would be fine for me.


Solution

  • I want to offer the following solution. You can find a full explanation of the regular expression used here.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.LinkedList;
    import java.util.HashMap;
    import java.lang.StringBuilder;
    
    class Rextester {
        public static void main(String[] args) {
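            // Each match is one "keyword value" clause. Group 1 is set only when a new
            // "Received:" header starts; \G chains the remaining clauses of that header.
            // The lazy value (group 3) stops at the next keyword, ";", the next
            // "Received:" or the end of the input.
            // Note: the keywords are matched without word boundaries, so a value that
            // contains a lowercase "by", "id", "for", etc. as a substring is split early.
            Pattern p = Pattern.compile("(?:(Received:)|\\G(?!\\A))" +
                                        "\\s*(from|by|with|id|via|for|;)" +
                                        "\\s*(\\S+?(?:\\s+\\S+?)*?)\\s*" +
                                        "(?=Received:|by|with|id|via|for|;|\\z)");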
            String text = "Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
                          " (2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with\n" +
                          " HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13\n" +
                          " +0000\n" +
                          "\n" +
                          "Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com\n" +
                          " (2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
                          " (2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,\n" +
                          " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan\n" +
                          " 2020 16:34:13 +0000\n" +
                          "\n" +
                          "Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com\n" +
                          " (2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com\n" +
                          " (2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,\n" +
                          " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend\n" +
                          " Transport; Thu, 9 Jan 2020 16:34:13 +0000\n" +
                          "\n" +
                          "Received: from relay-out.ohc.cu (200.55.138.44) by\n" +
                          " DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP\n" +
                          " Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id\n" +
                          " 15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000\n" +
                          "\n" +
                          "Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])\n" +
                          "    by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD\n" +
                          "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:43 -0500 (CST)\n" +
                          "\n" +
                          "Received: from relay-out.ohc.cu ([127.0.0.1])\n" +
                          "    by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
                          "    with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;\n" +
                          "    Thu,  9 Jan 2020 11:29:38 -0500 (CST)\n" +
                          "\n" +
                          "Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])\n" +
                          "    by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5\n" +
                          "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:36 -0500 (CST)\n" +
                          "\n" +
                          "Received: from localhost (localhost.localdomain [127.0.0.1])\n" +
                          "    by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001\n" +
                          "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:40:05 -0500 (CST)\n" +
                          "\n" +
                          "Received: from correo.patrimonio.ohc.cu ([127.0.0.1])\n" +
                          "    by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
                          "    with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;\n" +
                          "    Thu,  9 Jan 2020 11:40:05 -0500 (CST)\n" +
                          "\n" +
                          "Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])\n" +
                          "    by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;\n" +
                          "    Thu,  9 Jan 2020 11:39:53 -0500 (CST)";
            LinkedList<HashMap<String, String>> data = new LinkedList<HashMap<String, String>>();
            HashMap<String, String> e;
            StringBuilder sb = new StringBuilder(4096);
            Matcher m = p.matcher(text);
            while (m.find()) {
                if (m.group(1) != null) {
                    data.add(new HashMap<String, String>());
                }
                e = data.getLast();
                e.put(m.group(2), m.group(3));
            }
            sb.append("[");
            data.stream().forEach((x) -> sb.append(x).append(",\n"));
            if (sb.length() > 2) {
                sb.setLength(sb.length() - 2);
            }
            sb.append("]");
            System.out.println(sb);
        }
    }
    

    Output:

    [{with=HTTPS, by=HE1PR0102MB2714.eurprd01.prod.exchangelabs.com, from=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
     (2603:10a6:7:7c::43), ;=Thu, 9 Jan 2020 16:34:13
     +0000, via=HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM},
    {with=Microsoft SMTP Server (version=TLS1_2,
     cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
     (2603:10a6:803:11f::30), from=VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
     (2603:10a6:802::42), id=15.20.2602.12, ;=Thu, 9 Jan
     2020 16:34:13 +0000},
    {with=Microsoft SMTP Server (version=TLS1_2,
     cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VI1PR0102CA0029.outlook.office365.com
     (2603:10a6:802::42), from=DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
     (2a01:111:f400:7e02::203), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:13 +0000, via=Frontend
     Transport},
    {with=Microsoft SMTP
     Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246), from=relay-out.ohc.cu (200.55.138.44), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:12 +0000, via=Frontend Transport},
    {with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]), id=69EA722DD, ;=Thu,  9 Jan 2020 11:29:43 -0500 (CST)},
    {with=ESMTP, by=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=relay-out.ohc.cu ([127.0.0.1]), id=7CZku5Y59vGC, ;=Thu,  9 Jan 2020 11:29:38 -0500 (CST)},
    {with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu (unknown [192.168.229.20]), id=B83BA22F5, ;=Thu,  9 Jan 2020 11:29:36 -0500 (CST)},
    {with=ESMTP, by=correo.patrimonio.ohc.cu (Postfix), for=<some.email@some.domain>, from=localhost (localhost.localdomain [127.0.0.1]), id=65413232A001, ;=Thu,  9 Jan 2020 11:40:05 -0500 (CST)},
    {with=ESMTP, by=localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu ([127.0.0.1]), id=hNMp-6lHHtzH, ;=Thu,  9 Jan 2020 11:40:05 -0500 (CST)},
    {with=ESMTPA, by=correo.patrimonio.ohc.cu (Postfix), from=correoweb.patrimonio.ohc.cu (unknown [192.168.229.23]), id=EC62A232A00A, ;=Thu,  9 Jan 2020 11:39:53 -0500 (CST)}]
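
The ";" entries in the output above are still raw strings: some contain the line breaks of the folded header, some have double spaces, and some end with a comment such as (CST). If a real timestamp is needed, one possible follow-up step is sketched below. It assumes the value is an RFC 822 style date with a numeric offset and at most non-nested comments. The class name ReceivedDateSketch and the method parseReceivedDate are hypothetical and not part of the solution above.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class ReceivedDateSketch {
    // Hypothetical helper: turns the raw ";" value of one hop into an OffsetDateTime.
    // Assumption: comments are not nested and the offset is numeric (e.g. -0500).
    static OffsetDateTime parseReceivedDate(String raw) {
        String cleaned = raw
                .replaceAll("\\([^()]*\\)", " ") // drop comments such as "(CST)"
                .replaceAll("\\s+", " ")          // collapse folding and repeated spaces
                .trim();
        // RFC_1123_DATE_TIME accepts one- or two-digit days and +HHMM offsets.
        return OffsetDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(parseReceivedDate("Thu, 9 Jan\n 2020 16:34:13 +0000"));
        System.out.println(parseReceivedDate("Thu,  9 Jan 2020 11:29:43 -0500 (CST)"));
    }
}
```

Once the dates are parsed, the hops can be sorted or compared by instant, which makes oddities such as the clock difference between correo.patrimonio.ohc.cu and relay-out.ohc.cu in the sample easier to spot.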
    

    Demo.
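
The "from" and "by" values produced above still mix the host name with the parenthesized extras (IP address, reverse lookup, software name) that differ from server to server. A minimal sketch of separating the two is shown below. It assumes the host is everything before the first parenthesis and that the comments are not nested. The class name HostCommentSketch and the method split are hypothetical and not part of the original solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostCommentSketch {
    // Matches one non-nested parenthesized comment and captures its content.
    private static final Pattern COMMENT = Pattern.compile("\\(([^()]*)\\)");

    // Hypothetical helper: returns the host first, followed by each comment's content.
    static List<String> split(String value) {
        // Collapse folded lines and repeated whitespace into single spaces.
        String unfolded = value.replaceAll("\\s+", " ").trim();
        int firstParen = unfolded.indexOf('(');
        List<String> parts = new ArrayList<>();
        // Assumption: the host is the text before the first "(" (or the whole value).
        parts.add((firstParen < 0 ? unfolded : unfolded.substring(0, firstParen)).trim());
        Matcher m = COMMENT.matcher(unfolded);
        while (m.find()) {
            parts.add(m.group(1).trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split(
            "relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)"));
        System.out.println(split(
            "VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n (2603:10a6:7:7c::43)"));
    }
}
```

What each comment means (an address, a reverse DNS name, or a product name) depends on the server, so any further interpretation has to be per-vendor heuristics rather than something the RFC grammar guarantees.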