What is the right way to parse the Received headers of a .eml file in order to extract the information for every hop? In particular I need to extract the following information:
I found several specs on this, but there appears to be no single convention for the format of the Received headers, and it may vary from server to server.
The clearest explanation for me was the one in the RFC 822 spec:
received = "Received" ":"       ; one per relay
              ["from" domain]      ; sending host
              ["by" domain]        ; receiving host
              ["via" atom]         ; physical path
             *("with" atom)        ; link/mail protocol
              ["id" msg-id]        ; receiver msg id
              ["for" addr-spec]    ; initial form
               ";" date-time       ; time received
Consider the following Received headers:
Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with
HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13
+0000
Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
(2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan
2020 16:34:13 +0000
Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
(2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com
(2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend
Transport; Thu, 9 Jan 2020 16:34:13 +0000
Received: from relay-out.ohc.cu (200.55.138.44) by
DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000
Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD
for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:43 -0500 (CST)
Received: from relay-out.ohc.cu ([127.0.0.1])
by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;
Thu, 9 Jan 2020 11:29:38 -0500 (CST)
Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5
for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:36 -0500 (CST)
Received: from localhost (localhost.localdomain [127.0.0.1])
by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001
for <some.email@some.domain>; Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correo.patrimonio.ohc.cu ([127.0.0.1])
by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;
Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;
Thu, 9 Jan 2020 11:39:53 -0500 (CST)
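Each of the headers above spans several physical lines, so whatever extraction is used afterwards, the folded lines first have to be joined into one logical value per hop. The following is only a minimal sketch of that step, assuming continuation lines start with a space or tab (as header folding in RFC 5322 prescribes) and that a blank line ends the header section. The class name HeaderUnfolder and the method receivedValues are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class HeaderUnfolder {
    // Collects the values of all "Received" headers from a raw header block,
    // joining folded continuation lines with a single space.
    static List<String> receivedValues(String rawHeaders) {
        List<String> values = new ArrayList<>();
        StringBuilder current = null;            // non-null while inside a Received header
        for (String line : rawHeaders.split("\r?\n")) {
            if (line.isEmpty()) {
                break;                           // a blank line ends the header section
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation) {
                if (current != null) {
                    current.append(' ').append(line.trim());
                }
                continue;                        // continuation of some other header: ignore
            }
            if (current != null) {               // a new header starts: close the previous one
                values.add(current.toString());
                current = null;
            }
            if (line.regionMatches(true, 0, "Received:", 0, 9)) {
                current = new StringBuilder(line.substring(9).trim());
            }
        }
        if (current != null) {
            values.add(current.toString());
        }
        return values;
    }
}
```

The returned list keeps the order of the file, so the first element is the most recent hop and the last element is the hop closest to the sender.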
The fields that vary the most seem to be the host domain and the mail protocol.
What is the established approach for extracting this information, given how much the format varies? Other answers on SO discourage the use of regular expressions for this task, but then how can one do the parsing? A well-tested regex, or some Java code or library that parses the Received headers and extracts the above information, would be fine for me.
I want to offer the following solution. You can find a full explanation of the regular expression used here.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.LinkedList;
import java.util.HashMap;

class Rextester {
    public static void main(String[] args) {
        // Group 1 marks the start of a new header; \G continues right after the previous match.
        // Group 2 is the clause keyword (";" introduces the date) and group 3 its value,
        // captured lazily up to the next keyword, the next header, or the end of the input.
        // Note: the keywords are matched without word boundaries, so a value that merely
        // contains "by", "id", "for", ... as a substring (e.g. "david") is split too early.
        Pattern p = Pattern.compile("(?:(Received:)|\\G(?!\\A))" +
            "\\s*(from|by|with|id|via|for|;)" +
            "\\s*(\\S+?(?:\\s+\\S+?)*?)\\s*" +
            "(?=Received:|by|with|id|via|for|;|\\z)");
        String text = "Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
            " (2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with\n" +
            " HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13\n" +
            " +0000\n" +
            "\n" +
            "Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com\n" +
            " (2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
            " (2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,\n" +
            " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan\n" +
            " 2020 16:34:13 +0000\n" +
            "\n" +
            "Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com\n" +
            " (2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com\n" +
            " (2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,\n" +
            " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend\n" +
            " Transport; Thu, 9 Jan 2020 16:34:13 +0000\n" +
            "\n" +
            "Received: from relay-out.ohc.cu (200.55.138.44) by\n" +
            " DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP\n" +
            " Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id\n" +
            " 15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000\n" +
            "\n" +
            "Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])\n" +
            " by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD\n" +
            " for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:43 -0500 (CST)\n" +
            "\n" +
            "Received: from relay-out.ohc.cu ([127.0.0.1])\n" +
            " by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
            " with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;\n" +
            " Thu, 9 Jan 2020 11:29:38 -0500 (CST)\n" +
            "\n" +
            "Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])\n" +
            " by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5\n" +
            " for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:36 -0500 (CST)\n" +
            "\n" +
            "Received: from localhost (localhost.localdomain [127.0.0.1])\n" +
            " by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001\n" +
            " for <some.email@some.domain>; Thu, 9 Jan 2020 11:40:05 -0500 (CST)\n" +
            "\n" +
            "Received: from correo.patrimonio.ohc.cu ([127.0.0.1])\n" +
            " by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
            " with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;\n" +
            " Thu, 9 Jan 2020 11:40:05 -0500 (CST)\n" +
            "\n" +
            "Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])\n" +
            " by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;\n" +
            " Thu, 9 Jan 2020 11:39:53 -0500 (CST)";
        // One map per Received header, in the order the headers appear in the text.
        LinkedList<HashMap<String, String>> data = new LinkedList<HashMap<String, String>>();
        HashMap<String, String> e;
        StringBuilder sb = new StringBuilder(4096);
        Matcher m = p.matcher(text);
        while (m.find()) {
            // A match that starts with "Received:" opens a new hop.
            if (m.group(1) != null) {
                data.add(new HashMap<String, String>());
            }
            // Store keyword -> value in the map of the current hop.
            e = data.getLast();
            e.put(m.group(2), m.group(3));
        }
        // Print the maps as a bracketed list, one hop per entry.
        sb.append("[");
        data.stream().forEach((x) -> sb.append(x).append(",\n"));
        if (sb.length() > 2) {
            sb.setLength(sb.length() - 2); // drop the trailing ",\n"
        }
        sb.append("]");
        System.out.println(sb);
    }
}
Output:
[{with=HTTPS, by=HE1PR0102MB2714.eurprd01.prod.exchangelabs.com, from=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:7:7c::43), ;=Thu, 9 Jan 2020 16:34:13
+0000, via=HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM},
{with=Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:803:11f::30), from=VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
(2603:10a6:802::42), id=15.20.2602.12, ;=Thu, 9 Jan
2020 16:34:13 +0000},
{with=Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VI1PR0102CA0029.outlook.office365.com
(2603:10a6:802::42), from=DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
(2a01:111:f400:7e02::203), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:13 +0000, via=Frontend
Transport},
{with=Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246), from=relay-out.ohc.cu (200.55.138.44), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:12 +0000, via=Frontend Transport},
{with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]), id=69EA722DD, ;=Thu, 9 Jan 2020 11:29:43 -0500 (CST)},
{with=ESMTP, by=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=relay-out.ohc.cu ([127.0.0.1]), id=7CZku5Y59vGC, ;=Thu, 9 Jan 2020 11:29:38 -0500 (CST)},
{with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu (unknown [192.168.229.20]), id=B83BA22F5, ;=Thu, 9 Jan 2020 11:29:36 -0500 (CST)},
{with=ESMTP, by=correo.patrimonio.ohc.cu (Postfix), for=<some.email@some.domain>, from=localhost (localhost.localdomain [127.0.0.1]), id=65413232A001, ;=Thu, 9 Jan 2020 11:40:05 -0500 (CST)},
{with=ESMTP, by=localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu ([127.0.0.1]), id=hNMp-6lHHtzH, ;=Thu, 9 Jan 2020 11:40:05 -0500 (CST)},
{with=ESMTPA, by=correo.patrimonio.ohc.cu (Postfix), from=correoweb.patrimonio.ohc.cu (unknown [192.168.229.23]), id=EC62A232A00A, ;=Thu, 9 Jan 2020 11:39:53 -0500 (CST)}]
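The ";" values in this output are still raw: some contain the line breaks of the folded header, and the Postfix ones end in a comment such as (CST). A possible post-processing step, sketched below under the assumption that the time stamps use the four-digit-year form with a numeric offset seen in the sample, is to strip comments, collapse whitespace, and parse the result with the JDK's DateTimeFormatter.RFC_1123_DATE_TIME. The class name HopTime is a hypothetical name for this illustration, and older formats (two-digit years, zone names such as EST) would need extra handling.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class HopTime {
    // Parses the text that follows ';' in a Received header.
    static OffsetDateTime parse(String dateClause) {
        String cleaned = dateClause
                .replaceAll("\\([^()]*\\)", " ")   // drop comments such as "(CST)"
                .replaceAll("\\s+", " ")           // collapse folding whitespace
                .trim();
        return OffsetDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(String[] args) {
        // Time stamps of two adjacent hops taken from the sample headers.
        OffsetDateTime handedOff = parse("Thu, 9 Jan 2020 11:29:43 -0500 (CST)");
        OffsetDateTime accepted = parse("Thu, 9 Jan 2020 16:34:12\n +0000");
        // Offsets are taken into account, so this is the real delay between the hops.
        System.out.println(Duration.between(handedOff, accepted)); // prints PT4M29S
    }
}
```

Working with OffsetDateTime rather than the raw strings makes it possible to sort hops or compute per-hop delays even when the servers report different time zones.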
Demo.
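For readers who would rather not depend on a regular expression, one alternative is a small hand-written scanner that tracks parenthesis depth, so that keywords and semicolons inside comments are never treated as clause boundaries. This is only a sketch, not a full RFC 5322 parser: it assumes the header value has already been unfolded into a single line and has the "Received:" prefix removed, it recognises keywords only as whole top-level words, and it does not handle quoted strings or escaped parentheses. ReceivedScanner and scan are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReceivedScanner {
    private static final List<String> KEYWORDS =
            Arrays.asList("from", "by", "via", "with", "id", "for");

    // Splits one unfolded Received value into clauses keyed by keyword, plus "date".
    static Map<String, String> scan(String value) {
        Map<String, String> clauses = new LinkedHashMap<>();
        String key = null;                          // keyword of the clause being collected
        StringBuilder clause = new StringBuilder(); // text collected for that keyword
        StringBuilder word = new StringBuilder();   // current word, or a whole "(comment)"
        int depth = 0;                              // nesting depth of parentheses
        for (int i = 0; i <= value.length(); i++) {
            char c = i < value.length() ? value.charAt(i) : ' '; // trailing space flushes the last word
            boolean topLevel = depth == 0 && c != '(';
            boolean dateStart = topLevel && c == ';';
            if (topLevel && (dateStart || Character.isWhitespace(c))) {
                String w = word.toString();
                word.setLength(0);
                if (KEYWORDS.contains(w.toLowerCase())) {
                    store(clauses, key, clause);    // close the previous clause
                    key = w.toLowerCase();
                } else if (!w.isEmpty()) {
                    clause.append(clause.length() > 0 ? " " : "").append(w);
                }
                if (dateStart) {                    // everything after ';' is the time stamp
                    store(clauses, key, clause);
                    clauses.put("date", value.substring(i + 1).trim());
                    return clauses;
                }
                continue;
            }
            word.append(c);
            if (c == '(') depth++;
            else if (c == ')' && depth > 0) depth--;
        }
        if (word.length() > 0) {                    // unbalanced "(": keep the leftover text
            clause.append(' ').append(word.toString().trim());
        }
        store(clauses, key, clause);
        return clauses;
    }

    // Saves the collected clause; a repeated keyword (e.g. two "with") is concatenated.
    private static void store(Map<String, String> clauses, String key, StringBuilder clause) {
        if (key != null) {
            clauses.merge(key, clause.toString(), (a, b) -> a + " " + b);
        }
        clause.setLength(0);
    }
}
```

Calling scan on the amavisd-new hop from the sample yields the keys from, by, with, id, for and date, with the two comments kept as part of the by value. A value such as "from a (sent by proxy; id 7) by b; ..." is also handled, because the words inside the parentheses are skipped as a unit.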