I want to read EML files and extract the plain text. So far I have found TIdMessage, with which I can iterate over TIdMessage.MessageParts and check whether each part's PartType is mptText. All of that works quite well.
My problem is reading messages correctly when TIdMessage.Encoding = TIdMessageEncoding.meMIME. I just can't figure out the logic of that format. I would like to get the whole text, without tags, from the EML file. Is there always a text/plain part in a mail?
So far I have the following two functions, which return the HTML content of a message:
function GetMultiPartAlternative(aMsg: TIdMessage; aParentIndex, aLastIndex: Integer): String;
var
  Part: TIdMessagePart;
  i: Integer;
begin
  Result := '';
  for i := aLastIndex - 1 downto aParentIndex + 1 do
  begin
    Part := aMsg.MessageParts.Items[i];
    if { (Part.ParentPart = aParentIndex) and } (Part is TIdText) then
    begin
      if Part.ContentType.StartsWith('text/html') then
      begin
        Result := (Part as TIdText).Body.Text;
        Exit;
      end
      else if Part.ContentType.StartsWith('text/plain') then
      begin
        Result := (Part as TIdText).Body.Text;
        Exit;
      end;
    end;
  end;
end;

function GetMultiPartMixed(aMsg: TIdMessage; aParentIndex, aLastIndex: Integer): String;
var
  Part: TIdMessagePart;
  i: Integer;
begin
  Result := '';
  for i := aLastIndex - 1 downto aParentIndex + 1 do
  begin
    Part := aMsg.MessageParts.Items[i];
    if { (Part.ParentPart = aParentIndex) and } (Part is TIdText) then
    begin
      if Part.ContentType.StartsWith('multipart/alternative') then
      begin
        Result := GetMultiPartAlternative(aMsg, aParentIndex, aLastIndex);
        Exit;
      end
      else if Part.ContentType.StartsWith('text/html') then
      begin
        Result := (Part as TIdText).Body.Text;
        Exit;
      end
      else if Part.ContentType.StartsWith('text/plain') then
      begin
        Result := (Part as TIdText).Body.Text;
        Exit;
      end;
      aLastIndex := i;
    end;
  end;
end;
TIdMessage uses the MessageParts collection for MIME emails. Your code is fine for accessing individual MIME parts (and +1 for iterating the parts in the correct order!). Simply ignore HTML parts if you are only interested in PlainText parts.
Is there always a text/plain part in a mail?
Unfortunately, no. It depends on what formats the sender decides to include. It is customary but not required for an HTML email to include a PlainText alternative for readers that don't understand HTML.
Please read this article on Indy's blog: HTML Messages (it's geared towards sending emails, but it does describe the TIdMessage layout for common scenarios you can encounter when reading emails, too).
Having HTML without a PlainText alternative is a real possibility you need to account for. If there is no PlainText provided then you will have to parse out the text from the HTML instead.
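For the HTML-only case, a very rough fallback could look like the sketch below. StripHtmlTags is a hypothetical helper, not something Indy provides. It simply drops everything between '<' and '>' and decodes a handful of entities, so it will leave the contents of script and style blocks in the output and does not know the full entity table. A real HTML parser is the safer choice for anything beyond simple mails.

```pascal
program StripTagsDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of Indy): removes everything between
// '<' and '>' and decodes a few common entities.
function StripHtmlTags(const aHtml: string): string;
var
  i: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for i := 1 to Length(aHtml) do
  begin
    if InTag then
    begin
      // Inside a tag: skip characters until the closing '>'.
      if aHtml[i] = '>' then
        InTag := False;
    end
    else if aHtml[i] = '<' then
      InTag := True
    else
      Result := Result + aHtml[i];
  end;
  Result := StringReplace(Result, '&nbsp;', ' ', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  // Decode '&amp;' last so that '&amp;lt;' becomes '&lt;' and not '<'.
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
end;

begin
  // prints: Hello World & all
  WriteLn(StripHtmlTags('<p>Hello <b>World</b> &amp; all</p>'));
end.
```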
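On a side note: you should not use ContentType.StartsWith('...'), as that is not very accurate. It is case-sensitive by default, whereas media types are not, and it also matches any longer type that merely begins with the same characters. Use IsHeaderMediaType(ContentType, '...') instead, e.g.: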
if IsHeaderMediaType(Part.ContentType, 'text/html') then
IsHeaderMediaType() is declared in the IdGlobalProtocols unit.
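Putting the points above together, a minimal sketch of "prefer text/plain, otherwise fall back to text/html" might look like this. The unit name MailTextSketch, the function ExtractBodyText and its aIsHtml flag are made up for illustration. The sketch also assumes that Indy leaves the text of a message without MIME parts in TIdMessage.Body, and it scans the parts from the first to the last because it is looking for one specific media type rather than the richest alternative.

```pascal
unit MailTextSketch;

interface

uses
  IdMessage;

// Hypothetical helper: returns the first text/plain part if there is one,
// otherwise the first text/html part. aIsHtml tells the caller whether
// the result still contains markup that needs to be stripped.
function ExtractBodyText(aMsg: TIdMessage; out aIsHtml: Boolean): string;

implementation

uses
  IdMessageParts, IdText, IdGlobalProtocols;

function ExtractBodyText(aMsg: TIdMessage; out aIsHtml: Boolean): string;
var
  i: Integer;
  Part: TIdMessagePart;
  Html: string;
  HtmlFound: Boolean;
begin
  aIsHtml := False;

  // Assumption: a message without parts keeps its text in aMsg.Body.
  if aMsg.MessageParts.Count = 0 then
  begin
    aIsHtml := IsHeaderMediaType(aMsg.ContentType, 'text/html');
    Result := aMsg.Body.Text;
    Exit;
  end;

  Html := '';
  HtmlFound := False;
  for i := 0 to aMsg.MessageParts.Count - 1 do
  begin
    Part := aMsg.MessageParts.Items[i];
    if not (Part is TIdText) then
      Continue; // attachments and other non-text parts are ignored

    if IsHeaderMediaType(Part.ContentType, 'text/plain') then
    begin
      Result := TIdText(Part).Body.Text;
      Exit;
    end;

    // Remember the first HTML part in case no plain text shows up.
    if (not HtmlFound) and IsHeaderMediaType(Part.ContentType, 'text/html') then
    begin
      Html := TIdText(Part).Body.Text;
      HtmlFound := True;
    end;
  end;

  aIsHtml := HtmlFound;
  Result := Html;
end;

end.
```

A caller would load the file with TIdMessage.LoadFromFile, call ExtractBodyText, and run the result through an HTML-to-text step only when aIsHtml comes back True.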