Parsing e-mails

I am trying to split mail-files like this one:

Message-ID: <53197.1075859003723.JavaMail.evans@thyme>
Date: Tue, 23 Oct 2001 10:31:09 -0700 (PDT)
From: scott.dozier@enron.com
To: tom.donohoe@enron.com, bonnie.chang@enron.com, m..love@enron.com
Subject: RE: CMS Deal #1027152
Cc: lisa.valderrama@enron.com, thomas.mcfatridge@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: lisa.valderrama@enron.com, thomas.mcfatridge@enron.com
X-From: Dozier, Scott </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SDOZIER>
X-To: Donohoe, Tom </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tdonoho>, Chang, Bonnie </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Bchang>, Love, Phillip M. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Plove>
X-cc: Valderrama, Lisa </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lvalde2>, McFatridge, Thomas </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tmcfatri>
X-bcc: 
X-Folder: \TDONOHO (Non-Privileged)\Inbox
X-Origin: Donohoe-T
X-FileName: TDONOHO (Non-Privileged).pst

I am not sure if they have confirmed either deal.  However, deal #1034254 was never pathed by us, whereas 1027152 was.  Therefore, nothing billed out under 1034254.

Bonnie - I am including you on this note in case you can add anything about the pathing of the two deals mentioned in this note.  Niether CMS orgaination shows anything on Trunkline that matches this.  We spoke briefly about this last week.

Phillip - I am including you in case you can add any clarity or determine who we did this deal(s) with.

Thank you,
Scott
5-7213

 -----Original Message-----
From:   Donohoe, Tom  
Sent:   Tuesday, October 23, 2001 12:02 PM
To: Dozier, Scott
Subject:    RE: CMS Deal #1027152

if they are not confirming this deal are they confirming 1034254?

 -----Original Message-----
From:   Dozier, Scott  
Sent:   Tuesday, October 23, 2001 9:24 AM
To: Donohoe, Tom
Cc: Valderrama, Lisa; McFatridge, Thomas
Subject:    RE: CMS Deal #1027152
Importance: High

Tom,

In contacting our scheduler and subsequently a CMS scheduler, neither CMS Field Services nor CMS Marketing, Services, and Trading are able to identify the deal.  Currently, I am preparing to fax a copy of our confirmation on the deal to CMS Field Services,  Again, it is not an executed copy, but I am assuming they may not have sent it back.  Furthermore, the CMS Field Services scheduler has told me that they don't even schedule any Trunkline deals.

Considering all of this, I am assuming the worst - that unless we can provide a trader name etc. they will short pay on this deal.  So, do you know who represented us with CMS on this deal any one that might know who their trader is or how this deal was booked?  We are getting ready to settle for Sep prod so any help asap would be appreciated.

Scott
5-7213

 -----Original Message-----
From:   Dozier, Scott  
Sent:   Thursday, October 18, 2001 12:21 PM
To: Donohoe, Tom
Subject:    RE: CMS Deal #1027152

They do not recognize that deal at all.

The most recent name and number is a Conoco trader.  I have a confirmation on this deal with CMS.  However, it is not an executed copy (i.e. sent back or confirmed by CMS).  Is there some one who represented us with CMS on this that might know who their trader is or how this deal was booked?  I will attempt to contact the scheduler in the mean time but any help would be good.

thanks.

into several files, like this:

Header
Body
Original message 1
Original message 2 
...

I've already read some posts about splitting mails, and it seems that Mime4j should be a good fit. So I tried this:

public class test {

    public static void main(String[] args) throws IOException, MimeException {
        // TODO Auto-generated method stub
        MimeTokenStream stream = new MimeTokenStream();
        stream.parse(new FileInputStream("test"));
        File header = new File ("header");
        File body = new File ("body");
        BufferedWriter headerWriter = new BufferedWriter(new FileWriter(header));
        BufferedWriter bodyWriter = new BufferedWriter(new FileWriter(body));
        String str;
        for (EntityState state = stream.getState();
                state != EntityState.T_END_OF_STREAM;
                state = stream.next()) {
            switch (state) {
              case T_BODY:
                  str = stream.getInputStream().toString();
                  bodyWriter.write(str);
                break;
              case T_FIELD:
                  str = stream.getField().toString() + "\n";
                  headerWriter.write(str);
                break;
            }
          }
        headerWriter.close();
        bodyWriter.close();

    }

}

This code correctly splits the mail into two files: header and body. There's probably a better way to do it, but I don't find the Mime4j Javadoc very helpful, and I'm still trying to fully understand how it works.

However, I have two problems:

1) The body starts with a line, obviously generated by Mime4j, that looks like this:

[LineReaderInputStreamAdaptor: [pos: 937][limit: 4096][

and I don't know how to get rid of it.

2) The "original messages" are all in the body. I don't know how to split the body into more parts according to those "original messages". Moreover, not all the mails have this format. Sometimes original messages are "revealed" only by a tab or a > character before each line, or just by the little "from, to" header, or by another line like -------forwarded--------, etc. So I can't split on one fixed format.

I thought Mime4j would recognise those parts as "Multipart" messages, but it seems not (there is a T_START_MULTIPART case, but it never matched anything).

Solution

You get that odd-looking text from the stream.getInputStream().toString() call, whose result you write to the body file.

The toString() method is mainly for debugging. Calling it on that InputStream doesn't return the stream's contents (which could be a lot), but just a description of that stream, and that's what you see.

To get that stream's data, you need to read it from the input stream and copy it to the output stream. See this answer for various ways of doing this.
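As a minimal sketch of that copy step, the loop below reads the stream in chunks and writes each chunk to the output. The class name StreamCopy and the method copy are made up for illustration; only the java.io calls are standard. The main method uses an in-memory stream as a stand-in for the body stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class StreamCopy {
    // Hypothetical helper: copies every byte from in to out and
    // returns the number of bytes copied. It does not close either stream.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the body stream; real input would come from the parser.
        InputStream body = new ByteArrayInputStream(
                "Thank you,\r\nScott\r\n".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(body, sink);
        System.out.println(copied + " bytes copied");
    }
}
```

In the T_BODY case you would pass stream.getInputStream() as the source and a FileOutputStream for the body file as the destination, instead of calling toString() on the stream.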

As far as the original messages go: your example is one email message. It only has 1 MIME part, the plain text part. People just copied the original message and put their answer on top, above the message they are replying to.

If they forwarded the message as an attachment, the MIME structure would look different: you'd see a Content-Type: multipart/mixed; boundary="..." header, and that boundary text would then separate the individual messages. Apache James Mime4j would probably detect them and handle them correctly.

MIME multipart is used for attachments, or for alternative parts of an email (plain text vs html). It does not refer to people top-posting their replies.

Since your example email does not have that MIME structure, your best bet is to parse the email body manually, looking for -----Original Message-----. Note that this is brittle: you don't know which markers people's mail clients use, and people may modify them by hand, perhaps by accident.

import org.apache.james.mime4j.stream.EntityState;
import org.apache.james.mime4j.stream.MimeTokenStream;
import java.io.*;

public class Library {
    private static final String SEP = "-----Original Message-----";
    private static final String CRLF = "\r\n";

    // Index of the part file currently being written (0.eml, 1.eml, ...).
    static int fileNo = 0;

    public static void main(String[] args) throws Exception {
        MimeTokenStream stream = new MimeTokenStream();
        try (InputStream in = new FileInputStream(args[0]);
             BufferedWriter headerWriter = new BufferedWriter(new FileWriter("header"))) {
            stream.parse(in);
            for (EntityState state = stream.getState();
                    state != EntityState.T_END_OF_STREAM;
                    state = stream.next()) {
                switch (state) {
                case T_BODY:
                    // Read the body's contents; toString() would only describe the stream.
                    writePart(new BufferedReader(new InputStreamReader(stream.getInputStream())));
                    break;
                case T_FIELD:
                    headerWriter.write(stream.getField().toString());
                    headerWriter.write(CRLF);
                    break;
                default:
                    // Other parser states are not needed here.
                    break;
                }
            }
        }
    }

    // Writes the body to 0.eml and starts a new file at every separator line.
    private static void writePart(BufferedReader in) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(fileNo + ".eml"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Compare the trimmed line so surrounding whitespace does not hide the marker.
                if (SEP.equals(line.trim())) {
                    out.close();
                    fileNo++;
                    out = new BufferedWriter(new FileWriter(fileNo + ".eml"));
                }
                out.write(line);
                out.write(CRLF);
            }
        } finally {
            out.close();
        }
    }
}
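
Since the question mentions other markers such as -------forwarded--------, here is a hedged sketch of a more tolerant separator check based on a regular expression. The class QuoteSplitter, its method names, and the list of markers in the pattern are assumptions for illustration, not something Mime4j provides, and the pattern will still miss replies quoted only with tabs or > characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class QuoteSplitter {
    // Assumed marker shapes: a line of the form
    //   ----- Original Message -----   or   ------- Forwarded ... -------
    // with at least two dashes on each side, ignoring case and outer whitespace.
    private static final Pattern SEPARATOR = Pattern.compile(
            "\\s*-{2,}\\s*(original message|forwarded\\b.*?)\\s*-{2,}\\s*",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the whole line looks like one of the assumed markers.
    public static boolean isSeparator(String line) {
        return SEPARATOR.matcher(line).matches();
    }

    // Splits a body into parts; each separator line starts a new part
    // and is kept as the first line of that part.
    public static List<String> split(String body) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : body.split("\\r?\\n")) {
            if (isSeparator(line)) {
                parts.add(current.toString());
                current = new StringBuilder();
            }
            current.append(line).append("\r\n");
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        String body = "Thank you,\r\nScott\r\n -----Original Message-----\r\nFrom: Donohoe, Tom\r\n";
        List<String> parts = split(body);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("--- part " + i + " ---");
            System.out.print(parts.get(i));
        }
    }
}
```

Replacing the SEP.equals(...) comparison with a check like isSeparator(line) would let the same loop handle several marker styles, at the cost of occasional false positives on dashed lines that happen to contain those words.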