java jakarta-mail simple-java-mail

Parse Outlook emails using the outlook-message-parser library


I am trying to load emails from the INBOX of a remote mailbox and parse them to extract the attachments and convert the body to HTML.

I use the code snippet below to parse the messages with the outlook-message-parser jar:

ResultSuccess insertMessage(Message currentMsg) {
    final OutlookMessageParser msgp = new OutlookMessageParser();
    // Passes the content stream of the javax.mail message to the .msg parser
    final OutlookMessage msg = msgp.parseMsg(currentMsg.getInputStream());
}

Here currentMsg is of type javax.mail.Message.

The code that fetches the emails from the server is as follows:

Properties props = new Properties();
Message currentMessage;

Session session = Session.getInstance(props, null);
session.setDebug(debug);

store = session.getStore(PROTOCOL);
store.connect(host, username, password);

Message[] message = inboxfolder.getMessages();
Message[] copyMessage = new Message[1];

int n = message.length;
for (int j = 0; j < n; j++) {
    currentMessage = message[j];
    ResultSuccess result = insertMessage(currentMessage);
}

The exception details are as follows:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x615F3430305F2D2D, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:151)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:285)
    at org.simplejavamail.outlookmessageparser.OutlookMessageParser.parseMsg(OutlookMessageParser.java:133)
    at com.email.Email_Parse.loadMessages(Email_Parse.java:38)
    at com.email.Email_Parse.getMessages(Email_Parse.java:116)
    at com.email.Email_Parse.main(Email_Parse.java:26)

However, the issue doesn't occur when I load emails from the local disk and parse them.

Any idea how to resolve the issue?


Solution

  • I suppose you're using outlook-message-parser to parse the emails stored on disk.

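    Messages retrieved from the mail server are in MIME (RFC 822) format, not in the Outlook .msg file format that outlook-message-parser expects (even if the remote server is a Microsoft Exchange server or Microsoft's Outlook email service), so outlook-message-parser won't be able to parse them. That is what the NotOLE2FileException in your stack trace is reporting.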
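To make the format mismatch concrete, here is a minimal sketch that checks for the OLE2 signature named in the exception message. The class Ole2Sniffer and its methods are hypothetical helpers written for this answer, not part of outlook-message-parser, POI, or JavaMail, and the sketch assumes Java 11 or later (for readNBytes). Decoding the value from the stack trace yields ASCII text that looks like the start of a MIME boundary line, which is consistent with the stream being a MIME body rather than a .msg file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class Ole2Sniffer {
    // The 8-byte signature from the exception (0xE11AB1A1E011CFD0, stored little-endian).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given bytes start with the OLE2 signature. */
    static boolean isOle2Header(byte[] head) {
        return head.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /** Peeks at the first 8 bytes of a stream without consuming them. */
    static boolean startsWithOle2Magic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] head = in.readNBytes(OLE2_MAGIC.length); // Java 11+
        in.reset();
        return isOle2Header(head);
    }

    /** Turns the little-endian long printed in the error message back into the 8 bytes that were read. */
    static String headerAsText(long signature) {
        byte[] bytes = new byte[8];
        for (int i = 0; i < 8; i++) {
            bytes[i] = (byte) (signature >>> (8 * i));
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        // The "read" value from the stack trace in the question.
        System.out.println(headerAsText(0x615F3430305F2D2DL)); // prints --_004_a

        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream("--_004_abc".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(startsWithOle2Magic(in)); // prints false
    }
}
```

A check like this could be used to route .msg files from disk to outlook-message-parser and everything else to JavaMail.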

    You should use the JavaMail API to retrieve the body of the message and its attachments.

    The JavaMail FAQ has a description (with a few examples) of the steps needed to read a message with attachments. Here is an excerpt:

    Q: How do I read a message with an attachment and save the attachment?

    A: As described above, a message with an attachment is represented in MIME as a multipart message. In the simple case, the results of the Message object's getContent method will be a MimeMultipart object. The first body part of the multipart object will be the main text of the message. The other body parts will be attachments. The msgshow.java demo program shows how to traverse all the multipart objects in a message and extract the data of each of the body parts. The getDisposition method will give you a hint as to whether the body part should be displayed inline or should be considered an attachment (but note that not all mailers provide this information). So to save the contents of a body part in a file, use the saveFile method of MimeBodyPart.

    To save the data in a body part into a file (for example), use the getInputStream method to access the attachment content and copy the data to a FileOutputStream. Note that when copying the data you can not use the available method to determine how much data is in the attachment. Instead, you must read the data until EOF. The saveFile method of MimeBodyPart will do this for you. However, you should not use the results of the getFileName method directly to name the file to be saved; doing so could cause you to overwrite files unintentionally, including system files.

    Note that there are also more complicated cases to be handled as well. For example, some mailers send the main body as both plain text and html. This will typically appear as a multipart/alternative content (and a MimeMultipart object) in place of a simple text body part. Also, messages that are digitally signed or encrypted are even more complex. Handling all these cases can be challenging. Please refer to the various MIME specifications and other resources listed on our main page.
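The warning in the excerpt about not using getFileName directly can be sketched as follows. This is only one possible approach under stated assumptions: the class SafeAttachmentNames, its method names, and the allowed character set are all hypothetical choices made for this illustration, not something the JavaMail API provides. The raw name would come from Part.getFileName(), which may be null.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class SafeAttachmentNames {
    /** Reduces an untrusted attachment name to a plain file name, or returns the fallback if nothing usable is left. */
    static String sanitize(String rawName, String fallback) {
        if (rawName == null) {
            return fallback;
        }
        // Drop any directory components, whichever separator the sender used.
        int lastSeparator = Math.max(rawName.lastIndexOf('/'), rawName.lastIndexOf('\\'));
        String name = rawName.substring(lastSeparator + 1);
        // Keep a conservative set of characters and replace everything else.
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        // Reject empty names and names made only of dots ("." and "..").
        if (name.isEmpty() || name.matches("\\.+")) {
            return fallback;
        }
        return name;
    }

    /** Picks a path in dir that does not exist yet, adding -1, -2, ... before the extension when needed. */
    static Path uniqueTarget(Path dir, String fileName) {
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        String ext = dot > 0 ? fileName.substring(dot) : "";
        Path candidate = dir.resolve(fileName);
        for (int i = 1; Files.exists(candidate); i++) {
            candidate = dir.resolve(stem + "-" + i + ext);
        }
        return candidate;
    }

    public static void main(String[] args) {
        String name = sanitize("../../etc/passwd", "attachment.bin");
        // The printed path depends on what already exists in the directory.
        System.out.println(uniqueTarget(Paths.get("/tmp/messages"), name));
    }
}
```

The check-then-create step in uniqueTarget is not atomic, so a program that saves attachments concurrently would need something stricter.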

    Emails are not always in HTML; sometimes they are just plain text. Most of the time they are "multipart". For example, an email can have an HTML part that will be displayed by email clients that support HTML (Gmail, Thunderbird ...) and another plain text part that can be used by email clients that can't display HTML (think text-based email clients).

    So before dumping the content of an email you have to check its content type (or, if it has multiple parts, check the content type of each part).

    For the html parts, dumping the content verbatim can give you the desired result depending on how images are referenced.

    If an image is referenced using an http URL (like <img src="https://example.com/a.png"/>) no further work is necessary to display the result in a browser.

    If an image is referenced using a Content-Id URL (like <img src="cid:image002.gif@01D44EB0.904DB790"/>) then you have to do extra work to be able to display the result correctly in a browser.

    You have to look for the correct image in the email parts and decide how to include it in the final result.

    For example, save it to disk and replace the reference in the html with its path on the disk so that <img src="cid:image002.gif@01D44EB0.904DB790"/> becomes something like this <img src="/path/to/saved/images/imagexyz.png"/>

    Or convert it to base64 format and replace the reference in the html with a data URI so that <img src="cid:image002.gif@01D44EB0.904DB790"/> becomes something like this <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="/>.
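The data URI variant can be sketched with the standard library alone. In this sketch, CidInliner, InlineImage, and inlineImages are hypothetical names, and the map of images is assumed to have been filled beforehand from the message parts, keyed by Content-ID without the surrounding angle brackets. It is a simplification: it only handles cid: URLs inside quoted attribute values and does not decode percent-escapes.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class CidInliner {
    /** An inline image taken from the message: its MIME type and raw bytes. */
    static final class InlineImage {
        final String mimeType;
        final byte[] data;

        InlineImage(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    // Matches quoted cid: URLs such as "cid:image002.gif@01D44EB0.904DB790".
    private static final Pattern CID_REF = Pattern.compile("(?i)([\"'])cid:([^\"']+)\\1");

    /** Replaces every cid: reference that has a matching image with a base64 data URI. */
    static String inlineImages(String html, Map<String, InlineImage> imagesByContentId) {
        Matcher m = CID_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String quote = m.group(1);
            InlineImage image = imagesByContentId.get(m.group(2));
            String replacement = m.group(); // unknown references are left untouched
            if (image != null) {
                String dataUri = "data:" + image.mimeType + ";base64,"
                        + Base64.getEncoder().encodeToString(image.data);
                replacement = quote + dataUri + quote;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, InlineImage> images = new HashMap<>();
        images.put("logo@example", new InlineImage("image/png", new byte[] {1, 2, 3}));
        System.out.println(inlineImages("<img src=\"cid:logo@example\"/>", images));
    }
}
```

Data URIs make the HTML file self-contained, at the cost of a larger file; saving the images to disk and rewriting the paths follows the same pattern with a different replacement string.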

    I don't know if there is a Java library that can do this automatically.

    The JavaMail API website provides samples that you can read to learn how to use it. You can check msgshow.java from the samples to see how to use the API to retrieve the content of a message.

    Here is a simple example program that downloads the last message from a Gmail inbox to a local directory. It may have bugs. Don't forget to put in your own account and password, and to replace "/tmp/messages" with a valid directory on your computer. Files.writeString requires Java 11 or later.

    import javax.mail.*;
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.Properties;
    
    public class MessageDownloader {
        private File destDir;
    
        public MessageDownloader(File destDir){
            this.destDir = destDir;
        }
    
        public void download(Part message, String basename) throws MessagingException, IOException {
            System.out.println("Type : " + message.getContentType());
            if(message.isMimeType("text/plain")) {
                downloadTextPart((String) message.getContent(), basename + ".txt");
            }else if(message.isMimeType("text/html")) {
                downloadTextPart((String) message.getContent(), basename + ".html");
            }else if(message.isMimeType("image/*") || Part.ATTACHMENT.equalsIgnoreCase(message.getDisposition())){
                downloadDataPart(message, basename);
            }else if(message.isMimeType("multipart/*")){
                downloadMultiPart((Multipart) message.getContent(), basename);
            }else{
                System.out.println("Unrecognized type");
            }
        }
    
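        private void downloadDataPart(Part dataPart, String basename) throws IOException, MessagingException {
            // getFileName() may be null and is chosen by the sender, so keep only its last path segment
            String name = dataPart.getFileName() == null ? "unnamed" : new File(dataPart.getFileName()).getName();
            File dataFile = new File(destDir, basename + "_" + name);
            // Overwrite a file left by a previous run instead of failing with FileAlreadyExistsException
            Files.copy(dataPart.getInputStream(), dataFile.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
        }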
        private void downloadTextPart(String textContent, String filename) throws MessagingException, IOException{
            File textFile = new File(destDir, filename);
            Files.writeString(textFile.toPath(), textContent);
        }
    
        private void downloadMultiPart(Multipart multiPartMessage, String basename) throws MessagingException, IOException {
            for(int partIdx = 0; partIdx < multiPartMessage.getCount(); partIdx++){
                BodyPart part = multiPartMessage.getBodyPart(partIdx);
                download(part, String.format("%s_%d", basename, partIdx));
            }
        }
    
        public static void main(String[] args) throws MessagingException, IOException {
            Store store = getStore();
    
            Folder folder = store.getFolder("Inbox");
            folder.open(Folder.READ_ONLY);
    
            MessageDownloader msgDownloader = new MessageDownloader(new File("/tmp/messages"));
    
            // Message numbers start at 1, so the last message is number getMessageCount()
            Message lastMessage = folder.getMessage(folder.getMessageCount());
            msgDownloader.download(lastMessage, "last_message");
    
            folder.close();
            store.close();
        }
    
        private static Store getStore() throws MessagingException {
            Properties props = new Properties();
            // The store uses the "imaps" protocol, so the mail.smtp.* properties do not apply here
            props.setProperty("mail.imaps.ssl.enable", "true");
            Session session = Session.getInstance(props, null);
            Store store = session.getStore("imaps");
            store.connect("imap.gmail.com", "account@gmail.com","password");
            return store;
        }
    }