python ruby json email pst

Convert Outlook PST to JSON using libpst


I have an Outlook PST file, and I'd like to get the emails as JSON, e.g. something like

{"emails": [
{"from": "alice@example.com",
 "to": "bob@example.com",
 "bcc": "eve@example.com",
 "subject": "mitm",
 "content": "be careful!"
}, ...]}

I've thought of using readpst to convert to MH format and then scanning the result with a Ruby/Python/Bash script. Is there a better way?

Unfortunately, the ruby-msg gem doesn't work on my PST files (and it looks like it hasn't been updated since 2014).


Solution

  • I found a way to do it in two stages, first converting the PST to mbox and then the mbox to JSON:

    # requires installing libpst
    pst2json my.pst
    # or you can specify a custom output dir and an outlook mail folder,
    # e.g. Inbox, Sent, etc.
    pst2json -o email/ -f Inbox my.pst
    

    Here pst2json is my own script, and mbox2json is a slightly modified version of the script from Mining the Social Web.

    pst2json:

    #!/usr/bin/env bash
    
    usage(){
        echo "usage: $(basename "$0") [-o <output-dir>] [-f <folder>] <pst-file>"
        echo "default output-dir: email/mbox-all/<pst-file>"
        echo "default folder: Inbox"
        exit 1
    }
    
    # Check for readpst without printing its path; report the error on stderr
    command -v readpst >/dev/null || { echo "Error: libpst not installed" >&2; exit 1; }
    folder=Inbox
    
    while (( $# > 0 )); do
        [[ -n "$pst_file" ]] && usage
        case "$1" in
            -o)
                if [[ -n "$2" ]]; then
                    out_dir="$2"
                    shift 2
                else
                    usage
                fi
                ;;
            -f)
                if [[ -n "$2" ]]; then
                    folder="$2"
                    shift 2
                else
                    usage
                fi
                ;;
            *)
                pst_file="$1"
                shift
        esac
    done
    
    # A PST file is mandatory; show usage if none was given
    [[ -n "$pst_file" ]] || usage
    default_out_dir="email/mbox-all/$(basename "$pst_file")"
    out_dir=${out_dir:-"$default_out_dir"}
    mkdir -p "$out_dir"
    readpst -o "$out_dir" "$pst_file"
    [[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
    res="$out_dir"/"$folder".json
    mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"
    

    mbox2json (Python 2.7, requires BeautifulSoup 3):

    # -*- coding: utf-8 -*-
    
    import sys
    import mailbox
    import email
    import quopri
    import json
    from BeautifulSoup import BeautifulSoup
    
    MBOX = sys.argv[1]
    OUT_FILE = sys.argv[2]
    SKIP_HTML=True
    
    def cleanContent(msg):
    
        # Decode message from "quoted printable" format
    
        msg = quopri.decodestring(msg)
    
        # Strip out HTML tags, if any are present
    
        soup = BeautifulSoup(msg)
        return ''.join(soup.findAll(text=True))
    
    
    def jsonifyMessage(msg):
        json_msg = {'parts': []}
        for (k, v) in msg.items():
            json_msg[k] = v.decode('utf-8', 'ignore')
    
        # The To, CC, and Bcc fields, if present, could have multiple items
        # Note that not all of these fields are necessarily defined
    
        for k in ['To', 'Cc', 'Bcc']:
            if not json_msg.get(k):
                continue
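            # json_msg[k] is already unicode here, so no further decode is needed
            # (decoding it again can raise UnicodeEncodeError on non-ASCII headers)
            json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '') \
                    .replace('\r', '').replace(' ', '').split(',')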
    
        try:
            for part in msg.walk():
                json_part = {}
                if part.get_content_maintype() == 'multipart':
                    continue
                content_type = part.get_content_type()  # avoid shadowing the builtin `type`
                if SKIP_HTML and content_type == 'text/html':
                    continue
                json_part['contentType'] = content_type
                content = part.get_payload(decode=False).decode('utf-8', 'ignore')
                json_part['content'] = cleanContent(content)
    
                json_msg['parts'].append(json_part)
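        except Exception as e:
            sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
        # Return whatever was collected; a `return` inside `finally` would also
        # swallow KeyboardInterrupt and SystemExit
        return json_msg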
    
    # There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
    # Using a generator requires a trivial custom encoder be passed to json for serialization of objects
    class Encoder(json.JSONEncoder):
        def default(self, o):
            return {'emails': list(o)}
    
    
    # The generator itself...
    def gen_json_msgs(mb):
        while 1:
            msg = mb.next()
            if msg is None:
                break
            yield jsonifyMessage(msg)
    
    mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
    json.dump(gen_json_msgs(mbox),open(OUT_FILE, 'wb'), indent=4, cls=Encoder)
    

    The resulting file is now easy to process. For example, to get just the contents of the emails:

    jq '.emails[] | .parts[] | .content' < out/Inbox.json
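
The mbox2json script above targets Python 2.7. On Python 3, the second stage (mbox to JSON) can probably be done with the standard library alone. The following is only a rough sketch, not the script from the answer: the names `message_to_dict`, `mbox_to_json` and `SAMPLE_MBOX` are made up for illustration, it keeps only `text/*` parts (the original keeps every non-multipart part), and it does not strip HTML tags the way the BeautifulSoup step does.

```python
import json
import mailbox
import os
import tempfile
from email.utils import getaddresses

# Hypothetical one-message mbox, used only for the demo at the bottom.
SAMPLE_MBOX = (
    "From alice@example.com Thu Jan  1 00:00:00 2015\n"
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: mitm\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "be careful=21\n"
)


def _decode_part(part):
    """Return a text part's payload as str, undoing its transfer encoding."""
    raw = part.get_payload(decode=True)  # handles quoted-printable and base64
    if raw is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # unknown charset label in the message
        return raw.decode("utf-8", errors="replace")


def message_to_dict(msg, skip_html=True):
    """Flatten one message into a JSON-serialisable dict."""
    # Repeated headers (e.g. Received) keep only their last value.
    out = {key: str(value) for key, value in msg.items()}
    # Address fields may hold several recipients; store them as lists.
    for field in ("To", "Cc", "Bcc"):
        values = msg.get_all(field)
        if values:
            pairs = getaddresses([str(v) for v in values])
            out[field] = [addr for _name, addr in pairs]
    out["parts"] = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and binary attachments
        content_type = part.get_content_type()
        if skip_html and content_type == "text/html":
            continue
        out["parts"].append(
            {"contentType": content_type, "content": _decode_part(part)}
        )
    return out


def mbox_to_json(mbox_path, out_path, skip_html=True):
    """Write {"emails": [...]} for every message in mbox_path; return the count."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        emails = [message_to_dict(m, skip_html) for m in box]
    finally:
        box.close()
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"emails": emails}, fh, indent=4, ensure_ascii=False)
    return len(emails)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = os.path.join(tmp, "Inbox")
        json_path = os.path.join(tmp, "Inbox.json")
        with open(mbox_path, "w", encoding="utf-8", newline="\n") as fh:
            fh.write(SAMPLE_MBOX)
        mbox_to_json(mbox_path, json_path)
        with open(json_path, encoding="utf-8") as fh:
            data = json.load(fh)
        # Same selection as: jq '.emails[] | .parts[] | .content'
        for mail in data["emails"]:
            for part in mail["parts"]:
                print(part["content"])
```

Unlike the generator-based version, this sketch builds the whole list in memory before dumping it, which is simpler but may not suit very large mailboxes.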