I'm currently putting together some code to extract a variety of files that are embedded in a Word document using Python, but I'm having particular trouble figuring out how to restore an embedded Outlook .msg file back to its original (usable) .msg form after extracting it as an oleObject.bin file. Does anyone have an idea how to do this?
It's pretty straightforward to restore PDF files, and the zipfile library has built-in tools to deal with zip files in .bin form, but I'm really scratching my head on these .msg files. I can't find a way to carve out the original file from all the added binary data. Any help or thoughts on this would be appreciated!
I essentially want to do the same thing as this question but for .msg files instead of PDFs: How can I decode a .bin into a .pdf
Edit: This is the error I get when I try to just rename the file extension of the .bin to .msg
OLE objects, if correctly embedded (not linked), are simply the same as their source, so you can open them in their application and save them from that application. Thus the text will save from Notepad. The Zip will not need a save, as it's treated as a folder and simply needs a MOVE from its temporary location. And a MSG will be saveable from Outlook, if you trust it enough to open.
If you don't have Outlook, it can open in Notepad too (but will only be salvageable as plain text, and as RTF if included). Here we see the Fax Sample entry from Me to You with complimentary message Hello World!
If we save the RTF we can see the RTF body content in WordPad (and thus auto-print to PDF using Write /PT ....)
If you want to pull all the bins, use TAR -xf to unpack the .docx and look in
hello - docx.zip\word\embeddings
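Since the question is about Python, the same unpacking step can be sketched with the standard zipfile module instead of TAR. This is only a minimal sketch: the function name extract_embeddings and the output folder are made up for illustration, and it assumes the embedded objects sit under word/embeddings/ as shown above.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file stored under word/embeddings/ out of a .docx.

    A .docx is a zip archive, so zipfile can read it directly.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything outside the embeddings folder.
            if info.is_dir() or not info.filename.startswith("word/embeddings/"):
                continue
            target = out_dir / Path(info.filename).name
            target.write_bytes(zf.read(info))
            written.append(target)
    return written
```

Calling extract_embeddings("hello.docx", "bins") would then leave the oleObject*.bin files in a bins folder, ready for inspection.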
These will include (as you observed from the other question) headers and trailers. Of course you will not know which is which without looking inside and removing the header/trailer, but a Zip will start with PK
A .MSG will start with the DOC signature
The start of a MSG file will be marked with ÐÏ à
which in hex is D0 CF 11 E0 (the full 8-byte signature is D0 CF 11 E0 A1 B1 1A E1),
i.e. it's a "DocFile" (an OLE compound file).
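The signature check can be sketched in Python roughly as follows. The helper names (sniff, find_all, carve_from_signature) are hypothetical. It also assumes the embedded message sits in the .bin as one contiguous run of bytes, which is not guaranteed, and which occurrence of the signature belongs to the message is something to check per file.

```python
# Leading bytes discussed above: "PK" for a zip, D0 CF 11 E0 ... for a DocFile.
ZIP_MAGIC = b"PK\x03\x04"
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff(data):
    """Guess the container type from the first bytes of a blob."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(CFB_MAGIC):
        return "docfile"
    return "unknown"


def find_all(data, magic=CFB_MAGIC):
    """Return every offset at which `magic` occurs in `data`."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets


def carve_from_signature(data, index=0, magic=CFB_MAGIC):
    """Return the bytes from the index-th signature (0-based) to the end.

    Returns None if there is no such occurrence. Any trailer after the
    embedded file is NOT removed here; that still has to be trimmed.
    """
    offsets = find_all(data, magic)
    if index >= len(offsets):
        return None
    return data[offsets[index]:]
```

A reasonable first step is to print find_all(data) for a .bin and see how many signatures turn up. If there is more than one, the first is probably the wrapper and a later one the message, but that is a guess to verify in a hex viewer.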
The end of a MSG has 16-bit FEFF FFFF ... padding, so it ends with something like
þÿÿÿýÿÿÿÿÿÿÿÿ ...lots more ÿÿ ... ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
The bin has more data, so the end of that block is dirty with a 16-bit filename and path
ÿÿÿÿÿÿÿÿT C : \ U s e r s \ n a m e \ A p p D a t a \ L o c a l \ T e m p \ { A 0 9 5 A 1 6 4 - 2 B 3 6 - 4 9 0 5 - A 2 9 4 - E 5 B C C B 9 5 B 9 B 5 } \ H e l l o ( 2 ) . m s g H e l l o . m s g C : \ U s e r s \ n a m e \ D o c u m e n t s \ H e l l o . m s g
I'm unsure whether the T is significant in some cases or just buffer debris, so you need to check.
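To see where that trailer starts, one option is to scan the .bin for runs of 16-bit text, much like a "strings" tool would. The sketch below assumes the filename and paths are stored as UTF-16LE with printable ASCII characters only. The name utf16_strings and the minimum run length of 4 are arbitrary choices for illustration.

```python
import re

# A run of at least four printable ASCII characters, each followed by a
# zero byte, i.e. plain ASCII text stored as UTF-16LE.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def utf16_strings(data, start=0):
    """Return (offset, text) pairs for each UTF-16LE run found in `data`.

    `start` lets the scan begin partway through, for example just after
    the long block of FF padding.
    """
    return [
        (m.start(), m.group().decode("utf-16-le"))
        for m in UTF16_RUN.finditer(data, start)
    ]
```

The offset of the first path-like hit (something beginning C:\ and ending .msg) gives a rough upper bound for where the message data ends. Whether a stray leading character such as the T belongs to the trailer still needs checking by eye.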