pdf upload byte docx magic-numbers

PDF and DOCX Magic Numbers


I read the first byte to tell file types apart, but both PDF and DOCX have a "0x50" magic number. How do I handle this case?


Solution

  • PDF files don't have a single "magic" byte that they start with. According to the PDF specification they must start with "%PDF", but in practice many PDF files do not.

    1. Looking only for a %PDF header to identify PDF files is highly unreliable. A valid PDF file is one you can actually parse (one that at least has a trailer, a cross-reference table and so forth).

    2. It was once suggested that PDF files contain binary data before the %PDF header so that they would be treated as binary files. As a result, PDF readers at one point started accepting a certain number of arbitrary bytes before the %PDF header. Such files cannot be detected by a simple magic number or a fixed sequence of magic bytes.
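
The points above can be turned into a rough sketch of a content-based check. This is only an illustration under stated assumptions, not a complete validator: the function name `sniff_type`, the 1024-byte search window for the header, and the 2048-byte tail window are all choices made for this example. Treating a ZIP archive as DOCX when it contains `[Content_Types].xml` and `word/document.xml` is a common heuristic, not a guarantee, and a real PDF check would go on to parse the cross-reference table and trailer.

```python
import io
import zipfile

PDF_MARKER = b"%PDF-"
ZIP_SIGNATURE = b"PK\x03\x04"   # DOCX is a ZIP container, so it starts with "PK"
HEADER_WINDOW = 1024            # assumed tolerance for junk bytes before %PDF-
TAIL_WINDOW = 2048              # assumed region in which to look for the trailer


def sniff_type(data: bytes) -> str:
    """Return 'pdf', 'docx', 'zip' or 'unknown' for the given file contents."""
    if data.startswith(ZIP_SIGNATURE):
        # The signature only proves "ZIP-like"; open the archive to learn more.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        # Heuristic: these two members are typical of a Word document package.
        if "[Content_Types].xml" in names and "word/document.xml" in names:
            return "docx"
        return "zip"

    # Search a window instead of offset 0, since some PDFs carry leading bytes.
    if PDF_MARKER in data[:HEADER_WINDOW]:
        tail = data[-TAIL_WINDOW:]
        # A header alone is weak evidence; also require end-of-file structure.
        if b"startxref" in tail and b"%%EOF" in tail:
            return "pdf"

    return "unknown"
```

For an uploaded file this could be called as `sniff_type(uploaded_bytes)`, where `uploaded_bytes` is a hypothetical variable holding the raw upload. A result of "pdf" here still only means "looks like a PDF"; handing the file to an actual PDF parser remains the reliable test.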