There are some existing posts out there that talk about "how to detect if a document is password protected".
This is probably the most comprehensive of these links for MS Office docs: Detecting a password-protected document (The code is written in C#).
I am working in a Java application and want to detect whether a PDF, XLS, XLSX, DOC, DOCX or ZIP file is password protected.
So I immediately reached for Apache Tika.
I cannot seem to find a way to detect if a document is password protected while guaranteeing that it does not parse the entire document and does not at any point load the entire document into memory.
What I'm thinking is to set up a content handler (I have an example here: https://github.com/nddipiazza/tika-fork/blob/master/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaBodyContentHandler.java) that stops parsing after 64K or so.
Is there an easier way?
Solution: I used the Tika API to parse the document with writeLimit=1000 chars, or some similarly small value, to get a small sample of the content. Getting a sample back assures you that the file is not encrypted, and at the same time you did not scan the entire file.
Depending on the Tika parser that was used, this typically won't load the entire file into memory, because Tika operates on streams rather than reading all of the bytes into memory.
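
To illustrate how a write limit cuts a parse short, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. The names CappedHandlerDemo, CappedHandler, LimitReachedException and sample are hypothetical and chosen for this example. The handler keeps at most `limit` characters and then throws to abort the parse, which is the same general idea as a content handler that stops after a fixed amount of text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CappedHandlerDemo {

    /** Thrown to abort parsing once the cap is exceeded (hypothetical name). */
    static class LimitReachedException extends SAXException {
        LimitReachedException(int limit) {
            super("Character limit of " + limit + " reached");
        }
    }

    /** Collects at most `limit` characters of text, then aborts the parse. */
    static class CappedHandler extends DefaultHandler {
        private final int limit;
        private final StringBuilder text = new StringBuilder();

        CappedHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                // Keep only what still fits, then stop the parser.
                text.append(ch, start, room);
                throw new LimitReachedException(limit);
            }
            text.append(ch, start, length);
        }

        String sample() {
            return text.toString();
        }
    }

    /** Parses the XML string and returns at most `limit` characters of its text. */
    static String sample(String xml, int limit) {
        CappedHandler handler = new CappedHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (LimitReachedException e) {
            // Expected: the cap was hit, so the rest of the input is never read.
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return handler.sample();
    }

    public static void main(String[] args) {
        System.out.println(sample("<doc><p>hello world</p></doc>", 5));
    }
}
```

With Tika, BodyContentHandler already accepts a write limit in its constructor, so a custom handler like this is only needed for different behaviour. Either kind of handler is passed as the ContentHandler argument to the parser's parse method. Tika signals a protected file with EncryptedDocumentException, although whether a given format's parser raises it before reading much of the file is something to verify per format.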