I'm running the stock Apache Tika 1.24.1 Server (tika-server-1.24.1.jar). My ASP.NET MVC web app sends documents to Tika and gets the parsed text back using this VB.NET code:
httpWebRequest = CType(WebRequest.Create("http://localhost:9998/tika"), HttpWebRequest)
httpWebRequest.Method = "PUT"
httpWebRequest.Accept = "text/plain"
httpWebRequest.UseDefaultCredentials = True
' Write the file bytes, then close the request stream before asking for the response
Using requestStream As Stream = httpWebRequest.GetRequestStream()
    requestStream.Write(fileContents, 0, fileContents.Length)
End Using
httpWebResponse = CType(httpWebRequest.GetResponse(), HttpWebResponse)
Using contentResponseStream As New StreamReader(httpWebResponse.GetResponseStream())
    tikaTextContents = contentResponseStream.ReadToEnd()
End Using
That part works (the parsed text is returned).
However, when the Tika server parses certain PDF files, it inserts extra spaces in some places. This Tika ticket mentions a potential solution (setEnableAutoSpace): https://issues.apache.org/jira/browse/TIKA-724
My question: Is there any way to control setEnableAutoSpace through the Tika web interface (or possibly to set it per request when you parse the file)? Or is the only option to tinker with the Java code if you want to change this option?
Thanks!
In order to set any of the options from PDFParserConfig when making a request to the Tika Server, you need to send an HTTP header that is prefixed with X-Tika-PDF, followed by the name of the setting you want to control.
So, to turn on the enableAutoSpace option when making a request, you should send the header
X-Tika-PDFenableAutoSpace: true
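Applied to the VB.NET code from the question, that could look roughly like the sketch below. It assumes the same local server URL as in the question. The module name TikaClient and the helper ExtractText are made up for illustration. The only Tika-specific part is the extra header line.

```vbnet
Imports System.IO
Imports System.Net

Module TikaClient
    ' Hypothetical helper: PUTs a file to Tika Server and returns the extracted text.
    ' The URL is the one from the question; adjust it for your setup.
    Function ExtractText(fileContents As Byte(), enableAutoSpace As Boolean) As String
        Dim request As HttpWebRequest =
            CType(WebRequest.Create("http://localhost:9998/tika"), HttpWebRequest)
        request.Method = "PUT"
        request.Accept = "text/plain"
        request.UseDefaultCredentials = True

        ' PDFParserConfig option passed as a request header: X-Tika-PDF + setting name.
        ' Boolean.ToString() yields "True"/"False", so lower-case it to send "true"/"false".
        request.Headers.Add("X-Tika-PDFenableAutoSpace",
                            enableAutoSpace.ToString().ToLowerInvariant())

        ' Send the document bytes as the request body.
        Using requestStream As Stream = request.GetRequestStream()
            requestStream.Write(fileContents, 0, fileContents.Length)
        End Using

        ' Read the plain-text response returned by Tika.
        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

Calling ExtractText(fileContents, True) sends the header with the value true. Passing False sends false instead, which may be worth trying if the extra spaces appear with the option enabled.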
If changing that option only partly fixes your PDF text problem, you should have a look at the Tika Troubleshooting PDFs wiki page for next steps. Depending on the software used to generate them, and the options picked, PDFs can be hard...