java, google-cloud-storage, apache-tika

File uploaded to Google Cloud Storage is corrupted only when using Tika to detect the Content-Type


I am using the Google Cloud Storage Java SDK (v2.20.1) to upload files to my bucket. I want to set the Content-Type of each file, and I am using Apache Tika to detect it. The issue is that if I use the Content-Type returned by Tika, even though it is correct, the uploaded file is corrupted and I cannot view it. If I manually set the Content-Type to the same value that Tika returned, the file uploads and I can view it without issue.

This code does not work. I have verified that the content type exactly matches application/pdf, but the file is corrupted on upload and I cannot view it.

Tika tika = new Tika();
String contentType = tika.detect(inputStream);
System.out.println(contentType); //"application/pdf"

if("application/pdf".equals(contentType)) {
     return bucket.create(Utilities.formatDirectoryName(directory) + name, inputStream, contentType);
} else {
     System.out.println("INVALID TYPE");
     return null;
}

This code does work, with the Content-Type set manually. The file uploads and I can view it without issue.

String contentType = "application/pdf";
System.out.println(contentType); //"application/pdf"
if("application/pdf".equals(contentType)) {
     return bucket.create(Utilities.formatDirectoryName(directory) + name, inputStream, contentType);
} else {
     System.out.println("INVALID TYPE");
     return null;
}

When I view the object in the Cloud Storage UI, everything (Content-Type, size, etc.) shows correctly for both of the methods listed above. The difference appears when I download the file to view it: one copy is corrupted and the other opens correctly.


I have run this test multiple times to make sure it wasn't just a one-off upload glitch, but it's consistent every time. I have also tried it with other file types, such as PowerPoint files, with the same result: using Tika corrupts the upload, and setting the Content-Type manually does not. This is driving me crazy, please help!


Solution

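  • It turns out that Tika's detect reads from the InputStream and leaves it at a different position, so once I run detect I can't reuse that same InputStream for the upload. The upload then starts partway into the file, which is why the result is corrupted.

    So instead I read the InputStream into a byte[] first, and then use that array both for detecting the type and for saving the file: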

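    // Buffer the whole stream once so the same bytes serve detection and upload
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    inputStream.transferTo(baos);

    byte[] byteData = baos.toByteArray();
    Tika tika = new Tika();
    String contentType = tika.detect(byteData);

    // Upload the buffered bytes rather than the already-read InputStream
    return bucket.create(Utilities.formatDirectoryName(directory) + name, byteData, contentType);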
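
To make the cause easier to see, here is a minimal JDK-only sketch of the same effect. It does not use Tika or the Cloud Storage SDK: sniffPrefix is a hypothetical stand-in for a content detector that peeks at the first few bytes, and nonMarkable is a hypothetical helper that simulates a stream without mark/reset support (an assumption about what the original InputStream looked like). StreamReuseDemo is likewise just an illustrative name.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamReuseDemo {

    // Hypothetical stand-in for a detector: peeks at up to 4 bytes.
    // It rewinds only if the stream supports mark/reset; otherwise the bytes stay consumed.
    static String sniffPrefix(InputStream in) {
        try {
            boolean canRewind = in.markSupported();
            if (canRewind) {
                in.mark(4);
            }
            byte[] prefix = new byte[4];
            int n = in.readNBytes(prefix, 0, 4);
            if (canRewind) {
                in.reset();
            }
            return new String(prefix, 0, n, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: a stream that reports no mark/reset support.
    static InputStream nonMarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readLimit) { /* ignored */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    // Reads whatever is left in the stream.
    static byte[] readAll(InputStream in) {
        try {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "%PDF-1.7 body".getBytes(StandardCharsets.US_ASCII);

        // Case 1: sniff, then reuse the same stream. The first 4 bytes are lost.
        InputStream direct = nonMarkable(original);
        sniffPrefix(direct);
        System.out.println("reused stream:   " + readAll(direct).length + " bytes");

        // Case 2: buffer to byte[] first, then sniff a fresh view of the array.
        byte[] buffered = readAll(nonMarkable(original));
        sniffPrefix(new ByteArrayInputStream(buffered));
        System.out.println("buffered array:  " + buffered.length + " bytes");

        // Case 3: wrap in BufferedInputStream so mark/reset works.
        InputStream wrapped = new BufferedInputStream(nonMarkable(original));
        sniffPrefix(wrapped);
        System.out.println("buffered stream: " + readAll(wrapped).length + " bytes");
    }
}
```

The byte[] approach from the answer corresponds to case 2. Case 3 suggests that wrapping the stream in a BufferedInputStream before detection might also work, since the Tika Javadoc, as I understand it, says detect marks and resets streams that support it. I have not verified that against the setup in the question, so treat it as something to test rather than a confirmed fix.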