java amazon-s3 apache-tika

How to determine appropriate file extension from MIME Type in Java


I am uploading files to an Amazon S3 bucket and have access to the InputStream and a String containing the MIME type of the file, but not the original file name. It's up to me to create the file name and extension before pushing the file up to S3. Is there a library or convenient way to determine the appropriate extension to use from the MIME type?

I've seen some references to the Apache Tika library, but that seems like overkill and I haven't been able to get it to successfully detect file extensions yet. From what I've been able to gather, this code should work, but I'm just getting an empty string when my type variable is "image/jpeg":

    MimeType mimeType = null;
    try {
        mimeType = new MimeTypes().forName(type);
    } catch (MimeTypeException e) {
        Logger.error("Couldn't Detect Mime Type for type: " + type, e);
    }

    if (mimeType != null) {
        String extension = mimeType.getExtension();
        //do something with the extension
    }

Solution

  • As some of the commenters have pointed out, there is no universal 1:1 mapping between mimetypes and file extensions. Some mimetypes have more than one possible extension, many extensions are shared by multiple mimetypes, and some mimetypes have no extension.

    Wherever possible, you're much better off storing the mimetype and using that going forward, and forgetting about the extension.

    That said, if you do want to get the most common file extension for a given mimetype, then Tika is a good way to go. Apache Tika has a very large set of mimetypes it knows about, and for many of these it also knows mime magic for detection, common extensions, descriptions etc.

    If you want to get the most common extension for a JPEG file, then as shown in this Apache Tika unit test you just need to do something like:

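      // Needs: import org.apache.tika.mime.MimeType;
      //        import org.apache.tika.mime.MimeTypes;
      // forName throws MimeTypeException if the name is not a valid media type
      MimeTypes allTypes = MimeTypes.getDefaultMimeTypes(); // loads the bundled definitions
      MimeType jpeg = allTypes.forName("image/jpeg");
      String jpegExt = jpeg.getExtension(); // .jpg
      assertEquals(".jpg", jpeg.getExtension());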
    

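    The key thing is that you need to load the XML file that's bundled in the Tika jar to get the definitions of all the mimetypes. A bare new MimeTypes(), as in the question, starts without those definitions, which is why it returns an empty extension. If you might be dealing with custom mimetypes too, Tika supports those as well; replace the first line with: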

      TikaConfig config = TikaConfig.getDefaultConfig();
      MimeTypes allTypes = config.getMimeRepository();
    

    By using the TikaConfig method to get the MimeTypes, Tika will also check your classpath for custom mimetype definitions, and include those too.
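
If Tika really is more than you need and you only ever receive a handful of known upload types, one lightweight alternative is a small hand-maintained lookup table. The following is only a sketch of that idea and does not use Tika's API: the class name MimeExtensions, the method extensionFor, and the entries in the table are all assumptions made for illustration. It picks one preferred extension per type and returns an empty string for anything it doesn't know, mirroring the leading-dot convention Tika uses.

```java
import java.util.Locale;
import java.util.Map;

class MimeExtensions {
    // Hypothetical, hand-picked table: one preferred extension per MIME type.
    // Extend it with whatever types your uploads actually use.
    private static final Map<String, String> PREFERRED = Map.of(
        "image/jpeg", ".jpg",
        "image/png", ".png",
        "image/gif", ".gif",
        "application/pdf", ".pdf",
        "text/plain", ".txt"
    );

    /** Returns the preferred extension (with leading dot), or "" if the type is unknown. */
    static String extensionFor(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        // Drop parameters such as "; charset=utf-8" and normalise case before the lookup.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return PREFERRED.getOrDefault(base, "");
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("image/jpeg"));                 // .jpg
        System.out.println(extensionFor("Text/Plain; charset=utf-8"));  // .txt
        System.out.println(extensionFor("application/x-unknown").isEmpty()); // true
    }
}
```

The trade-off is that you have to maintain the table yourself, so this only makes sense when the set of accepted types is small and fixed. For anything open-ended, the Tika registry shown above is likely the better choice.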