Tags: groovy, format, mime-types, identify, bytestream

In Java: how can I identify the file extension/file type of a file stored as binary data (BYTEA) with mime_type 'application/octet-stream' and save it to disk?


From a database migration, we have a data dump in a PostgreSQL database. The task is to write a script in Java or Groovy that reads out the files in the correct format and saves them to the server. For some files, the mime_type is specified (e.g. application/pdf, image/png) in a column 'mime_type'. In these cases, I was able to save them in the correct format (read them in as a byte array input stream, save them as a file in the corresponding format).

But 90% of the files have mime_type "application/octet-stream". In these cases, I suspect the type or format of the file is simply unknown. To save them in the correct format (e.g. pdf or png), I would somehow need to find out what type each file is, and then save it accordingly.

I already tried the URLConnection.guessContentTypeFromStream(inputStream) method, but this did not recognise the mime_type.

        GroovyRowResult[] fileInBinary = sql.rows("""
        SELECT * FROM table_name WHERE id = 123456
        """)

        // collect the BYTEA column of every returned row
        def temp = []

        fileInBinary.each { GroovyRowResult row ->
            temp << row.data
        }

        // ByteArrayInputStream has no (byte[], int) constructor, so pass only the byte array
        InputStream byteArrayInputStream = new ByteArrayInputStream(temp.flatten() as byte[])

        def mimeType = java.net.URLConnection.guessContentTypeFromStream(byteArrayInputStream)
        println "guessed mime Type is: " + mimeType

...followed by code to save the file, which works when mime_type specifies the file format, e.g. pdf, word, or png.

This code works, except for the guessContentTypeFromStream part.

The files are stored as binary data (BYTEA, the PostgreSQL equivalent of BLOB).

Does anyone know how to find out which format these application/octet-stream-files originally had, extract them from the binary data-columns (BYTEA-format) in the database and save/read the files to disk, as e.g. pdf, docx, or png?

Many thanks in advance!


Solution

  • If you cannot rely on features of your database, you will have to use a library. The detection features of the JDK are more or less tied to the operating system you are using, so a library will give more stable results.

    Here is an example using Apache Tika:

    tika.detector.detect(TikaInputStream.get(row.data), new Metadata()) returns the MIME type. It can work directly on the data from the result set. If you only need to detect the MIME type of many blobs and don't need the actual data, consider reading only the first few blocks of data, which are enough to detect the type.
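To illustrate why a short prefix is enough, here is a minimal sketch in plain Java that compares the leading bytes against a few well-known file signatures ("magic numbers"). It is not how Tika is implemented and covers only four formats. The class name MagicSniffer and its methods are made up for this illustration. In PostgreSQL, such a prefix could probably be fetched with something like substring(data from 1 for 16) on the BYTEA column instead of loading the whole value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or the JDK.
public class MagicSniffer {

    // Guesses a MIME type from the first bytes of a file.
    // Falls back to application/octet-stream when no signature matches.
    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F', '-'})) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            // ZIP container: docx, xlsx, odt and jar files all start like this,
            // so telling them apart needs a look inside the archive.
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // True if data begins with exactly the bytes in signature.
    public static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead));
        System.out.println(sniff(new byte[] {1, 2, 3}));
    }
}
```

The ZIP case shows the limit of a hand-written table: office formats share one signature, which is one reason a maintained library such as Tika is the more robust choice.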

    Complete example using Tika:

    // fetch the PostgreSQL JDBC driver and Tika core via Grape
    @Grapes([
    @Grab('org.postgresql:postgresql:+'),
    @Grab('org.apache.tika:tika-core:1.25'),
    ])
    // the JDBC driver must be visible to the system class loader
    @GrabConfig(systemClassLoader=true)
    import groovy.sql.Sql
    import org.apache.tika.config.TikaConfig
    import org.apache.tika.io.TikaInputStream
    import org.apache.tika.metadata.Metadata

    def db = Sql.newInstance("jdbc:postgresql://localhost/so65682432", "postgres", "postgres")
    def tika = new TikaConfig()
    db.eachRow("SELECT * FROM test") { row ->
      // row.image is the BYTEA column; detect() returns a MediaType such as image/png
      println tika.detector.detect(TikaInputStream.get(row.image), new Metadata())
    }
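
The question also asks how to save the files with a fitting extension once the type is known. Below is a rough sketch in plain Java, assuming the detected MIME type is available as a string. The class BlobSaver, its small lookup table, and the ".bin" fallback are invented for this illustration and are not part of the answer above. As far as I know, Tika can also supply an extension for a detected type through its MIME repository, which would replace the hand-written table.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper that writes a blob to disk with an extension
// derived from its MIME type.
public class BlobSaver {

    // Deliberately small, hand-written table (an assumption); extend as needed.
    static final Map<String, String> EXTENSIONS = Map.of(
            "application/pdf", ".pdf",
            "image/png", ".png",
            "image/jpeg", ".jpg",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");

    // Unknown types keep a neutral extension instead of a guessed one.
    public static String extensionFor(String mimeType) {
        return EXTENSIONS.getOrDefault(mimeType, ".bin");
    }

    // Writes content to <dir>/<id><extension> and returns the resulting path.
    public static Path save(Path dir, long id, String mimeType, byte[] content) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(id + extensionFor(mimeType));
        return Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("blobs");
        Path written = save(dir, 123456L, "application/pdf", new byte[] {'%', 'P', 'D', 'F'});
        System.out.println(written.getFileName());
    }
}
```

Note that the bytes are written unchanged. The file does not need to be converted; only the file name depends on the detected type.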