ruby, rubyzip, ruby-2.3

Find whether a zipped file is text or binary without unzipping it


I'm creating a Ruby script that goes through several zip files and validates the content of any XML files within. To optimise my script, I'm using the ruby-zip gem to open the zip files without extracting them.

My initial thought was to use filemagic to determine the MIME-type of the files, but the filemagic gem takes a file path and all I have are these Entry and InputStream classes which are unique to ruby-zip.

Is there a good way to determine the file type without extracting? Ultimately I need to identify XML files, but I can get away with identifying plain-text files and using a regex to look for the


Solution

  • the filemagic gem takes a file path

    The filemagic gem's file method takes a file path, but file isn't the only method it has. A glance at the docs reveals it has an io method, too.

    all I have are these Entry and InputStream classes which are unique to ruby-zip

    I wouldn't say InputStream is "unique to ruby-zip." From the docs (emphasis mine):

    A InputStream inherits IOExtras::AbstractInputStream in order to provide an IO-like interface for reading from a single zip entry

    So FileMagic has an io method and Zip::InputStream is IO-like. That leads us to a pretty straightforward solution:

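    require 'filemagic'
    require 'zip'

    Zip::InputStream.open('/path/to/file.zip') do |io|
      # Only the first entry is inspected here; call get_next_entry
      # again to move on to later entries.
      entry = io.get_next_entry

      # The :mime flag makes FileMagic report MIME types.
      FileMagic.open(:mime) do |fm|
        p fm.io(entry.get_input_stream)
      end
    end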
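
The question also mentions a fallback: treat an entry as plain text and use a regex to spot XML. Below is a minimal sketch of that idea using only the standard library. It is not part of the answer above, and `sniff_stream` is a hypothetical helper, not a rubyzip or filemagic method. It assumes the marker to look for is the `<?xml` declaration and that control bytes in the first 512 bytes indicate binary data; both are rough heuristics. `StringIO` stands in for any IO-like object that responds to `read`.

```ruby
require 'stringio'

# Hypothetical helper (not from rubyzip or filemagic).
# Reads at most `length` bytes from any object that responds to #read
# and guesses whether the content is XML, other text, or binary.
def sniff_stream(io, length = 512)
  head = io.read(length)
  return :empty if head.nil? || head.empty?

  # Work on raw bytes so that invalid encodings cannot raise.
  head = head.dup.force_encoding(Encoding::BINARY)

  # Skip a UTF-8 byte order mark if one is present.
  head = head.byteslice(3..-1) if head.start_with?("\xEF\xBB\xBF".b)

  # Assumption: control bytes other than tab, LF, FF and CR mean binary.
  # UTF-16 text contains NUL bytes and would be misreported as binary.
  allowed = [9, 10, 12, 13]
  return :binary if head.each_byte.any? { |b| b < 0x20 && !allowed.include?(b) }

  # Assumption: XML is recognised only by its declaration, so XML files
  # that omit the declaration are reported as :text.
  head =~ /\A\s*<\?xml\b/ ? :xml : :text
end

puts sniff_stream(StringIO.new("<?xml version=\"1.0\"?>\n<root/>"))
puts sniff_stream(StringIO.new("just some notes\n"))
puts sniff_stream(StringIO.new("PK\x03\x04\x00\x00".b))
```

Because the helper only calls `read`, the stream passed to `fm.io` in the snippet above could presumably be passed to it instead. That has not been verified here against rubyzip.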