ruby-on-railsrubyziproo

Extracting excel data after unzipping using rubyzip


I'm trying to get spreadsheet data from zipped .xlsx files. I'm using rubyzip to access the contents of the zipfile

Zip::File.open(file_path) do |zip_file|
   zip_file.each do |entry|
     *process entry*
   end
end

My problem is that rubyzip gives a Zip::Entry object, which, I cant get to work with gems like roo or creek.

I've done something similar, but with .csv file. This was as simple as CSV.parse(entry.get_input_stream.read). However, that just gives me a string of encoded gibberish when using it on an .xlsx file.

I've looked around and the closest answer I got was temporarily extracting the files, but I want to avoid doing this since the files can get pretty large.

Does anyone have any suggestions? Thanks in advance.


Solution

  • So what you need to do is convert the stream into an IO object that Roo can understand.

    To determine if the object passed to Roo::Spreadsheet.open is a "stream" Roo uses the following method:

     def is_stream?(filename_or_stream)
        filename_or_stream.respond_to?(:seek)
      end
    

    Since a Zip::InputStream does not respond to seek you cannot use this object directly. To get around this we simply need an object that does respond to seek (like a StringIO)

    We can just read the input stream into the StringIO directly:

    stream = StringIO.new(entry.get_input_stream.read)
    

    Or the Zip library also provides a method to copy a Zip::InputStream to another IO object through the IOExtras module, which I think reads fairly nicely as well.

    Knowing all of the above we can implement as follows:

    Zip::File.open(file_path) do |zip_file|
       zip_file.each do |entry|
         # make sure Roo can handle the file (at least based on the extension)
         ext = File.extname(entry.name)&.to_sym 
         next unless Roo::CLASS_FOR_EXTENSION[ext]
         # stream = StringIO.new(entry.get_input_stream.read)
         ::Zip::IOExtras.copy_stream(stream = StringIO.new, entry.get_input_stream)
         spreadsheet = Roo::Spreadsheet.open(stream, extension: ext)
         # process file
       end
    end