ruby-on-rails-3mongodb

Update MongoDB with Array from CSV Join Table


I want to add an embedded Majors array in my MongoDB Colleges collection like so:

{ "_id" : ObjectID("abc123"), 
"code" : 123456, 
"name" : "Stanford University", 

"majors" :["Agriculture", "Business", "Computer Science"... ] }

I have a CSV that, essentially, is a CollegesMajors join table (in the SQL frame of reference). It looks like:

Code-----MajorCode----Level

123456----98765---------2

123456----99999---------2

How do I embed these majors (I can translate MajorCode to it's actual name first) into my MongoDB Colleges collection?

There must be some way that I can (pseudo-code):

db.colleges.update { match the code, update with array }

Maybe I need to turn the CSV into a JSON file first?

Thanks in advance!

UPDATE!

This is the error I get when attempting Tilo's instructions - this is probably related to my lack of familiarity with the rails console. It thinks the CSV.read instruction is loading a nil object.

ruby-1.9.2-head :024 > csvAA = CSV.read( '/Users/administrator/dropbox/schoolninja/1College_Data/2009formatted/college_majors.csv' );
ruby-1.9.2-head :025 >   headersA = csvAA.shift;
ruby-1.9.2-head :026 >   csvAH = csvAA.map {|row| Hash[*headersA.zip(row).flatten] };
ruby-1.9.2-head :027 >   csvAH.each do |rowH|
ruby-1.9.2-head :028 >     c = College.find_by_code( rowH['Code'] )
ruby-1.9.2-head :029?>     c.majors ||= []
ruby-1.9.2-head :030?>   c.majors << ( rowH['title'] )
ruby-1.9.2-head :031?>   c.save
ruby-1.9.2-head :032?>   end
SyntaxError: (irb):24: unknown regexp options - adtratr
(irb):24: syntax error, unexpected tCONSTANT, expecting ')'
...opbox/schoolninja/1College_Data/2009formatted/college_majors...
...                               ^
(irb):24: syntax error, unexpected tIDENTIFIER, expecting $end
...nja/1College_Data/2009formatted/college_majors.csv );
...                               ^
    from /Users/administrator/.rvm/gems/ruby-1.9.2-head/gems/railties-3.0.9/lib/rails/commands/console.rb:44:in `start'
    from /Users/administrator/.rvm/gems/ruby-1.9.2-head/gems/railties-3.0.9/lib/rails/commands/console.rb:8:in `start'
    from /Users/administrator/.rvm/gems/ruby-1.9.2-head/gems/railties-3.0.9/lib/rails/commands.rb:23:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

Solution

  • UPDATE:

    Please check out the new Ruby Gem "smarter_csv" in case you want to update MongoDB from records of CSV-files. It has lots of helpful features, including chunking of large files, and returns the results in chunks as arrays of hashes.

    See also:

    Updated Answer:

    require 'smarter_csv'
    filename = '/tmp/college_majors.csv'
    # The :key_mapping renames one column and ignores the column "Level"
    # Each line of the CSV-file is converted into a hash with keys: :major_code, :code
    n = SmarterCSV.process(filename, {:chunk_size => 10,
                           :key_mapping => {:level => nil} }) do |chunk|
      # We're passing a block in to process each resulting hash / row (block takes array of hashes).
      # If the CSV-file is large, we can process it in parallel in chunks (using Resque Workers).
      # For this we would extract the following block into a Resque worker and instead just create 
      #   a new Resque job for each chunk.
    
      chunk.each do |hash|
        c = College.find_by_code( hash[:code] )
        c.majors ||= []
        c.majors << majorcode_to_name( hash[:major_code] )
        c.save
      end
    end
    

    Previous Answer:

    Assuming you already have a method majorcode_to_name() :

    require 'csv'
    
    csvAA = CSV.read( csv_filename )   # returns an array of arrays
    # => [["Code", "MajorCode", "Level"], ["123456", "98765", "2"], ["123456", "99999", "2"]] 
    
    headersA = csvAA.shift   # extract the headers into an array
    # => ["Code", "MajorCode", "Level"]
    
    # turn the "array of arrays" which contain the CVS data, into an "Array of Hashes":
    csvAH = csvAA.map {|row| Hash[*headersA.zip(row).flatten] }
    # => [{"Code"=>"123456", "MajorCode"=>"98765", "Level"=>"2"}, {"Code"=>"123456", "MajorCode"=>"99999", "Level"=>"2"}] 
    
    csvAH.each do |rowH|
      c = College.find_by_code( rowH['Code'] )
      c.majors ||= []                                 # initialize as empty array if it doesn't exist
      c.majors << majorcode_to_name( rowH['MajorCode'] )
      c.save
    end
    

    If your CSV file is really large, you may want to modify this code a little, so you don't have to read the whole data into RAM

    require 'csv'
    
    headersA = nil
    CSV.foreach do |row|
      if headersA.nil?
        headersA = row 
      else
        rowH = Hash[*headersA.zip(row).flatten]
        
        c = College.find_by_code( rowH['Code'] )
        c.majors ||= []                                 # initialize as empty array if it doesn't exist
        c.majors << majorcode_to_name( rowH['MajorCode'] )
        c.save
      end
    end
    

    EDIT: I should have mentioned why I converted the result of CSV.read into an array of hashes.. The main reason is to make the code independent of the order of the headers in the CSV file - which makes the code a bit more robust.