html ruby web-scraping nokogiri

Web Scraping with Nokogiri::HTML and Ruby - Output to CSV issue


I have a script that scrapes HTML article pages of a webshop. I'm testing with a set of 22 pages of which 5 article pages have a product description and the others don't.

This code puts the right info on screen:

    if doc.at_css('.product_description')
      doc.css('div > .product_description > p').each do |description|
        puts description
      end
    else
      puts "no description"
    end

But now I'm stuck on how to collect the found product descriptions in an array, from which I write them to a CSV file.

I tried several options, but none of them has worked so far. If I replace puts description with @description << description.content, all the descriptions end up in the top rows of the CSV, even though they do not belong to the articles in those rows.

When I also replace puts "no description" with @description = "no description", the first 14 rows in my CSV receive one letter of "no description" each. Looks funny, but it is not exactly what I need.

If more code is needed, just shout!

This is the CSV code I use in the script:

    CSV.open("artinfo.csv", "wb") do |row|
      row << ["category", "sub-category", "sub-sub-category", "price", "serial number", "title", "description"]
      (0..@prices.length - 1).each do |index|
        row << [
          @categories[index],
          @subcategories[index],
          @subsubcategories[index],
          @prices[index],
          @serial_numbers[index],
          @title[index],
          @description[index]
        ]
      end
    end

Solution

  • It sounds like your data isn't lined up properly. If it were, with every array holding exactly one entry per article, you should be able to do:

    CSV.open("artinfo.csv", "w") do |csv|
      csv << ["category", "sub-category", "sub-sub-category", "price", "serial number",  "title", "description"]
      [@categories, @subcategories, @subsubcategories, @prices, @serial_numbers, @title, @description].transpose.each do |row|
        csv << row
      end 
    end
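
The misalignment probably comes from how @description is filled. Using << inside the paragraph loop adds entries only for pages that have a description (and one entry per paragraph), so the array is shorter than the others and its values slide up into the top rows. Assigning a String replaces the array altogether, so @description[index] returns a single character. Below is a minimal sketch of one way to keep the columns aligned, assuming exactly one description is stored per page. The pages data, the description_for helper and the reduced column set are hypothetical stand-ins for the real scraping results, not part of the original script.

```ruby
require "csv"

# Hypothetical stand-in for the scraped pages: each entry has a title and
# the text of every <p> found under .product_description (empty if none).
pages = [
  { title: "Table", paragraphs: ["Sturdy oak table.", "Seats six."] },
  { title: "Chair", paragraphs: [] },
  { title: "Mug",   paragraphs: ["Blue ceramic mug."] }
]

# Return exactly one string per page so every column keeps the same length.
def description_for(paragraphs)
  paragraphs.empty? ? "no description" : paragraphs.join(" ")
end

titles       = pages.map { |page| page[:title] }
descriptions = pages.map { |page| description_for(page[:paragraphs]) }

# With equal-length columns, transpose yields one row per article.
csv_text = CSV.generate do |csv|
  csv << ["title", "description"]
  [titles, descriptions].transpose.each { |row| csv << row }
end

# Why single letters appeared: indexing a String returns one character.
letters = (0..2).map { |i| "no description"[i] }

# Why unequal columns cannot be transposed: Array#transpose raises IndexError.
begin
  [titles, ["only one description"]].transpose
  lengths_ok = true
rescue IndexError
  lengths_ok = false
end
```

In the real script, the same idea would mean pushing exactly one value to @description per page, inside the loop that visits the pages, rather than one value per paragraph.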