rubyweb-scrapingcookiesnokogirimechanize

Parsing with Ruby, Nokogiri & Mechanize java cookies links in a webpage


everyone.

I need to parse a webpage which has java cookies set for every link. I can parse the normal search and every product is shown and imported to a mysql database.

I was able to scrape from a search result every product with its elements with this code:

This is what I have:

    require 'rubygems'
    require 'logger'
    require 'mechanize'
    require 'mysql2'
    
    agent = WWW::Mechanize.new{|a| a.log = Logger.new(STDERR) }
    #agent.set_proxy('a-proxy', '8080')
    agent.read_timeout = 60
    
    def add_cookie(agent, uri, cookie)
      uri = URI.parse(uri)
      Mechanize::Cookie.parse(uri, cookie) do |cookie|
        agent.cookie_jar.add(uri, cookie)
      end
    end
    
    
    # get main page
    page = agent.get "http://www.site.com.mx"
    
    # get login form
    form = page.forms.first
    form.correo_ingresar = "user"
    form.password = "password"
    
    # submit login form
    page = agent.submit form
    
    # parse cookies
    myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
    
    # set session cookies
    myarray.each do |item|
      add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
    end
    # show 1000 search results per page
    add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
    
    # order results
    add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
    
    # section results
    add_cookie (agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
    
    # get main page
    page = agent.get "http://www.site.com.mx/tienda/index.php"
    
    search_form = page.forms.first
    
    search_result = agent.submit search_form
    
    doc = Nokogiri::HTML(search_result.body)
    
    rows = doc.css("table.articulos tr")
    
    i = 0
    details = rows.collect do |row|
      detail = {}
      [
        [:sku, 'td[3]/text()'],
        [:desc, 'td[4]/text()'],
        [:qty, 'td[5]/text()'],
        [:qty2, 'td[5]/p/b/text()'],
        [:price, 'td[6]/text()']
      ].collect do |name, xpath|
        detail[name] = row.at_xpath(xpath).to_s.strip
      end
      i = i + 1
      detail
    end
    
    # walk through paginator links
    links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq!
    
    links.each do |l|
        page = agent.get l
    
        doc = Nokogiri::HTML(page.body)
    
        rows = doc.css("table.articulos tr")
    
        rows.each do |row|
            detail = {}
            [
                    [:sku, 'td[3]/text()'],
                    [:desc, 'td[4]/text()'],
                    [:qty, 'td[5]/text()'],
                    [:qty2, 'td[5]/p/b/text()'],
                    [:price, 'td[6]/text()']
            ].collect do |name, xpath|
                    detail[name] = row.at_xpath(xpath).to_s.strip
            end
            details << detail
        end
    end
    
    # update db
    client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
    
    details.each do |d|
        if d[:sku] != ""
            price = d[:price].split
    
            if price[1] == "D"
                currency = 144
            else
                currency = 168
            end
    
            cost = price[0].gsub(",", "").to_f
    
            if d[:qty] == ""
                qty = d[:qty2]
            else
                qty = d[:qty]
            end 
    
            results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{d[:sku]}' LIMIT 1;")
            if results.count == 1
                product = results.first
    
                            client.query("UPDATE jos_vm_product SET product_sku = '#{d[:sku]}', product_name = '#{d[:desc]}', product_desc = '#{d[:desc]}', product_in_stock = '#{qty}' WHERE product_id = 
    #{product['product_id']};")
    
                client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
            else
                client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{d[:sku]}', '#{d[:desc]}', '#{d[:desc]}', '#{qty}');")
                last_id = client.last_id
    
                client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
            end
        end
    end

Now I dont want to search I want to parse from the Categories list:
link to main page:http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11 this shows a table like this (everything contains links) The top name: ACCESORIOS is a link to the category ACCESORIOS, and the bold names listed bellow is the subcategories, and the ones bellow the bold names are brands. If I click on ACCESORIOS it will show every brand and every subcategory mixed up, and so on.

ACCESORIOS
Accesorios Multimedia(6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta(1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels(1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales(1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio(32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod(3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas(3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes(13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares(14)
BLACKBERRY (14)
Adaptador Bluetooth(6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado(3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos(49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)

Here is the code for the Table that has cookies for each link, that is why I have been having a hard time scraping this.

    <table width="95%" cellspacing="0" cellpadding="3" border="0">
    <tbody>
    <tr>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
    </tr>
    <tr>
    <td width="20" valign="top" align="left"></td>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    </td>
    </tr>
    </tbody>
    </table>

so the question is what do I add to my code to be able to access every link? if it uses java cookies.

Cookies used:
Name , Value Ranges
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385


Solution

  • I managed to scrape one of those links contents by adding cookies to my Ruby code:

        # set cookies
        add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")
    
        add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")
    
        add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")
    
        add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")
    

    weird thing was that if I only added one of those cookies it would not work. so I had to add all , even tho they dont have any values, because every link has a cookie, so that way it would delete or clear saved cookie.

    now I need to scrape those cookies use it as variable and do a loop or something, anybody can help me?

    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>