Tags: web-scraping, scrapy, scrapy-splash

Scrapy: scraping a page without <a> nodes / href attributes


I hope you are doing well! I need your support, please. I'm trying to scrape this web page: https://servicio.mapa.gob.es/regfiweb# Once you enter, you have to go to:

  1. Buscadores.
  2. Productos.

I'd like to download all the PDF files, but there are no <a> nodes with href attributes. Instead there are buttons with data-id attributes that trigger the PDF download via JavaScript.

I've been trying but have been totally unsuccessful.

Do you have an idea?

I want to navigate page by page and download the PDFs on every page.

My code:

import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"
    # allowed_domains = ["x"]
    start_urls = ["https://servicio.mapa.gob.es/regfiweb#"]
    
    def parse(self, response):
        for book in response.css('.col'):
            title = book.css('span ::text').get()
            link = response.urljoin(
                # book.css('a.pdf ::attr(href)').get()
                book.css('a::attr(href)').get()
            )
            yield {
                'Title': title,
                'file_urls': [link]
            }

Solution

  • If you open DevTools and go to the Network tab you can see the URL of the "products" page. You can go there, fill in the form and submit it. (Screenshot: DevTools Network tab)
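
    One way to go from there is to fill in and submit that search form with Scrapy itself. This is only a sketch under assumptions: the start URL and the parse_results callback are hypothetical, and the field name NombreComercial is taken from the grid request shown further down.

    import scrapy


    class ProductosFormSpider(scrapy.Spider):
        # Hypothetical spider; the start URL is a guess at the "products" page.
        name = "productos_form"
        start_urls = ["https://servicio.mapa.gob.es/regfiweb/Productos"]

        def parse(self, response):
            # from_response picks up the form's hidden fields automatically.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"NombreComercial": ""},  # empty search term, adjust as needed
                callback=self.parse_results,
            )

        def parse_results(self, response):
            # Placeholder: parse the returned results page here.
            self.logger.info("Got %d bytes of results", len(response.body))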

    If you stay on the page and hit the search button you can see how the page requests the results. I find that easier, so that's what I'm going to do. You can see the number at the end of the URL; to me it looks like a unix timestamp. (Screenshot: search page URL)

    You can verify it (for example with CyberChef). (Screenshot: CyberChef unix timestamp conversion)
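
    If you'd rather check it in Python, the value is a unix timestamp in milliseconds, so divide by 1000 before converting. A quick sketch with a made-up example value:

    from datetime import datetime, timezone

    # Example value only; paste the number from the end of the captured URL here.
    ts_ms = 1672531200000
    # Milliseconds -> seconds, interpreted as UTC.
    print(datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc))
    # 2023-01-01 00:00:00+00:00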

    If we inspect the PDF download button, we'll see this click handler:

    function(n) {
      n.stopImmediatePropagation();
      n.stopPropagation();
      n.preventDefault();
      var i = $("#exportFichaProductoPdf").val(),
        t = {};
      t.idProducto = parseInt(this.dataset.id);
      exportPdf(i, t)
    }
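
    Note that the handler reads the export endpoint from the hidden #exportFichaProductoPdf input and the product id from the button's data-id attribute. If you prefer not to hardcode the endpoint in your spider, you could pull it out of the page with a selector instead. A small sketch, assuming the path is stored in the input's value attribute and that response is the page containing the download buttons:

    # Read the export endpoint from the hidden input instead of hardcoding it
    # (assumes the path sits in the "value" attribute of #exportFichaProductoPdf).
    export_path = response.css('#exportFichaProductoPdf::attr(value)').get()
    if export_path:
        export_url = response.urljoin(export_path)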
    

    After looking for the exportPdf function this is what you'll find:

    exportPdf = function (n, t) {
            $.ajax({
                url: n,
                data: { dataDto: t },
                type: 'POST',
                xhrFields: { responseType: 'blob' },
                success: function (n, t, i) {
                    var s, u;
                    console.log(n);
                    var e = new Blob([n], { type: 'application/pdf' }), r = document.createElement('a'), o = '', f = i.getResponseHeader('Content-Disposition');
                    f && f.indexOf('attachment') !== -1 && (s = /filename[^;=\n]*=((['"]).*?\2|[^;\n]*)/, u = s.exec(f), u != null && u[1] && (o = u[1].replace(/['"]/g, '')));
                    /^((?!chrome|android).)*safari/i.test(navigator.userAgent) ? window.open(window.URL.createObjectURL(e), '_blank') : (r.href = window.URL.createObjectURL(e), r.target = '_blank', r.download = o, document.body.appendChild(r), r.click());
                },
                beforeSend: function () {
                    $('#loadingDiv').show();
                },
                complete: function () {
                    $('#loadingDiv').hide();
                },
                error: function (n) {
                    console.log(n.status);
                    console.log(n.responseText);
                    var t = document.createElement('div');
                    // Message: "The document is not available at the moment; we are working so that you can obtain it as soon as possible."
                    t.innerHTML = 'En este momentos el documento no está disponible, estamos trabajando para que pueda obtenerlo lo antes posible.';
                    swal({
                        title: 'Documento no disponible',
                        content: t,
                        icon: 'warning'
                    });
                }
            });
        }
    

    So basically we need the URL and the id, so we can recreate the request:

    import scrapy
    import os
    import time
    
    
    def get_timestamp_ms():
        return int(time.time() * 1000)
    
    
    class SimpleSpider(scrapy.Spider):
        name = "simple"
        allowed_domains = ["servicio.mapa.gob.es"]
        base_url = "https://servicio.mapa.gob.es/regfiweb/Productos/ProductosGrid?NombreComercial=&Titular=&NumRegistro=&Fabricante=&IdSustancia=-1&IdEstado=1&IdAmbito=undefined&IdPlaga=-1&IdFuncion=-1&IdCultivo=-1&IdSistemaCultivo=-1&IdTipoUsuario=-1&AncestrosCultivos=false&AncestrosPlagas=false&FecRenoDesde=&FecRenoHasta=&FecCaduDesde=&FecCaduHasta=&FecModiDesde=&FecModiHasta=&FecInscDesde=&FecInscHasta=&FecLimiDesde=&FecLimiHasta=&productosGrid-page={}&_={}"
        base_dir = "downloads"
    
        def start_requests(self):
            page = 1
            yield scrapy.Request(url=self.base_url.format(str(page), str(get_timestamp_ms())), cb_kwargs={'page': page})
    
        def parse(self, response, page):
            # debugging helper, uncomment to open the grid response in your browser:
            # from scrapy.shell import open_in_browser
            # open_in_browser(response)
            for book in response.xpath('//tr[not(ancestor::thead)]'):
                title = book.xpath('./td[5]//text()').get(default="")
                file_id = book.xpath('./td[last()]/button/@data-id').get()
                if file_id:
                    yield scrapy.FormRequest(url="https://servicio.mapa.gob.es/regfiweb/Productos/ExportFichaProductoPdf", formdata={"idProducto": file_id}, method="POST", cb_kwargs={"title": f"{title}_{file_id}"}, callback=self.download_pdf, dont_filter=True)
    
            # pagination
            page += 1
            num_pages = response.xpath('//button[@data-page][last()]/@data-page').get(default=0)
            # use <= so the last page is requested as well
            if page <= int(num_pages):
                yield scrapy.Request(url=self.base_url.format(str(page), str(get_timestamp_ms())), cb_kwargs={'page': page})
    
        def download_pdf(self, response, title):
            filename = os.path.join(self.base_dir, title+".pdf")
            with open(filename, 'wb') as f:
                f.write(response.body)
    

    You need to ensure that the base_dir exists before you run the code.
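
    If you prefer, you can create it from Python before starting the crawl (or at the top of start_requests):

    import os

    # Create the download directory up front so download_pdf can write into it.
    os.makedirs("downloads", exist_ok=True)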

    This is just an example; you may want to tweak the pagination and plug in your own search queries, as sketched below.
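
    For example, all the parameter names in base_url (NombreComercial, Titular, IdEstado, and so on) come straight from the request captured in DevTools, so you can build the query string yourself. A hedged sketch; whether the endpoint accepts a subset of the parameters is an assumption, so you may need to keep all of them with their default values:

    import time
    from urllib.parse import urlencode

    # Parameter names come from the grid URL captured in DevTools;
    # "glifosato" is just a placeholder search term, adjust it to your query.
    params = {
        "NombreComercial": "glifosato",
        "IdEstado": 1,
        "productosGrid-page": 1,
        "_": int(time.time() * 1000),  # same millisecond timestamp as above
    }
    url = "https://servicio.mapa.gob.es/regfiweb/Productos/ProductosGrid?" + urlencode(params)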