Trust you are doing well! I need your support, please. I'm trying to scrape this web page: https://servicio.mapa.gob.es/regfiweb# Once you enter, you must go to:
I'd like to download all the PDF files, but there are no nodes with href attributes. Instead there are buttons with data-id attributes that trigger the PDF download via JavaScript.
I've been trying, but with no success so far.
Do you have an idea?
I want to navigate page by page and download the PDFs from every page.
My code:
import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"
    # allowed_domains = ["x"]
    start_urls = ["https://servicio.mapa.gob.es/regfiweb#"]

    def parse(self, response):
        for book in response.css('.col'):
            title = book.css('span ::text').get()
            link = response.urljoin(
                # book.css('a.pdf ::attr(href)').get()
                book.css('a::attr(href)').get()
            )
            yield {
                'Title': title,
                'file_urls': [link],
            }
If you open devtools and go to the Network tab, you can see the URL of the "products" page. You can go there, fill in the form, and submit it.
If you stay on the page and hit the search button, you can see how the page requests the results. I find that easier, so that's what I'm going to do. Note the number at the end of the URL; to me it looks like a Unix timestamp in milliseconds.
You can verify it (for example with CyberChef).
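Or with a quick Python check (assuming the value is milliseconds since the Unix epoch):

from datetime import datetime, timezone

ts_ms = 1700000000000  # example value; copy the real one from the request URL
print(datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc))
# 2023-11-14 22:13:20+00:00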
If we inspect the PDF download button's click handler, we'll see this:
function(n) {
    n.stopImmediatePropagation();
    n.stopPropagation();
    n.preventDefault();
    // the export endpoint URL is stored as the value of #exportFichaProductoPdf
    var i = $("#exportFichaProductoPdf").val(),
        t = {};
    // the product id comes from the clicked button's data-id attribute
    t.idProducto = parseInt(this.dataset.id);
    exportPdf(i, t)
}
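So the endpoint URL sits in the value of #exportFichaProductoPdf and the product id in each button's data-id. As a sketch with parsel (the selector library Scrapy uses), and assuming #exportFichaProductoPdf is an input whose value attribute holds the URL, you could pull both out of the page HTML like this:

from parsel import Selector

sel = Selector(text=html)  # html: the page source as a string (placeholder)
endpoint = sel.css('#exportFichaProductoPdf::attr(value)').get()
ids = sel.css('button[data-id]::attr(data-id)').getall()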
After searching for the exportPdf function, this is what you'll find:
exportPdf = function (n, t) {
    $.ajax({
        url: n,
        data: { dataDto: t },
        type: 'POST',
        xhrFields: { responseType: 'blob' },
        success: function (n, t, i) {
            var s, u;
            console.log(n);
            // wrap the binary response in a Blob and read the filename
            // from the Content-Disposition header
            var e = new Blob([n], { type: 'application/pdf' }),
                r = document.createElement('a'),
                o = '',
                f = i.getResponseHeader('Content-Disposition');
            f && f.indexOf('attachment') !== -1 && (
                s = /filename[^;=\n]*=((['"]).*?\2|[^;\n]*)/,
                u = s.exec(f),
                u != null && u[1] && (o = u[1].replace(/['"]/g, ''))
            );
            // Safari opens the blob in a new tab; other browsers click a
            // temporary <a download> element to save the file
            /^((?!chrome|android).)*safari/i.test(navigator.userAgent)
                ? window.open(window.URL.createObjectURL(e), '_blank')
                : (r.href = window.URL.createObjectURL(e), r.target = '_blank', r.download = o, document.body.appendChild(r), r.click());
        },
        beforeSend: function () {
            $('#loadingDiv').show();
        },
        complete: function () {
            $('#loadingDiv').hide();
        },
        error: function (n) {
            console.log(n.status);
            console.log(n.responseText);
            var t = document.createElement('div');
            // roughly: "The document is not available right now; we are working
            // so that you can obtain it as soon as possible."
            t.innerHTML = 'En este momentos el documento no está disponible, estamos trabajando para que pueda obtenerlo lo antes posible.';
            swal({
                title: 'Documento no disponible',
                content: t,
                icon: 'warning'
            });
        }
    });
}
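Before wiring this into a spider, you can sanity-check the endpoint with a one-off requests call. A minimal sketch (the URL is the one the page posts to; "12345" is a placeholder for a real data-id taken from the grid):

import requests

url = "https://servicio.mapa.gob.es/regfiweb/Productos/ExportFichaProductoPdf"
resp = requests.post(url, data={"idProducto": "12345"})  # placeholder id
resp.raise_for_status()
with open("producto_12345.pdf", "wb") as f:
    f.write(resp.content)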
So basically we need the URL and the id to recreate the request. Scaled up to a Scrapy spider that walks the grid page by page:
import os
import time

import scrapy


def get_timestamp_ms():
    # the trailing "_" parameter in the grid URL is the current time in milliseconds
    return int(time.time() * 1000)


class SimpleSpider(scrapy.Spider):
    name = "simple"
    allowed_domains = ["servicio.mapa.gob.es"]
    base_url = "https://servicio.mapa.gob.es/regfiweb/Productos/ProductosGrid?NombreComercial=&Titular=&NumRegistro=&Fabricante=&IdSustancia=-1&IdEstado=1&IdAmbito=undefined&IdPlaga=-1&IdFuncion=-1&IdCultivo=-1&IdSistemaCultivo=-1&IdTipoUsuario=-1&AncestrosCultivos=false&AncestrosPlagas=false&FecRenoDesde=&FecRenoHasta=&FecCaduDesde=&FecCaduHasta=&FecModiDesde=&FecModiHasta=&FecInscDesde=&FecInscHasta=&FecLimiDesde=&FecLimiHasta=&productosGrid-page={}&_={}"
    base_dir = "downloads"

    def start_requests(self):
        page = 1
        yield scrapy.Request(
            url=self.base_url.format(page, get_timestamp_ms()),
            cb_kwargs={'page': page},
        )
    def parse(self, response, page):
        # debugging helper: preview the grid response in your browser
        from scrapy.shell import open_in_browser
        open_in_browser(response)

        for book in response.xpath('//tr[not(ancestor::thead)]'):
            title = book.xpath('./td[5]//text()').get(default="")
            # the download button carries the product id in its data-id attribute
            file_id = book.xpath('./td[last()]/button/@data-id').get()
            if file_id:
                yield scrapy.FormRequest(
                    url="https://servicio.mapa.gob.es/regfiweb/Productos/ExportFichaProductoPdf",
                    formdata={"idProducto": file_id},
                    method="POST",
                    cb_kwargs={"title": f"{title}_{file_id}"},
                    callback=self.download_pdf,
                    dont_filter=True,
                )
        # pagination: the last data-page button holds the highest page number,
        # so use <= here, otherwise the final page is skipped
        page += 1
        num_pages = response.xpath('//button[@data-page][last()]/@data-page').get(default=0)
        if page <= int(num_pages):
            yield scrapy.Request(
                url=self.base_url.format(page, get_timestamp_ms()),
                cb_kwargs={'page': page},
            )
    def download_pdf(self, response, title):
        filename = os.path.join(self.base_dir, title + ".pdf")
        with open(filename, 'wb') as f:
            f.write(response.body)
You need to ensure that base_dir exists before you run the code.
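For example, create it up front with the standard library:

import os

os.makedirs("downloads", exist_ok=True)  # "downloads" matches base_dir above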
This is just an example; the spider already handles pagination, but you may want to tweak the search parameters baked into base_url.
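If you keep the spider in a single file (say simple.py), you can run it without a full Scrapy project with scrapy runspider simple.py.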