python, web-scraping, python-requests, scrapy

Scrapy: value passed using meta is never updated


I'm trying to scrape some websites. I already have the data and I'm trying to pass it along using meta={}. The problem appears when I use yield scrapy.Request to go to the next callback: I send a new URL to that callback and use meta to pass the JSON value. The callback receives the new URL, but the JSON data is never updated; it always gets the same values. I have no idea what's going on.

You can see my code below. I'm trying to pass the JSON values, but I only ever get the same JSON, even though the URL is updated to the new one.

def first_function(self, response):
    value_json = self.get_json()  # I get the JSON from here
    for key, value in value_json.items():  # loop over the JSON
        for values in value:

            # `values` is a dictionary of dictionaries,
            # e.g. {"blabla": {"key1": "value1", "key2": "value2"}}
            # The condition below checks that the key ("blabla") exists;
            # if it does, I take the inner value {"key1": "value1", "key2": "value2"}

            if values == "blabla":
                get_url = "http://www.example.com/"
                yield Request(
                    url=get_url+values["id_url"],
                    meta={"data_rest":values}, 
                    callback=self.second_function
                )

def second_function(self, response):
    # ============== PROBLEM =====================
    # The problem is here!
    # I always get a new URL from first_function; my logic is that a new URL should come with new JSON,
    # but the JSON ("data_rest") is never updated. The same JSON is always sent to this function.
    # ============== PROBLEM =====================

    json_data = response.meta["data_rest"]
    names = response.css()  # get the tags here (selector omitted)
    for get_data in names:
        sub_url = get_data.css("a::attr(href)").extract()
        for loop_url_menu in sub_url:
            yield scrapy.Request(
                url=loop_url_menu, 
                headers=self.session,
                meta={
                    'dont_redirect': True,
                    'handle_httpstatus_list': [302]
                }, callback=self.next_function
            )
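
For what it's worth, my understanding is that meta is attached per request, so each yielded Request should carry its own value. Here is a minimal sketch of what I expect to happen (the spider name, start URL, and data below are just placeholders, not my real code):

import scrapy


class MetaDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate per-request meta passing;
    # the name, start URL, and items are placeholders.
    name = "meta_demo"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        items = [{"id_url": "a"}, {"id_url": "b"}]
        for item in items:
            # dict(item) copies the data so every request carries its own snapshot
            yield scrapy.Request(
                url=response.urljoin(item["id_url"]),
                meta={"data_rest": dict(item)},
                callback=self.parse_detail,
                dont_filter=True,
            )

    def parse_detail(self, response):
        # each callback sees the value that was attached to *its* own request
        self.logger.info("url=%s data=%s", response.url, response.meta["data_rest"])

Copying the dict with dict(item) is only there to make sure every request gets its own snapshot instead of a shared reference.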

Solution

  • Good news!! I solved it. You just have to append the values to a list first and then yield the request after the loop (a cb_kwargs alternative is sketched after the code).

    def first_function(self, response):
        temp = []
        value_json = self.get_json()  # I get the JSON from here
        for key, value in value_json.items():  # loop over the JSON
            for values in value:
                if values == "blabla":
                    get_url = "http://www.example.com/"
                    temp.append(values)

        # yield once, after the loop, with the whole list in meta
        yield Request(
            url=get_url + values["id_url"],
            meta={"data_rest": temp},
            callback=self.second_function
        )
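
    As a side note, on Scrapy 1.7+ the same idea can also be written with cb_kwargs instead of meta. This is only a sketch reusing the get_json() helper and the "blabla" key from the question; the URL is a placeholder and the methods sit inside the same spider class:

    def first_function(self, response):
        temp = []
        value_json = self.get_json()  # same helper assumed as in the question
        for key, value in value_json.items():
            for values in value:
                if values == "blabla":
                    temp.append(values)

        # cb_kwargs (Scrapy >= 1.7) hands the data to the callback as a named
        # argument, so there is no need to read response.meta there
        yield scrapy.Request(
            url="http://www.example.com/",
            cb_kwargs={"data_rest": temp},
            callback=self.second_function,
        )

    def second_function(self, response, data_rest):
        self.logger.info("got %d entries for %s", len(data_rest), response.url)

    With cb_kwargs the data arrives as a named argument of the callback, which keeps response.meta free for Scrapy's own keys such as dont_redirect and handle_httpstatus_list.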