python-3.x · scrapy · scrapinghub

Scrapinghub: upload and use a file


I uploaded my spider to Scrapinghub. I understand how to upload it together with my *.txt file, but how do I use the file?

My setup.py file looks like this:

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    package_data = {
        'youtube_crawl': ['resources/Names.txt']
    },
    entry_points = {'scrapy': ['settings = youtube_crawl.settings']},
)
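
For context, my project layout is roughly this (Names.txt lives inside the youtube_crawl package, which is what the package_data key points at):

project/
├── scrapy.cfg
├── setup.py
└── youtube_crawl/
    ├── __init__.py
    ├── settings.py
    ├── spiders/
    └── resources/
        └── Names.txt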

Then I want to use this Names.txt.

Before uploading, my spider looked like this:

def parse(self, response):
    with open('resources/Names.txt', 'rt') as f:
        for link in f:
            url = "https://www.youtube.com/results?search_query={}".format(link)
            name = link.replace('+', ' ')
            yield Request(url, meta={'name': name}, callback=self.parse_page, dont_filter=True)

So my question is: how can I use my file on Scrapinghub?

I tried this code, but I don't understand how it works or how to integrate it with my code =)

data = pkgutil.get_data("youtube_crawl", "resources/Names.txt")

The function returns a binary string that is the contents of the specified resource.


Solution

  • This line of code:

    data = pkgutil.get_data("youtube_crawl", "resources/Names.txt")
    

    is roughly equivalent to this block (note the 'rb' mode: get_data() returns bytes, not text):

    with open('resources/Names.txt', 'rb') as f:
        data = f.read()
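
    The practical difference is that open() resolves the path against the current working directory, while pkgutil.get_data() resolves it against the installed package, which is why the open() version stops working once the project is deployed to Scrapinghub. A quick interactive sketch (assuming the package from the setup.py above, and that the file is UTF-8):

    >>> import pkgutil
    >>> data = pkgutil.get_data("youtube_crawl", "resources/Names.txt")
    >>> type(data)
    <class 'bytes'>
    >>> names = data.decode('utf-8').splitlines()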
    

    So now you can decode the bytes and read them line by line:

    def parse(self, response):
        data = pkgutil.get_data("youtube_crawl", "resources/Names.txt")

        # get_data() returns bytes in Python 3, so decode before splitting
        for link in data.decode('utf-8').splitlines():
            url = "https://www.youtube.com/results?search_query={}".format(link)
            name = link.replace('+', ' ')
            yield Request(url,
                          meta={'name': name},
                          callback=self.parse_page,
                          dont_filter=True)
    

    Take a look at the Python 3 pkgutil or input/output documentation pages for more details.
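
    If it helps to see it all in one place, here is a minimal sketch of a complete spider (the spider name and the parse_page body are placeholders, not taken from the original project):

    import pkgutil

    import scrapy
    from scrapy import Request


    class YoutubeSpider(scrapy.Spider):
        name = 'youtube'  # placeholder spider name

        def start_requests(self):
            # Read the packaged list once at startup instead of on every response
            data = pkgutil.get_data("youtube_crawl", "resources/Names.txt")
            for link in data.decode('utf-8').splitlines():
                if not link:
                    continue  # skip blank lines
                url = "https://www.youtube.com/results?search_query={}".format(link)
                yield Request(url,
                              meta={'name': link.replace('+', ' ')},
                              callback=self.parse_page,
                              dont_filter=True)

        def parse_page(self, response):
            # Placeholder: extract whatever your real parse_page needs
            self.logger.info("Fetched %s for %s", response.url, response.meta['name'])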