python python-3.x scraper

Retrieving URL out of a file for scraping


I have made a scraper and I would like the page_link = "..." line to read each URL saved in a JSON, XML, or SQL file instead of a single hard-coded one.

Could someone point me in the right direction so I can learn how to make this dynamic instead of static?

You don't have to give me the answer; just point me towards where I can learn more about what I should do. I'm still learning.

    from bs4 import BeautifulSoup
    import requests

    print('step 1')
    # get url
    page_link = "<random website with info>"
    print('step 2')
    # open page
    page_response = requests.get(page_link, timeout=1)
    print('step 3')
    # parse page
    page_content = BeautifulSoup(page_response.content, "html.parser")
    print('step 4')
    # name of the page
    naam = page_content.find_all(class_='<random class>')[0].decode_contents()
    print('step 5')
    # print it
    print(naam)

Solution

  • JSON seems like the right tool for the job; XML and SQL are a bit heavy-handed for the simple functionality you need. Furthermore, Python has built-in JSON reading/writing in the json module, and a JSON object maps almost directly onto a Python dict.

    Just maintain a list of sites you want to hit in a json file similar to this one (put it in a file called test.json):

    {
        "sites": ["www.google.com",
                  "www.facebook.com",
                  "www.example.com"]
    }
    
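
    If you prefer to generate that file from code instead of writing it by hand, here is a minimal sketch using the standard-library json module (the file name test.json just matches the example above):

    import json

    sites = {"sites": ["www.google.com",
                       "www.facebook.com",
                       "www.example.com"]}

    # indent=4 is optional; it only makes the file easier to read
    with open('test.json', 'w') as my_json:
        json.dump(sites, my_json, indent=4)
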

    Then do your scraping for each of these sites:

    import json
    with open('test.json') as my_json:
        json_dict = json.load(my_json)
    for website in json_dict["sites"]:
        print("About to scrape: ", website)
    
        # do scraping
        page_link = website
        ...
    

    This outputs (if you remove the ...):

    About to scrape:  www.google.com
    About to scrape:  www.facebook.com
    About to scrape:  www.example.com
    

    Just put the rest of the scraping logic (like you have in the question above) under the # do scraping comment.
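
    Putting it together, here is one sketch of what the full loop could look like with your BeautifulSoup code folded in. The class name and site list are placeholders carried over from your question, and a scheme (https://) is prepended because requests needs full URLs, not bare hostnames like www.example.com:

    import json
    import requests
    from bs4 import BeautifulSoup

    with open('test.json') as my_json:
        json_dict = json.load(my_json)

    for website in json_dict["sites"]:
        # requests needs a full URL, so prepend a scheme
        page_link = "https://" + website
        print("About to scrape: ", page_link)

        # a failed request shouldn't stop the whole run
        try:
            page_response = requests.get(page_link, timeout=5)
        except requests.RequestException as err:
            print("Skipping", page_link, "->", err)
            continue

        page_content = BeautifulSoup(page_response.content, "html.parser")

        # '<random class>' is the placeholder from the question
        matches = page_content.find_all(class_='<random class>')
        if matches:
            naam = matches[0].decode_contents()
            print(naam)

    This way, adding or removing a site is just an edit to test.json; the scraper itself never has to change.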