I have made a scraper, and I would like the page_link = "" variable to iterate over each URL that is saved in a JSON, XML, or SQL file.
Could someone point me in the right direction so I can learn how to make this dynamic instead of static?
You don't have to give me the answer, just point me towards where I can learn more about what I should do. I'm still learning.
from bs4 import BeautifulSoup
import requests
print('step 1')
#get url
page_link = "<random website with info>"
print('step 2')
#open page
page_response = requests.get(page_link, timeout=1)
print('step 3')
#parse page
page_content = BeautifulSoup(page_response.content, "html.parser")
print('step 4')
#name of the page
naam = page_content.find_all(class_='<random class>')[0].decode_contents()
print('step 5')
#print it
print(naam)
JSON seems like the right tool for the job. XML and SQL are a bit heavy-handed for the simple functionality that you need. Furthermore, Python has built-in JSON reading/writing functionality (a JSON object is similar enough to a Python dict in a lot of respects).
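To see how close the two are, here is a minimal round-trip sketch (nothing in it comes from your code; it only illustrates the standard-library json module):

import json

# A plain Python dict survives a round trip through JSON text unchanged.
sites = {"sites": ["www.google.com", "www.example.com"]}

text = json.dumps(sites, indent=4)   # dict -> JSON string
back = json.loads(text)              # JSON string -> dict

print(back["sites"])  # ['www.google.com', 'www.example.com']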
Just maintain a list of sites you want to hit in a JSON file similar to this one (put it in a file called test.json):
{
    "sites": ["www.google.com",
              "www.facebook.com",
              "www.example.com"]
}
Then do your scraping for each of these sites:
import json

with open('test.json') as my_json:
    json_dict = json.load(my_json)

for website in json_dict["sites"]:
    print("About to scrape:", website)
    # do scraping
    page_link = website
    ...
This outputs (if you remove the ...):
About to scrape: www.google.com
About to scrape: www.facebook.com
About to scrape: www.example.com
Just put the rest of the logic you want to use to do the scraping (like you have above in the question) under the # do scraping comment, as in the sketch below.
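For what it's worth, folding your original code into the loop might look something like this. Two assumptions to flag: the <random class> placeholder is still yours to fill in, and requests.get needs a full URL with a scheme, so the bare hostnames from test.json get an https:// prefix first.

from bs4 import BeautifulSoup
import requests
import json

def scrape(page_link):
    #open page (timeout bumped from 1s to 5s; 1s is easy to hit on slow sites)
    page_response = requests.get(page_link, timeout=5)
    #parse page
    page_content = BeautifulSoup(page_response.content, "html.parser")
    #name of the page (fill in your own class here)
    naam = page_content.find_all(class_='<random class>')[0].decode_contents()
    print(naam)

with open('test.json') as my_json:
    json_dict = json.load(my_json)

for website in json_dict["sites"]:
    print("About to scrape:", website)
    #requests needs a scheme; the example file stores bare hostnames
    if not website.startswith("http"):
        website = "https://" + website
    scrape(website)

If one dead site shouldn't stop the rest, the usual next step is to wrap the scrape(website) call in a try/except that catches requests.exceptions.RequestException.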