I have made a scraper, and I would like the page_link = "" variable to iterate over each URL that is saved in a JSON, XML, or SQL file.
Could someone point me in the right direction so I can learn how to make this dynamic instead of static?
You don't have to give me the answer, just point me towards where I can learn more about what I should do. I'm still learning.
from bs4 import BeautifulSoup
import requests
print('step 1')
#get url
page_link = "<random website with info>"
print('step 2')
#open page
page_response = requests.get(page_link, timeout=1)
print('step 3')
#parse page
page_content = BeautifulSoup(page_response.content, "html.parser")
print('step 4')
#name of the page
naam = page_content.find_all(class_='<random class>')[0].decode_contents()
print('step 5')
#print it
print(naam)
JSON seems like the right tool for the job. XML and SQL are a bit heavy-handed for the simple functionality that you need. Furthermore, Python has built-in JSON reading/writing functionality (a JSON object is similar enough to a Python dict in a lot of respects).
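To see how close the two are, here is a minimal round-trip sketch (nothing in it comes from your code; it only illustrates the standard-library json module):

import json

# A plain Python dict survives a round trip through JSON text unchanged.
sites = {"sites": ["www.google.com", "www.example.com"]}

text = json.dumps(sites, indent=4)   # dict -> JSON string
back = json.loads(text)              # JSON string -> dict

print(back["sites"])  # ['www.google.com', 'www.example.com']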
Just maintain a list of sites you want to hit in a JSON file similar to this one (put it in a file called test.json):
{
    "sites": ["www.google.com",
              "www.facebook.com",
              "www.example.com"]
}
Then do your scraping for each of these sites:
import json

with open('test.json') as my_json:
    json_dict = json.load(my_json)

for website in json_dict["sites"]:
    print("About to scrape:", website)
    # do scraping
    page_link = website
    ...
This outputs (if you remove the ...):
About to scrape: www.google.com
About to scrape: www.facebook.com
About to scrape: www.example.com
Just put the rest of the logic you want to use to do the scraping (like you have above in the question) under the # do scraping comment, as in the sketch below.
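For what it's worth, folding your original code into the loop might look something like this. Two assumptions to flag: the <random class> placeholder is still yours to fill in, and requests.get needs a full URL with a scheme, so the bare hostnames from test.json get an https:// prefix first.

from bs4 import BeautifulSoup
import requests
import json

def scrape(page_link):
    #open page (timeout bumped from 1s to 5s; 1s is easy to hit on slow sites)
    page_response = requests.get(page_link, timeout=5)
    #parse page
    page_content = BeautifulSoup(page_response.content, "html.parser")
    #name of the page (fill in your own class here)
    naam = page_content.find_all(class_='<random class>')[0].decode_contents()
    print(naam)

with open('test.json') as my_json:
    json_dict = json.load(my_json)

for website in json_dict["sites"]:
    print("About to scrape:", website)
    #requests needs a scheme; the example file stores bare hostnames
    if not website.startswith("http"):
        website = "https://" + website
    scrape(website)

If one dead site shouldn't stop the rest, the usual next step is to wrap the scrape(website) call in a try/except that catches requests.exceptions.RequestException.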