Please read before you comment. I'm trying to make a simple website scraper, but I run into this error: it scrapes pages for script URLs, and it's apparently capturing:
SCRIPT URL : https://accounts.google.com/ServiceLogin?service=youtube&uilel=3&passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26hl%3Den-GB%26next%3D%252Fsignin_passive%26feature%3Dpassive&hl=en-GB" style="display: none"><input id="search"... etc
using this script:
import re
import requests
sitemap = requests.get("https://youtube.com").text
javascriptmatches = re.findall("src=\"(https?://.*?\.js)\??.*?\"",sitemap)
for x in javascriptmatches:
    print(x)
Update: I'm using the built-in "re" library.
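For some reason it seems to be ignoring the "src=\"" part of the regex string, but the script URL you mention that doesn't end with .js actually does end with it about 20k characters later, once the regex finds the end of the actual 9th match. The ".*?" in the pattern is not stopped by the closing quote of the attribute, so it keeps going until it reaches the next ".js".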
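To illustrate the overrun described above, here is a minimal sketch on a made-up HTML snippet (the example.com URLs and the variable names are assumptions for illustration, not the real YouTube markup). It compares the original pattern with one possible stricter variant that uses [^"] so the match cannot leave the attribute value:

```python
import re

# Hypothetical HTML standing in for the real page: a non-script src first,
# then an actual script tag.
html = (
    '<iframe src="https://example.com/login?x=1" style="display: none"></iframe>'
    '<script src="https://example.com/app.js?v=2"></script>'
)

# Original pattern: ".*?" is lazy, but it may still cross the closing quote
# and run on until the first ".js" it can find.
loose = re.findall(r'src="(https?://.*?\.js)\??.*?"', html)

# Stricter variant: [^"] cannot match a quote, so the match stays inside
# a single attribute value.
strict = re.findall(r'src="(https?://[^"]*?\.js)(?:\?[^"]*)?"', html)

print(loose)
print(strict)
```

The loose pattern returns one long string that starts at the iframe URL and ends at app.js, while the strict one returns only the script URL. This only narrows the regex; it still does not check that the src belongs to a script tag.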
You can use this instead to find the src of <script> tags in HTML pages:
import requests
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers do not share results.
        self.javascriptmatches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag.
        if tag == "script":
            for name, value in attrs:
                if name == "src":
                    self.javascriptmatches.append(value)

if __name__ == "__main__":
    sitemap = requests.get("https://youtube.com").text
    parser = MyHTMLParser()
    parser.feed(sitemap)
    print(parser.javascriptmatches)
This code parses the page as HTML rather than treating it as a plain string, and it only appends to the match list when a <script> tag has a src attribute.
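A quick way to check the parser approach without hitting the network is to feed it a literal string. This is a small sketch under assumptions: the class name ScriptSrcParser, the sources attribute, and the example.com sample markup are all made up for illustration.

```python
from html.parser import HTMLParser

class ScriptSrcParser(HTMLParser):
    """Hypothetical helper: collects the src of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        # attrs is a list of (name, value) tuples.
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)

# Made-up sample: an iframe with a src, one external script, one inline script.
sample = (
    '<iframe src="https://example.com/login?x=1"></iframe>'
    '<script src="https://example.com/app.js?v=2"></script>'
    '<script>var inline = 1;</script>'
)

parser = ScriptSrcParser()
parser.feed(sample)
print(parser.sources)
```

Only the external script URL is collected; the iframe src and the inline script are skipped because the check is on the tag name, not on the text around it.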