python html extract

Regex returns more data than it should


Please read before you comment. I'm trying to make a simple website scraper, but I ran into this problem: it scrapes a page for script URLs, and it is apparently capturing:

SCRIPT URL : https://accounts.google.com/ServiceLogin?service=youtube&uilel=3&passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26hl%3Den-GB%26next%3D%252Fsignin_passive%26feature%3Dpassive&hl=en-GB" style="display: none"><input id="search"... etc

Using this script:

import re
import requests
sitemap = requests.get("https://youtube.com").text
javascriptmatches = re.findall(r'src="(https?://.*?\.js)\??.*?"', sitemap)
for x in javascriptmatches:
    print(x)

Update: I'm using the built-in "re" library.


Solution

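  • The regex is not ignoring the "src=\"" part. The problem is that "." matches any character, including a double quote, so the lazy ".*?" does not stop at the end of the src attribute. The script URL you mention does not end with .js itself; the match begins right after the end of the actual 9th match and keeps growing for about 20k characters until the next ".js" appears in the page. Replacing ".*?" with [^"]*? would keep the match inside a single attribute value.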

    You can use this instead to find the src of <script> tags in an HTML page:

    import requests
    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def __init__(self):
            super().__init__()
            # Per-instance list, so separate parsers do not share results.
            self.javascriptmatches = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) tuples; tag names arrive lowercased.
            if tag == "script":
                for name, value in attrs:
                    if name == "src" and value:
                        self.javascriptmatches.append(value)

    if __name__ == "__main__":
        sitemap = requests.get("https://youtube.com").text
        parser = MyHTMLParser()
        parser.feed(sitemap)
        print(parser.javascriptmatches)
    

    This code parses the page as HTML instead of treating it as a plain string, and appends to the match list only when a <script> tag has a src attribute.
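
If you would rather stay with a regex, here is a minimal sketch of why the original pattern over-matches and one possible adjustment. The sample HTML, the example.com URLs, and the names SAMPLE_HTML, ACROSS_QUOTES and WITHIN_ONE_ATTRIBUTE are made up for illustration; they are not taken from the YouTube page.

```python
import re

# Hypothetical sample: the first src does not end in .js, the second one does.
SAMPLE_HTML = (
    '<iframe src="https://example.com/login?next=home" style="display: none"></iframe>'
    '<input id="search">'
    '<script src="https://example.com/static/app.js?v=3"></script>'
)

# Pattern from the question: "." also matches a double quote, so the lazy
# group runs past the end of the first attribute until it reaches ".js".
ACROSS_QUOTES = re.compile(r'src="(https?://.*?\.js)\??.*?"')

# Adjusted pattern: [^"] cannot cross the closing quote, so a src that does
# not end in .js (optionally followed by a query string) is simply skipped.
WITHIN_ONE_ATTRIBUTE = re.compile(r'src="(https?://[^"]*?\.js)(?:\?[^"]*)?"')

original_matches = ACROSS_QUOTES.findall(SAMPLE_HTML)
fixed_matches = WITHIN_ONE_ATTRIBUTE.findall(SAMPLE_HTML)

print(original_matches)  # one long match that swallows the <input> markup
print(fixed_matches)     # only the real script URL
```

This adjusted pattern still assumes double-quoted attributes and absolute http(s) URLs, so the HTMLParser approach above remains the more robust option for real pages.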