python, url, web-scraping, beautifulsoup, web-crawler

How to extract all URLs from a website using BeautifulSoup


I'm working on a project that requires extracting all the links from a website. With the following code I can get all of the links from a single URL:

import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
links = []

for link in soup.find_all('a', href=True):  # collect the href of every anchor tag
    links.append(str(link.get('href')))

The problem is that to follow the URLs I extract, I have to write another for loop, and then another one, and so on. I want to extract every URL that exists on this website and on its subdomains. Is there any way to do this without writing nested for loops? And even with nested loops, I don't know how many levels I would need to reach all the URLs.
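The usual way to avoid unbounded nesting is a breadth-first crawl: keep a queue of pages still to visit and a set of URLs already seen, so the loop depth no longer depends on how deep the site goes. Here is a minimal sketch, assuming the crawl is restricted to stackoverflow.com and its subdomains and capped at an illustrative MAX_PAGES so it stays finite:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

MAX_PAGES = 100  # illustrative safety cap so the crawl cannot run forever
start = 'https://stackoverflow.com/'
queue = deque([start])  # pages waiting to be fetched
seen = {start}          # every URL queued so far, so nothing is fetched twice

while queue and len(seen) < MAX_PAGES:
    page = queue.popleft()
    try:
        response = requests.get(page, timeout=10)
    except requests.RequestException:
        continue  # skip pages that fail to load and move on
    soup = BeautifulSoup(response.content, 'lxml')
    for a in soup.find_all('a', href=True):
        url = urljoin(page, a['href'])  # resolve relative links against the page
        host = urlparse(url).hostname or ''
        # keep only the site itself and its subdomains
        if host == 'stackoverflow.com' or host.endswith('.stackoverflow.com'):
            if url not in seen:
                seen.add(url)
                queue.append(url)

for url in sorted(seen):
    print(url)

Because each URL enters the queue at most once, a single while loop replaces any number of nested for loops.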


Solution

  • Wow, it took me about 30 minutes to find a solution. I found a simple and efficient way to do this. As @αԋɱҽԃ-αмєяιcαη mentioned, if your website links to a big site like Google, the crawl won't stop until your memory fills up with data, so there are steps you should consider.

    1. make a while loop that seeks through your website to extract all of the URLs
    2. use exception handling to prevent crashes
    3. remove duplicates and keep only the URLs
    4. set a limit on the number of URLs, e.g. stop when 1000 URLs are found
    5. stop the while loop once that limit is hit, so your PC's memory doesn't fill up

    here is a sample code, and it should work fine; I actually tested it, and it was fun for me:

    import requests
    from bs4 import BeautifulSoup
    import re

    links = []


    def remove_duplicates(candidates):  # keep only strings that look like absolute URLs
        for item in candidates:
            if len(links) > 162:  # set limitation to number of URLs
                return
            match = re.search(r"(?P<url>https?://[^\s]+)", item)
            if match is not None:
                links.append(match.group("url"))


    source_code = requests.get('https://stackoverflow.com/')
    soup = BeautifulSoup(source_code.content, 'lxml')
    remove_duplicates([str(a.get('href')) for a in soup.find_all('a', href=True)])

    index = 0
    while index < len(links) and len(links) <= 162:  # stop once the limit is hit
        try:
            source_code = requests.get(links[index])
            soup = BeautifulSoup(source_code.content, 'lxml')
            remove_duplicates([str(a.get('href')) for a in soup.find_all('a', href=True)])
        except Exception as e:
            print(e)  # keep crawling even if one page fails
        index += 1

    for url in links:
        print(url)

    

    and the output will be:

    https://stackoverflow.com
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
    https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
    https://stackoverflow.com
    https://stackoverflow.com
    https://stackoverflow.com/help
    https://chat.stackoverflow.com
    https://meta.stackoverflow.com
    https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
    https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
    https://stackexchange.com/sites
    https://stackoverflow.blog
    https://stackoverflow.com/legal/cookie-policy
    https://stackoverflow.com/legal/privacy-policy
    https://stackoverflow.com/legal/terms-of-service/public
    https://stackoverflow.com/teams
    https://stackoverflow.com/teams
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    https://www.g2.com/products/stack-overflow-for-teams/
    https://www.g2.com/products/stack-overflow-for-teams/
    https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    https://stackoverflow.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
    https://insights.stackoverflow.com/
    https://stackoverflow.com
    https://stackoverflow.com
    https://stackoverflow.com/jobs
    https://stackoverflow.com/jobs/directory/developer-jobs
    https://stackoverflow.com/jobs/salary
    https://www.stackoverflowbusiness.com
    https://stackoverflow.com/teams
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    https://stackoverflow.com/enterprise
    https://stackoverflow.com/company/about
    https://stackoverflow.com/company/about
    https://stackoverflow.com/company/press
    https://stackoverflow.com/company/work-here
    https://stackoverflow.com/legal
    https://stackoverflow.com/legal/privacy-policy
    https://stackoverflow.com/company/contact
    https://stackexchange.com
    https://stackoverflow.com
    https://serverfault.com
    https://superuser.com
    https://webapps.stackexchange.com
    https://askubuntu.com
    https://webmasters.stackexchange.com
    https://gamedev.stackexchange.com
    https://tex.stackexchange.com
    https://softwareengineering.stackexchange.com
    https://unix.stackexchange.com
    https://apple.stackexchange.com
    https://wordpress.stackexchange.com
    https://gis.stackexchange.com
    https://electronics.stackexchange.com
    https://android.stackexchange.com
    https://security.stackexchange.com
    https://dba.stackexchange.com
    https://drupal.stackexchange.com
    https://sharepoint.stackexchange.com
    https://ux.stackexchange.com
    https://mathematica.stackexchange.com
    https://salesforce.stackexchange.com
    https://expressionengine.stackexchange.com
    https://pt.stackoverflow.com
    https://blender.stackexchange.com
    https://networkengineering.stackexchange.com
    https://crypto.stackexchange.com
    https://codereview.stackexchange.com
    https://magento.stackexchange.com
    https://softwarerecs.stackexchange.com
    https://dsp.stackexchange.com
    https://emacs.stackexchange.com
    https://raspberrypi.stackexchange.com
    https://ru.stackoverflow.com
    https://codegolf.stackexchange.com
    https://es.stackoverflow.com
    https://ethereum.stackexchange.com
    https://datascience.stackexchange.com
    https://arduino.stackexchange.com
    https://bitcoin.stackexchange.com
    https://sqa.stackexchange.com
    https://sound.stackexchange.com
    https://windowsphone.stackexchange.com
    https://stackexchange.com/sites#technology
    https://photo.stackexchange.com
    https://scifi.stackexchange.com
    https://graphicdesign.stackexchange.com
    https://movies.stackexchange.com
    https://music.stackexchange.com
    https://worldbuilding.stackexchange.com
    https://video.stackexchange.com
    https://cooking.stackexchange.com
    https://diy.stackexchange.com
    https://money.stackexchange.com
    https://academia.stackexchange.com
    https://law.stackexchange.com
    https://fitness.stackexchange.com
    https://gardening.stackexchange.com
    https://parenting.stackexchange.com
    https://stackexchange.com/sites#lifearts
    https://english.stackexchange.com
    https://skeptics.stackexchange.com
    https://judaism.stackexchange.com
    https://travel.stackexchange.com
    https://christianity.stackexchange.com
    https://ell.stackexchange.com
    https://japanese.stackexchange.com
    https://chinese.stackexchange.com
    https://french.stackexchange.com
    https://german.stackexchange.com
    https://hermeneutics.stackexchange.com
    https://history.stackexchange.com
    https://spanish.stackexchange.com
    https://islam.stackexchange.com
    https://rus.stackexchange.com
    https://russian.stackexchange.com
    https://gaming.stackexchange.com
    https://bicycles.stackexchange.com
    https://rpg.stackexchange.com
    https://anime.stackexchange.com
    https://puzzling.stackexchange.com
    https://mechanics.stackexchange.com
    https://boardgames.stackexchange.com
    https://bricks.stackexchange.com
    https://homebrew.stackexchange.com
    https://martialarts.stackexchange.com
    https://outdoors.stackexchange.com
    https://poker.stackexchange.com
    https://chess.stackexchange.com
    https://sports.stackexchange.com
    https://stackexchange.com/sites#culturerecreation
    https://mathoverflow.net
    https://math.stackexchange.com
    https://stats.stackexchange.com
    https://cstheory.stackexchange.com
    https://physics.stackexchange.com
    https://chemistry.stackexchange.com
    https://biology.stackexchange.com
    https://cs.stackexchange.com
    https://philosophy.stackexchange.com
    https://linguistics.stackexchange.com
    https://psychology.stackexchange.com
    https://scicomp.stackexchange.com
    https://stackexchange.com/sites#science
    https://meta.stackexchange.com
    https://stackapps.com
    https://api.stackexchange.com
    https://data.stackexchange.com
    https://stackoverflow.blog?blb=1
    https://www.facebook.com/officialstackoverflow/
    https://twitter.com/stackoverflow
    https://linkedin.com/company/stack-overflow
    https://creativecommons.org/licenses/by-sa/4.0/
    https://stackoverflow.blog/2009/06/25/attribution-required/
    https://stackoverflow.com
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    
    Process finished with exit code 0
    

    I set the limit to 162; you can increase it as much as you want and as much as your RAM allows.
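
    One caveat: despite its name, remove_duplicates only filters out non-URL strings; it never drops repeats, which is why the output above lists some URLs several times. Collecting the results in a set instead gives real de-duplication for free; here is a minimal sketch of that variant (the function name and parameters are illustrative):

    import re


    def extract_unique_urls(candidates, found, limit=162):
        # add absolute URLs from candidates to the `found` set, stopping at the limit
        for item in candidates:
            if len(found) >= limit:
                break
            match = re.search(r"(?P<url>https?://[^\s]+)", item)
            if match is not None:
                found.add(match.group("url"))  # a set silently drops duplicates
        return found

    Since membership tests on a set are constant time, checking whether a URL was already seen stays cheap even when the limit is raised well past 162.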