
How to get the Google search page HTML code using Python?


I am trying to extract the HTML of a Google search results page in Python, using the requests module.

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
search = soup.find_all('div',class_="yuRUbf")
print(search)

But I can't find any element with class_="yuRUbf" in the output. I think the response is not the same source code I see in the browser. How can I get it?

I also tried resp.content, but it didn't work. I also tried Selenium, but that didn't work either.


Solution

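  • You can use the SelectorGadget Chrome extension to get CSS selectors by clicking on the desired element in your browser (it does not always work well if the website is rendered via JavaScript).

    Keep in mind that class names such as yuRUbf depend on the markup Google returns, which can differ when the request does not carry a browser-like User-Agent header; the full example in this answer sends one.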

    To collect results from all pages, you can use non-token pagination with a while True loop. The loop keeps running as long as the page contains a button that switches to the next page, matched by the CSS selector ".d6cvqb a[id=pnnext]"; when the button is absent, the loop exits:

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
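
    To see what incrementing params["start"] does to the request, here is a minimal sketch that builds the URL for a given page with the standard library only. The helper name build_search_url and the constant RESULTS_PER_PAGE are illustrative assumptions, not part of the original code; the step of 10 mirrors the += 10 above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"
RESULTS_PER_PAGE = 10  # assumption: matches the += 10 step used for pagination


def build_search_url(query, page_num):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    params = {
        "q": query,
        "hl": "en",
        "gl": "uk",
        "start": (page_num - 1) * RESULTS_PER_PAGE,  # page 1 -> 0, page 2 -> 10, ...
    }
    return f"{BASE_URL}?{urlencode(params)}"


first_page = build_search_url("python requests", 1)
third_page = build_search_url("python requests", 3)
print(first_page)
print(third_page)
```

    requests performs the same encoding when you pass params=..., so this is only meant to make the offset arithmetic visible.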
    

    You can also exit the loop by limiting the number of search pages:

    if page_num == page_limit:
        break
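
    The two exit conditions can be sketched together without any network access. In the sketch below, collect_pages, fetch_page and has_next_page are hypothetical names, and FAKE_PAGES is stand-in data that replaces real HTTP responses.

```python
def collect_pages(fetch_page, has_next_page, page_limit):
    """Fetch pages until the limit is reached or there is no next page."""
    pages = []
    start = 0
    page_num = 0
    while True:
        page_num += 1
        page = fetch_page(start)
        pages.append(page)
        if page_num == page_limit:   # exit 1: page limit reached
            break
        if has_next_page(page):      # continue while a next page exists
            start += 10
        else:                        # exit 2: no next-page button
            break
    return pages


# Stand-in data instead of real responses: three fake pages keyed by offset.
FAKE_PAGES = {
    0: {"results": ["a", "b"], "next": True},
    10: {"results": ["c"], "next": True},
    20: {"results": ["d"], "next": False},
}

all_pages = collect_pages(FAKE_PAGES.__getitem__, lambda page: page["next"], page_limit=5)
limited = collect_pages(FAKE_PAGES.__getitem__, lambda page: page["next"], page_limit=2)
print(len(all_pages), len(limited))  # 3 2
```

    With a limit of 5 the loop stops after the third page because no next page exists; with a limit of 2 it stops on the limit.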
    

    Full code:

    import json

    import requests
    from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package
    
    query = "how to get google search page source code by python"
    
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": query,          # query example
        "hl": "en",          # language
        "gl": "uk",          # country of the search, UK -> United Kingdom
        "start": 0,          # result offset: 0 is the first page, 10 the second, and so on
        #"num": 100          # parameter defines the maximum number of results to return.
    }
    
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    }
    
    page_limit = 5           # page limit
    page_num = 0
    
    data = []
    
    while True:
        page_num += 1
        print(f"page: {page_num}")

        html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, 'lxml')

        for result in soup.select(".tF2Cxc"):
            title = result.select_one(".DKV0Md").text
            # not every result has a snippet, so check before reading .text
            snippet_node = result.select_one(".lEBKkf span")
            snippet = snippet_node.text if snippet_node is not None else None
            links = result.select_one(".yuRUbf a")["href"]

            data.append({
                "title": title,
                "snippet": snippet,
                "links": links
            })

        # stop at the page limit or when there is no next-page button
        if page_num == page_limit:
            break
        if soup.select_one(".d6cvqb a[id=pnnext]"):
            params["start"] += 10
        else:
            break

    print(json.dumps(data, indent=2, ensure_ascii=False))
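
    To try the selectors without sending a request, you can run the extraction step against a saved or hand-written HTML string. The sketch below assumes bs4 is installed and uses the built-in html.parser; SAMPLE_HTML is hypothetical markup that only mimics the class names used above (real Google markup is larger and changes over time), and text_or_none and parse_results are illustrative helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only reuses the class names from the code above.
SAMPLE_HTML = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/a"><h3 class="DKV0Md">First title</h3></a></div>
  <div class="lEBKkf"><span>First snippet</span></div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/b"><h3 class="DKV0Md">Second title</h3></a></div>
</div>
"""


def text_or_none(node):
    """Return the stripped text of a node, or None if the selector matched nothing."""
    return node.get_text(strip=True) if node is not None else None


def parse_results(html):
    """Extract title, snippet and link from each result container."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    for result in soup.select(".tF2Cxc"):
        link_node = result.select_one(".yuRUbf a")
        parsed.append({
            "title": text_or_none(result.select_one(".DKV0Md")),
            "snippet": text_or_none(result.select_one(".lEBKkf span")),
            "links": link_node["href"] if link_node is not None else None,
        })
    return parsed


results = parse_results(SAMPLE_HTML)
print(results)
```

    The second result has no snippet element, so its "snippet" value is None instead of raising an AttributeError.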
    

    Example output:

    [
      {
        "title": "How To Build a Website With Python - Digital.com",
        "snippet": "Examples of Sites Created Using Python · Google: The most popular search engine in the world uses Python · Instagram: Python was used to create the backend of ...",
        "links": "https://digital.com/how-to-create-a-website/with-python/"
      },
      {
        "title": "Google Search Operators: 40 Commands to Know in 2023 ...",
        "snippet": "30 Mar 2022 — ",
        "links": "https://kinsta.com/blog/google-search-operators/"
      },
      {
        "title": "Python From Scratch: Create a Dynamic Website - Code",
        "snippet": "19 Feb 2022 — ",
        "links": "https://code.tutsplus.com/articles/python-from-scratch-create-a-dynamic-website--net-22787"
      },
      {
        "title": "How to Use Python to Analyze Google Search Results at Scale",
        "snippet": "21 Dec 2020 — ",
        "links": "https://www.semrush.com/blog/analyzing-search-engine-results-pages/"
      },
      other results ...
    ]
    

    Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so you don't need to create and maintain a parser.

    Code example:

    from serpapi import GoogleSearch
    from urllib.parse import urlsplit, parse_qsl
    import json, os
    
    query = "how to get google search page source code by python"
    params = {
      "api_key": "...",                # serpapi key
      "engine": "google",              # serpapi parser engine
      "q": query,                      # search query
      "num": "100"                     # number of results per page (100 per page in this case)
      # other search parameters: https://serpapi.com/search-api#api-parameters
    }
    
    search = GoogleSearch(params)      # where data extraction happens
    
    organic_results_data = []
    page_num = 0
    
    while True:
        results = search.get_dict()    # JSON -> Python dictionary
        
        page_num += 1
        
        for result in results["organic_results"]:
            organic_results_data.append({
                "title": result.get("title"),
                "snippet": result.get("snippet"),
                "link": result.get("link")
            })
        
        if "next_link" in results.get("serpapi_pagination", []):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
        else:
            break
        
    print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
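
    The pagination line in the loop above is dense, so here is a small standalone sketch of what it does: it takes the query string of next_link and merges it into the current parameters. The next_link value below is a made-up example; a real one comes from results["serpapi_pagination"].

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link value, for illustration only.
next_link = "https://serpapi.com/search.json?engine=google&q=python+requests&start=100&num=100"

# Parameters of the current request (no "start" yet on the first page).
params = {"engine": "google", "q": "python requests", "num": "100"}

# urlsplit(...).query -> "engine=google&q=python+requests&start=100&num=100"
# parse_qsl(...)      -> list of (key, value) pairs with "+" decoded to a space
next_params = dict(parse_qsl(urlsplit(next_link).query))

# Overwrite or add the parameters for the next request.
params.update(next_params)
print(params)
```

    After the update, params carries "start": "100", so the next call requests the following page. Note that parse_qsl returns every value as a string.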
    

    Output:

    [
      {
        "title": "How To Work with Web Data Using Requests and Beautiful ...",
        "snippet": "This tutorial will go over how to work with the Requests and Beautiful Soup Python packages in order to make use of data from web pages.",
        "link": "https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3"
      },
      {
        "title": "google search - Simply Python",
        "snippet": "I have included part of the code for the noun phrase detection (Under pattern_parsing.py). ... Run google search and obtain page source for the images.",
        "link": "https://simply-python.com/tag/google-search/"
      },
      {
        "title": "Web Scraping Using Selenium Python - Analytics Vidhya",
        "snippet": "Step 2 – Install Chrome Driver · Step 2 – Install Chrome Driver · Step 3 – Specify search URL",
        "link": "https://www.analyticsvidhya.com/blog/2020/08/web-scraping-selenium-with-python/"
      },
      other results ...
    ]
    

    If you want to know more about website scraping, there's a blog post called "13 ways to scrape any public data from any website".