
HTML content is not the same when fetched with the Python requests library


HTML code from the browser: [image]

HTML code from the Python requests library:

<p class="text-muted ">
            <span class="certificate">12</span>
            <span class="ghost">|</span> 
            <span class="runtime">192 min</span>
            <span class="ghost">|</span> 
            <span class="genre">Action, Adventure, Fantasy</span>
    </p>

Code:

import requests

base_url = "https://www.imdb.com"
search_url = base_url + "/search/title/?"
params = {
    "title_type": "feature",
    "release_date": "2022-01-01,2022-12-31",  # Movies released during 2022
    "start": 1  # 1-based index of the first result to return
}

# Send GET request to IMDb search page
# response = urllib.request.urlopen(search_url + urllib.parse.urlencode(params))
response = requests.get(search_url, params=params)
print(response.text)

How can I get the same HTML that the browser shows? I also tried urllib.request, but it did not help.


Solution

  • Set the Accept-Language HTTP header to en-US; the certificate (and title) that IMDb returns depends on the requested language:

    import requests
    from bs4 import BeautifulSoup
    
    base_url = "https://www.imdb.com"
    search_url = base_url + "/search/title/"
    params = {
        "title_type": "feature",
        "release_date": "2022-01-01,2022-12-31",  # Movies released during 2022
        "start": 1,
    }
    
    headers =  {
        'Accept-Language': 'en-US,en;q=0.5'
    }
    
    
    response = requests.get(search_url, params=params, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    for title in soup.select('h3')[:10]:
        name = title.get_text(strip=True, separator=' ')
        certificate = title.find_next(class_='certificate').text
        print(f"{name:<60} {certificate:<10}")
    

    Prints:

    1. Avatar: The Way of Water (2022)                           PG-13     
    2. The Blackening (2022)                                     R         
    3. X (II) (2022)                                             R         
    4. Sisu (2022)                                               R         
    5. A Man Called Otto (2022)                                  PG-13     
    6. Top Gun: Maverick (2022)                                  PG-13     
    7. Chevalier (2022)                                          PG-13     
    8. The Batman (2022)                                         PG-13     
    9. Sanctuary (I) (2022)                                      R         
    10. Everything Everywhere All at Once (2022)                 R         
    

    For example, with a de-DE header I get:

    headers =  {
        'Accept-Language': 'de-DE,de;q=0.5'
    }
    
    ...
    

    Prints:

    1. Avatar: The Way of Water (2022)                           12        
    2. The Blackening (2022)                                     R         
    3. X (II) (2022)                                             16        
    4. Sisu: Rache ist süss (2022)                               18        
    5. Ein Mann namens Otto (2022)                               12        
    6. Top Gun: Maverick (2022)                                  12        
    7. Chevalier (2022)                                          PG-13     
    8. The Batman (2022)                                         12        
    9. Sanctuary (I) (2022)                                      R         
    10. Everything Everywhere All at Once (2022)                 16
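
Since the question also mentions urllib.request, the same header can be sent with the standard library alone. The following is a minimal sketch, not a tested scraper: `build_request` and `SEARCH_URL` are hypothetical names introduced here, and the request is only built, not sent, so it runs without network access.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "https://www.imdb.com/search/title/"  # same endpoint as above


def build_request(language_tag="en-US"):
    """Build (but do not send) a GET request that asks for one language."""
    params = {
        "title_type": "feature",
        "release_date": "2022-01-01,2022-12-31",
        "start": 1,
    }
    # "de-DE" -> "de-DE,de;q=0.5": exact locale first, bare language as fallback.
    base_language = language_tag.split("-")[0]
    headers = {"Accept-Language": f"{language_tag},{base_language};q=0.5"}
    url = SEARCH_URL + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers=headers)


request = build_request("en-US")
print(request.full_url)
# urllib normalizes the header name to "Accept-language" internally.
print(request.header_items())

# Sending the request needs network access, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     html = response.read().decode("utf-8")
```

Passing a different tag, for example `build_request("de-DE")`, should correspond to the second set of results shown above, assuming the server treats both clients the same way.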