python-3.x, beautifulsoup, urllib, html-parser

urllib.error.HTTPError: HTTP Error 302


I am trying to parse a website with Python 3.6 using the HTML parser, but it throws an error:

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found

The code I wrote is below:

from urllib.request import urlopen as uo
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter--')
html = uo(url, context=ctx).read()

soup = BeautifulSoup(html,"html.parser")

print(soup)
#retrieve all the anchor tags
#tags = soup('a')


Can someone tell me why it is throwing this error, what it means, and how to solve it?


Solution

  • As stated in the comments:

    That site sets a cookie and then redirects to /Home.aspx.

    To avoid the loop of redirects on this site, you must have a 24-character ASP.NET_SessionId cookie set.

    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders.append(('Cookie', 'ASP.NET_SessionId=garbagegarbagegarbagelol'))
    f = opener.open("http://apnakhata.raj.nic.in/")
    html = f.read()
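Rather than hard-coding a cookie value, you can let urllib keep whatever cookie the server sets by wiring an `http.cookiejar.CookieJar` into the opener, so the `Set-Cookie` from the first response is replayed on the redirect. A minimal sketch (the actual network call is left commented out):

```python
import urllib.request
from http.cookiejar import CookieJar

# HTTPCookieProcessor stores cookies from Set-Cookie headers and sends them
# back on later requests made through the same opener, which breaks the
# redirect loop without inventing a cookie value yourself.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# html = opener.open('http://apnakhata.raj.nic.in/').read()  # network call
```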
    

    However, I'd just use requests.

    import requests
    
    r = requests.get('http://apnakhata.raj.nic.in/')
    html = r.text
    

    It saves cookies to a RequestsCookieJar by default. After the initial request, only one redirect happens. You can see it here:

    >>> r.history
    [<Response [302]>]
    
    >>> r.history[0].cookies
    <RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value='ph0chopmjlpi1dg0f3xtbacu', port=None, port_specified=False, domain='apnakhata.raj.nic.in', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>
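If you need the session id itself (say, to hand to another client), you can read it back from the jar. A self-contained sketch that builds a jar locally with a made-up placeholder value rather than hitting the site:

```python
from requests.cookies import RequestsCookieJar

# A locally built jar standing in for r.history[0].cookies; the value is a
# placeholder, not a real ASP.NET session id.
jar = RequestsCookieJar()
jar.set('ASP.NET_SessionId', 'x' * 24,
        domain='apnakhata.raj.nic.in', path='/')

session_id = jar.get('ASP.NET_SessionId')
print(session_id)  # the 24-character placeholder value
```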
    

    To scrape the page, you can use requests_html, created by the same author.

    from requests_html import HTMLSession
    session = HTMLSession()
    r = session.get('http://apnakhata.raj.nic.in/')
    

    Getting links is extremely easy:

    >>> r.html.absolute_links
    {'http://apnakhata.raj.nic.in/',
    'http://apnakhata.raj.nic.in/Cyberlist.aspx',
    ...
    'http://apnakhata.raj.nic.in/rev_phone.aspx'}
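Whichever client you fetch the HTML with, the commented-out anchor retrieval from the original question works as-is with BeautifulSoup. A sketch with the markup inlined so the example is self-contained:

```python
from bs4 import BeautifulSoup

# Inlined sample markup standing in for the fetched page.
html = '<a href="/Home.aspx">Home</a> <a href="/Cyberlist.aspx">List</a>'
soup = BeautifulSoup(html, 'html.parser')

# soup('a') is shorthand for soup.find_all('a')
tags = soup('a')
links = [a.get('href') for a in tags]
print(links)  # ['/Home.aspx', '/Cyberlist.aspx']
```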