Tags: python, web-scraping, beautifulsoup, python-requests, timeoutexception

For loop web scraping a website raises TimeoutError, NewConnectionError and requests.exceptions.ConnectionError


Apologies, I am a beginner at Python and web scraping.

I am web scraping wugniu.com to extract readings for characters that I input. I made a list of 10273 characters to format into the URL, which brings up the page with readings. I then used the Requests module to fetch the page source and Beautiful Soup to collect all the audio element ids, since their strings contain the readings for the input characters (I couldn't use the text shown in the table, because the readings are rendered as SVGs). Finally, I tried to write the characters and their readings to out.txt.

# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

characters = [
#characters go here
]

output = open("out.txt", "a", encoding="utf-8")

tic = time.perf_counter()

for char in characters:
    # Characters from the list are formatted into the url 
    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char

    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.text, 'html.parser')

    for audio_tag in soup.find_all('audio'):
        audio_id = audio_tag.get('id').replace("0-","")
        #output.write(char)
        #output.write("  ")
        #output.write(audio_id)
        #output.write("\n")
        print(char, audio_id)
        time.sleep(60)

output.close()
toc = time.perf_counter()
duration = int(toc) - int(tic)
print("Took %d seconds" % duration)

out.txt is the output file I tried to write the results to. I timed the run with time.perf_counter() to measure performance.

However, after 50 or so loops, I get this in the cmd:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
    raise err
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
    conn = self._new_conn()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module>
    page = requests.get(url, verify=False)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

I tried to fix this by adding time.sleep(60), but the errors still happened. When I ran this script yesterday, I was able to run it with a list of up to 1500 characters with no errors. Could someone please help me with this? Thanks.


Solution

  • That's expected behavior, and it relates to how servers defend themselves against abusive traffic.

    Imagine that you open the Firefox browser, load google.com, close the browser, and then repeat the cycle over and over.

    That pattern looks like a DoS attack: every iteration opens a brand-new connection, so modern servers will block your requests and flag your IP, because it genuinely hurts their bandwidth.

    The logical and right approach is to reuse the same session instead of creating a new connection per request, so your traffic doesn't trip the server's TCP SYN-flood protection (see the list of legal TCP flags).
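As a sketch of the session idea, requests also lets you mount a transport adapter that retries failed connections with a backoff instead of giving up immediately. The retry counts and backoff factor below are illustrative assumptions, not values the site requires:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One Session reuses the underlying TCP connection across requests.
session = requests.Session()

# Retry transient failures up to 3 times with exponential backoff
# (waits of roughly 1s, 2s, 4s) — tune these for your own use.
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))
```

Every `session.get(...)` made through this session now goes through the retrying adapter automatically.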

    On the other side, you should use a context manager instead of having to remember to close your file handles yourself.

    Example:

    output = open("out.txt", "a", encoding="utf-8")
    output.close()
    

    Can be handled with a with statement, as below:

    with open('out.txt', 'w', newline='', encoding='utf-8') as output:
        # here you can do your operation.
    

    and once you leave the with block, your file will be closed automatically!
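A quick way to see this: the file object reports itself closed as soon as the with block ends, even if an exception had been raised inside it (the filename here is just a throwaway example):

```python
with open("demo_out.txt", "w", encoding="utf-8") as f:
    f.write("hello")

# The handle is closed the moment the block exits — no f.close() needed.
print(f.closed)  # True
```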

    Also, consider using the newer string formatting (str.format or f-strings) instead of the old %-style:

    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
    

    Can be:

    "https://wugniu.com/search?char={}&table=wenzhou".format(char)
    
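On Python 3.6+ an f-string is even shorter. One point worth knowing (visible in your traceback, where the character appears as %E8%87%B4): non-ASCII characters get percent-encoded in the URL. requests does this for you, and urllib.parse.quote shows the same encoding explicitly:

```python
from urllib.parse import quote

char = "致"  # the character shown percent-encoded in the traceback
# quote() percent-encodes the character's UTF-8 bytes for use in a URL.
url = f"https://wugniu.com/search?char={quote(char)}&table=wenzhou"
print(url)  # https://wugniu.com/search?char=%E8%87%B4&table=wenzhou
```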

    I've kept the code below simple rather than production-grade so the concept is easy to follow.

    Pay attention to how I pick out the desired elements and how I write them to the file. The speed difference between lxml and html.parser is covered in the Beautiful Soup documentation.

    import requests
    from bs4 import BeautifulSoup
    import urllib3

    urllib3.disable_warnings()


    def main(url, chars):
        # One file handle and one HTTP session for the whole run.
        with open('result.txt', 'w', newline='', encoding='utf-8') as f, requests.Session() as req:
            req.verify = False
            for char in chars:
                print(f"Extracting {char}")
                r = req.get(url.format(char))
                soup = BeautifulSoup(r.text, 'lxml')
                # Audio ids look like "0-<reading>"; strip the "0-" prefix.
                target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
                print(target)
                f.write(f'{char}\n{target}\n')


    if __name__ == "__main__":
        chars = ['核']
        main('https://wugniu.com/search?char={}&table=wenzhou', chars)
    

    Also, following the Python DRY principle, you can set req.verify = False once on the session instead of passing verify=False to every request.

    Next step: take a look at threading or async programming to improve your run time. In real-world projects we don't fetch URLs one by one in a plain for loop (which is really slow) when we can send a batch of requests and wait for the responses.
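A minimal sketch of the threaded version, using the standard-library ThreadPoolExecutor. The worker count and structure are illustrative, and you should still throttle your request rate to avoid the blocking discussed above:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://wugniu.com/search?char={}&table=wenzhou"


def fetch(session, char):
    # Each worker thread reuses the shared session's connection pool.
    r = session.get(URL.format(char))
    return char, r.status_code


def fetch_all(chars, workers=5):
    # One Session shared across the pool; pool.map preserves input order.
    with requests.Session() as session, ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: fetch(session, c), chars))
```

Calling `fetch_all(['核', '致'])` would fetch both pages concurrently instead of one after the other.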