python, parsing

TimeoutError: [WinError 10060] when parsing my school website


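# -*- coding: UTF-8 -*-

import urllib.request
import re
import os

os.system("cls")  # clear the console (Windows only)

url = input("Url Link : ")

# Turn https:// into http:// by dropping the "s"
if url.startswith("https://"):
    url = url[:4] + url[5:]

# Prepend the scheme if it is missing
if not url.startswith("http://"):
    url = "http://" + url

# Download the page and decode it as UTF-8
value = urllib.request.urlopen(url).read().decode('UTF8')
par = '<title>(.+?)</title>'

result = re.findall(par, value)
print(result)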

This is a program that parses a page's title. It works well on sites like Google and Gmail, but when I try to parse my school website, the error above appears. Is the problem on the school's side, or in my code?


Solution

  • Using Python Requests (http://docs.python-requests.org/en/latest/), I was able to download http://jakjeon.icems.kr/main.do without error, although some of the text was garbled because I could not install the Korean code page (949) on Windows.

    Because redirecting the output to a file raised an encoding error, I changed the script to write its output directly to a file with UTF-8 encoding. Here is the new script:

    import requests

    url = 'http://jakjeon.icems.kr/main.do'
    r = requests.get(url)
    print(r.status_code)
    print(r.headers['content-type'])
    print(r.encoding)
    # Write the decoded text as UTF-8; the with block closes the file even on error
    with open('URLDownloadDemo.output.txt', mode='wt', encoding='UTF-8') as fout:
        fout.write(r.text)
    

    This ran without errors, and the output file contained Korean characters identical to those in the source of the web page.

    The new script is available at https://raw.githubusercontent.com/zalacer/projects-tn/master/URLDownloadDemo/URLDownloadDemo2.py and its output file is at https://raw.githubusercontent.com/zalacer/projects-tn/master/URLDownloadDemo/URLDownloadDemo.output.txt.
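
To get back to the original goal of extracting the title, the Requests download can be combined with the title regex. The following is only a minimal sketch, not a tested fix for the timeout: the helper names extract_title and fetch_title are hypothetical, the 10-second timeout is an arbitrary value chosen for illustration, and the encoding fallback assumes the garbled text comes from a response without a declared charset.

```python
import re

# Case-insensitive, and DOTALL so a title that spans several lines still matches.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace and newlines inside the title.
    return " ".join(match.group(1).split())


def fetch_title(url, timeout=10):
    """Download url with Requests and return its title (hypothetical helper).

    The timeout value is an assumption made for this sketch.
    """
    import requests  # third-party; imported here so extract_title works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    # Requests falls back to ISO-8859-1 for text responses without a declared
    # charset; in that case use the encoding guessed from the body instead.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return extract_title(response.text)
```

Calling fetch_title('http://jakjeon.icems.kr/main.do') would then return the page title as a string, or None if the page has no title element. With an explicit timeout, a server that never answers raises a Requests exception after the chosen number of seconds, so the failure can be caught and reported rather than left to the operating system's own connection timeout.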