# -*- coding: UTF-8 -*-
import urllib.request
import re
import os
os.system("cls")
url=input("Url Link : ")
if(url[0:8]=="https://"):
    url=url[:4]+url[5:]
if(url[0:7]!="http://"):
    url="http://"+url
value=urllib.request.urlopen(url).read().decode('UTF8')
par='<title>(.+?)</title>'
result=re.findall(par,value)
print(result)
This is a program that parses the title of a web page. It works well on sites like Google and Gmail, but when I try to parse my school's website, an error occurs. Is the problem with the school's site, or with my code?
Using Python Requests (http://docs.python-requests.org/en/latest/), I was able to download http://jakjeon.icems.kr/main.do without error, although some of the text was garbled because I could not install the Korean code page (949) on Windows.
Because redirecting the output to a file caused an encoding error, I changed the script to write its output directly to a UTF-8 encoded file. Here is the new script:
import requests
url='http://jakjeon.icems.kr/main.do'
r = requests.get(url)
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
# Write the decoded page text as UTF-8; the with block closes the file automatically.
with open('URLDownloadDemo.output.txt', mode='wt', encoding='UTF-8') as fout:
    fout.write(r.text)
The script ran without errors, and the output file contained Korean (Hangul) characters identical to those in the source of the web page.
The new script is available at https://raw.githubusercontent.com/zalacer/projects-tn/master/URLDownloadDemo/URLDownloadDemo2.py and its output file is at https://raw.githubusercontent.com/zalacer/projects-tn/master/URLDownloadDemo/URLDownloadDemo.output.txt.
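To tie this back to the original title-parsing question, here is a minimal sketch using only the standard library. It assumes the error came from decoding a non-UTF-8 page as UTF-8, which I have not confirmed for the school site. The names `decode_html`, `extract_title`, and `FALLBACK_ENCODINGS`, as well as the fallback order, are my own hypothetical choices and not part of either script above.

```python
import re

# Hypothetical fallback order (an assumption, not taken from the original scripts).
# cp949 is the Windows Korean code page and also decodes EUC-KR text.
FALLBACK_ENCODINGS = ("utf-8", "cp949")

# DOTALL lets the title span several lines; IGNORECASE accepts <TITLE>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def decode_html(raw, declared=None):
    """Decode raw page bytes: try the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong charset or unknown codec name: try the next candidate.
            continue
    # Last resort: do not raise, substitute undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")


def extract_title(html):
    """Return the first <title> text with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return " ".join(match.group(1).split())
```

With the original urllib code, the declared charset could be taken from the response before decoding, for example resp = urllib.request.urlopen(url) followed by extract_title(decode_html(resp.read(), resp.headers.get_content_charset())). A regular expression is only a rough way to find the title; a real HTML parser would be more robust.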