I need to convert relative URLs from an HTML page to absolute ones. I'm using pyquery for parsing.
For instance, this page http://govp.info/o-gorode/gorozhane has relative URLs in the source code, like
<a href="o-gorode/gorozhane?page=2">2</a>
(this is the pagination link at the bottom of the page). I'm trying to use make_links_absolute():
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
page.make_links_absolute(page_url)
but it seems that this breaks the relative links:
print(page.find('a[href*="?page=2"]').attr['href'])
# prints http://govp.info/o-gorode/o-gorode/gorozhane?page=2
# expected value http://govp.info/o-gorode/gorozhane?page=2
As you can see, o-gorode is doubled in the middle of the final URL, which will definitely produce a 404 error.
Internally, pyquery uses urljoin from the standard urllib.parse module, somewhat like this:
from urllib.parse import urljoin
urljoin('http://example.com/one/', 'two')
# -> 'http://example.com/one/two'
That's fine, but a lot of sites have, hmm, unusual relative links that contain the full path. In that case urljoin gives us an invalid absolute link:
urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'
I believe such relative links are not really valid, but Google Chrome has no problem dealing with them, so I guess this is fairly normal across the web.
Is there any advice on how to solve this problem? I tried furl, but it does the same join.
In this particular case, the page in question contains
<base href="http://govp.info/"/>
which instructs the browser to use this URL when resolving relative links. The <base> element is optional, but if it is present, you must use it instead of the page's actual URL.
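To see why the base matters, here is a minimal sketch using only urljoin from the standard library. The variable names are mine, chosen for illustration; the URLs are the ones from the question.

```python
from urllib.parse import urljoin

page_url = 'http://govp.info/o-gorode/gorozhane'
base_href = 'http://govp.info/'            # value of <base href> on the page
link = 'o-gorode/gorozhane?page=2'         # relative link from the page

# Resolving against the page URL: the last path segment ("gorozhane") is
# dropped and the link is appended to "/o-gorode/", doubling the directory.
against_page = urljoin(page_url, link)

# Resolving against the <base> URL, as a browser does for this page.
against_base = urljoin(base_href, link)

print(against_page)
print(against_base)
```

The first result is the broken URL from the question; the second is the one the browser actually follows.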
To do as the browser does, extract the base href and pass it to make_links_absolute():
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
base = page.find('base').attr['href']
if base is None:
    base = page_url  # the page's own URL is the fallback
page.make_links_absolute(base)
for a in page.find('a'):
    if 'href' in a.attrib and 'govp.info' in a.attrib['href']:
        print(a.attrib['href'])
prints
http://govp.info/assets/images/map.png
http://govp.info/podpiska.html
http://govp.info/
http://govp.info/#order
...
http://govp.info/o-gorode/gorozhane
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=3
http://govp.info/o-gorode/gorozhane?page=4
http://govp.info/o-gorode/gorozhane?page=5
http://govp.info/o-gorode/gorozhane?page=6
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=17
http://govp.info/bannerclick/264
...
http://doska.govp.info/cat-biznes-uslugi/
http://doska.govp.info/cat-transport/legkovye-avtomobili/
http://doska.govp.info/
http://govp.info/
which seems to be correct.
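One case the snippet above does not cover: on other sites the base href may itself be relative (for example "/"), in which case it has to be resolved against the page URL first. A small sketch of that step follows; effective_base is a hypothetical helper name of mine, not part of pyquery.

```python
from urllib.parse import urljoin


def effective_base(page_url, base_href):
    """Return the URL that relative links should be resolved against.

    page_url  -- the URL the document was fetched from
    base_href -- the href of the <base> element, or None if there is none
    """
    if not base_href:
        # No <base> element (or an empty href): fall back to the page URL.
        return page_url
    # A relative base href is resolved against the page URL;
    # an absolute one is returned unchanged by urljoin.
    return urljoin(page_url, base_href)


page_url = 'http://govp.info/o-gorode/gorozhane'
print(effective_base(page_url, 'http://govp.info/'))
print(effective_base(page_url, '/'))
print(effective_base(page_url, None))
```

With pyquery, the result could then be passed on as page.make_links_absolute(effective_base(page_url, page.find('base').attr['href'])).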