I want to crawl the content of the webpage http://www.pgmfi.org/. If you visit the page, you will see that it redirects to http://twiki.pgmfi.org/bin/view.
When I tried to crawl the content from the URL (http://www.pgmfi.org/) using jsoup or crawler4j, I got the following content:
Looking for PGMFI.ORG Home ? Please wait redirecting to: http://twiki.pgmfi.org
But I want to get more information from the redirected webpage (http://twiki.pgmfi.org/bin/view). When I ran a simple check using jsoup, I found the following.
String url = "http://www.pgmfi.org/";
Response response = Jsoup.connect(url).followRedirects(false).execute();
System.out.println(response.statusCode() + " : " + response.url());
// check whether the response is a redirect (Location header present)
System.out.println("Is URL going to redirect : " + response.hasHeader("location"));
System.out.println("Target : " + response.header("location"));
Output:
200 : http://www.pgmfi.org/
Is URL going to redirect : false
Target : null
So the redirection is clearly not a straightforward HTTP redirect. My question: is there any way I can get the URL the page redirects to without parsing the HTML body?
I would prefer a solution using crawler4j, but a solution in jsoup is also fine for me.
crawler4j does not support extracting URLs from meta-refresh tags. However, crawler4j does provide the respective meta tags (see HtmlParseData), so you could enhance the visit(...) method to add the extracted URL to the Frontier object in WebCrawler via schedule(...).
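Extracting the target yourself is mostly a matter of pulling the URL out of the meta-refresh `content` attribute, which typically looks like `content="0; URL=http://twiki.pgmfi.org"`. A minimal, dependency-free sketch of that parsing step (the regex and method name are my own, not part of jsoup or crawler4j):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {
    // A meta-refresh tag usually looks like:
    //   <meta http-equiv="refresh" content="0; URL=http://twiki.pgmfi.org/bin/view">
    // The content attribute carries "<delay>; URL=<target>".
    private static final Pattern REFRESH =
        Pattern.compile("(?i)\\d+\\s*;\\s*url\\s*=\\s*(\\S+)");

    /** Extracts the target URL from a meta-refresh content value, or null if absent. */
    public static String redirectTarget(String content) {
        Matcher m = REFRESH.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(redirectTarget("0; URL=http://twiki.pgmfi.org/bin/view"));
    }
}
```

With jsoup you would feed this the `content` attribute of `doc.selectFirst("meta[http-equiv=refresh]")`; with crawler4j, the equivalent value from the parsed meta tags.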
However, Frontier has only private access in WebCrawler and is therefore not available to your concrete subclass. To work around this, you would either need to (a) fork crawler4j or (b) use the Reflection API to change the access modifier.
Another way would be to open a feature request on crawler4j's official issue tracker.