java jsoup crawler4j

Get content of a webpage which is redirected to another webpage


I want to crawl the content of the webpage http://www.pgmfi.org/. However, visiting that page in a browser redirects to http://twiki.pgmfi.org/bin/view.

When I tried to crawl the content from the URL (http://www.pgmfi.org/) using jsoup or crawler4j, I got only the following content:

Looking for PGMFI.ORG Home ? Please wait redirecting to: http://twiki.pgmfi.org

But I want to get the content of the page it redirects to (http://twiki.pgmfi.org/bin/view). When I ran a simple test using jsoup, I got the following.

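import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String url = "http://www.pgmfi.org/";

// Disable automatic redirect handling so the raw response can be inspected
Response response = Jsoup.connect(url).followRedirects(false).execute();
System.out.println(response.statusCode() + " : " + response.url());

// An HTTP redirect would carry a 3xx status and a Location header
System.out.println("Is URL going to redirect : " + response.hasHeader("location"));
System.out.println("Target : " + response.header("location"));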

Output:

200 : http://www.pgmfi.org/
Is URL going to redirect : false
Target : null

So the redirect is evidently not a plain HTTP redirect: the server returns status 200 with no Location header. My question: is there any way to get the URL the page redirects to without parsing the HTML body?

I would prefer a solution using crawler4j, but a solution using jsoup is fine too.


Solution

  • crawler4j does not support extracting URLs from meta-refresh tags. However, crawler4j exposes the page's meta tags (see HtmlParseData), so you could extend the visit(...) method to add the extracted URL to the Frontier object in WebCrawler via schedule(...).

    However, the Frontier field is private in WebCrawler and is therefore not accessible from your concrete subclass. To work around this, you would need to either (a) fork crawler4j or (b) use the Reflection API to make the field accessible.

    Another option would be to open an issue on the project's official issue tracker.
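
If the page does use a meta-refresh tag, as the answer above suggests, the redirect target sits in the tag's content attribute (typically of the form "0; url=..."). The following is a minimal sketch of pulling the URL out of such a value using only the Java standard library. The class name MetaRefresh, the method extractTarget, and the sample content string are assumptions made for illustration; the actual markup served by http://www.pgmfi.org/ was not verified. With jsoup, the content value could presumably be obtained via doc.select("meta[http-equiv=refresh]").attr("content") and passed to this helper.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup or crawler4j).
class MetaRefresh {

    // Expected shape: "<delay>; url=<target>", e.g. "0; url=http://example.org/".
    // The "url=" prefix and surrounding quotes are treated as optional.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d+(?:\\.\\d+)?\\s*[;,]\\s*(?:url\\s*=\\s*)?['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the redirect target, or empty if the value only sets a reload delay
    // or does not look like a meta-refresh content value at all.
    static Optional<String> extractTarget(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        return m.matches() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample value; the real page may use a different delay or URL.
        System.out.println(extractTarget("0; url=http://twiki.pgmfi.org"));
    }
}
```

The extracted URL may be relative, so it should be resolved against the page URL before it is fetched or scheduled for crawling.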