I am using crawler4j to crawl user profiles on GitHub. For instance, I want to crawl this URL: https://github.com/search?q=java+location:India&p=1. For now I am adding it as a hard-coded seed in my crawler controller, like this:
String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);
When crawler4j starts, the URL it actually crawls is: https://github.com/search?q=java%2Blocation%3AIndia&p=1
which gives me an error page. What should I do? I have tried passing an already encoded URL, but that doesn't work either.
I eventually had to make a very small change to the crawler4j source code. File name: URLCanonicalizer.java, method: percentEncodeRfc3986.
I just commented out the first line of this method, and after that I was able to crawl and fetch my results:
//string = string.replace("+", "%2B");
My URL contained a + character, which was being replaced by %2B, and that is why I was getting an error page. I wonder why they specifically replace the + character before encoding the entire URL.
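
To see why the rewritten URL behaves differently, here is a minimal sketch that uses only the JDK (no crawler4j code). It decodes both query values with java.net.URLDecoder, which follows the form-encoding convention that a bare + means a space while %2B means a literal plus sign. The class name PlusEncodingDemo and the helper decodeQueryValue are made up for illustration, and it is only an assumption that GitHub's search endpoint decodes the q parameter in this way.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of crawler4j.
public class PlusEncodingDemo {

    // Decodes one query-string value using form-encoding rules:
    // "+" becomes a space, "%XX" becomes the corresponding character.
    // Requires Java 10+ for the Charset overload of URLDecoder.decode.
    static String decodeQueryValue(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The value from the seed URL as I wrote it.
        String original = "java+location:India";
        // The value after "+" was replaced by "%2B" and ":" by "%3A".
        String rewritten = "java%2Blocation%3AIndia";

        // Two search terms separated by a space.
        System.out.println("[" + decodeQueryValue(original) + "]");
        // One term containing a literal plus sign.
        System.out.println("[" + decodeQueryValue(rewritten) + "]");
    }
}
```

Under that assumption, the original URL asks for "java location:India" (two terms), whereas the rewritten one asks for the single term "java+location:India". That would explain why the server treats the two URLs differently, and why encoding the + myself did not help.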