
Crawler4j authentication not working


I'm trying to use the FormAuthInfo authentication from Crawler4J to crawl a specific LinkedIn page. This page is only rendered when I am logged in.

This is my Controller with the access URLs:

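import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {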

public static void main(String[] args) throws Exception {

    String crawlStorageFolder = "/data/";
    int numberOfCrawlers = 1;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(0);

    String formUsername = "session_key";
    String formPassword = "session_password";
    String session_user = "email@email.com";
    String session_password = "myPasswordHere";
    String urlLogin = "https://www.linkedin.com/uas/login";
    // FormAuthInfo takes (username, password, loginUrl, usernameFormStr, passwordFormStr).
    AuthInfo formAuthInfo = new FormAuthInfo(session_user, session_password, urlLogin, formUsername, formPassword);

    // Register the auth info before the config is handed to the PageFetcher.
    config.addAuthInfo(formAuthInfo);

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("https://www.linkedin.com/vsearch/f?keywords=java");

    controller.start(Crawler.class, numberOfCrawlers);
    controller.shutdown();
}

}

And this is my Crawler class:

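import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {
// Skip static resources and binary files (the duplicated "mp3" entry was presumably meant to be "mp4").
private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");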

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches() && href.startsWith("https://www.linkedin.com");
}

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        System.out.println(html);
        Set<WebURL> links = htmlParseData.getOutgoingUrls();

        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}

}

When I run this app with the authentication configured, I get these warnings:

WARNING: Cookie rejected [JSESSIONID="ajax:3637761943332982524", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603959:t=1466690359:s=AQEc3R_6kIhooZN1RsDNkO2DaYEqzUWp"; Expires=Thu, 23 Jun 2016 13:59:19 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:19 GMT
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Cookie rejected [JSESSIONID="ajax:4912042947175739413", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT
Jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

WARNING: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT

Is this related to the way my HTTP client deals with the cookies returned by LinkedIn?

Any suggestions? Thanks!


Solution

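  • First of all: this is not a problem with crawler4j. It is a problem on LinkedIn's side. The warnings are logged by Apache HttpClient (org.apache.http.client.protocol.ResponseProcessCookies), which rejects cookies that do not conform to the cookie specification it enforces, and judging by recent search results, LinkedIn has left this unfixed for a long time.

    However, your approach will not work anyway, because crawler4j respects crawler ethics: it honors robots.txt. If you look at LinkedIn's robots.txt, you will see that the crawler is not allowed to crawl anything.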
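
    To illustrate the robots.txt point, here is a minimal sketch in plain Java of how a blanket "Disallow: /" rule excludes every path. The robots.txt content below is a hypothetical sample, not a copy of LinkedIn's file, and the class RobotsSketch with its prefix matching is a simplification for illustration, not crawler4j's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsSketch {

    // Hypothetical robots.txt used only for illustration; NOT LinkedIn's real file.
    static final String SAMPLE_ROBOTS_TXT =
            "User-agent: *\n" +
            "Disallow: /\n";

    /** Collects the Disallow prefixes that belong to the "User-agent: *" group. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\n")) {
            String line = rawLine.trim();
            int colon = line.indexOf(':');
            // Skip blank lines, comments, and lines without a "field: value" shape.
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed only if no Disallow prefix matches its beginning. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> rules = disallowedPrefixes(SAMPLE_ROBOTS_TXT);
        System.out.println(isAllowed("/vsearch/f", rules)); // prints false
        System.out.println(isAllowed("/uas/login", rules)); // prints false
    }
}
```

    With a rule like this in place, a crawler that honors robots.txt skips the seed URL regardless of whether the form login succeeded.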