java web-scraping maven-repository

Can't access maven repo through URLConnection or HttpsURLConnection (403)


Purpose: just a POC (for now) to automatically and periodically find CVE tags in the Maven repository.

I can access Maven just fine through the browser and mvn, but I am unable to do the same via Java. What am I missing? I've tried URLConnection and HttpsURLConnection, with and without GET, Content-Type, User-Agent, and Accept. It always returns a 403 for every address I try. The same code works fine on other websites like "cve.mitre.org" or "nvd.nist.gov", but fails for "https://mvnrepository.com/artifact/log4j/apache-log4j-extras/1.2.17".

My URL is built dynamically: it starts with "**https://mvnrepository.com/artifact/**", then the group, name, and version are appended, turning it into a valid address like "https://mvnrepository.com/artifact/log4j/apache-log4j-extras/1.2.17"
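
For illustration, the URL construction described above might look roughly like the sketch below. The class name `ArtifactUrl` and the method `build` are hypothetical (they are not from my actual code), and the sketch assumes the coordinates contain only URL-safe characters, so no encoding is applied.

```java
// Hypothetical helper illustrating how the address is assembled.
class ArtifactUrl {
    private static final String BASE = "https://mvnrepository.com/artifact/";

    // Joins group, name, and version onto the fixed base path.
    // Assumption: the parts contain only URL-safe characters.
    static String build(String group, String name, String version) {
        for (String part : new String[] {group, name, version}) {
            if (part == null || part.isEmpty()) {
                throw new IllegalArgumentException("group, name, and version are all required");
            }
        }
        return BASE + String.join("/", group, name, version);
    }

    public static void main(String[] args) {
        // Builds the address used as the example in this question
        System.out.println(build("log4j", "apache-log4j-extras", "1.2.17"));
    }
}
```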

    System.setProperty("https.proxyHost", "xxxx");
    System.setProperty("https.proxyPort", "xxxx");

    String content = null;
    try {
        URL obj = new URL(address);
        HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Content-Type", "application/json");
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36");
        con.setRequestProperty("Accept", "*/*");

        con.connect();
        
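        // Read the body on success, the error body otherwise.
        // Note: getErrorStream() may return null if the server sent no error body.
        BufferedReader br;
        if (con.getResponseCode() < 300) {
            br = new BufferedReader(new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
        } else {
            br = new BufferedReader(new InputStreamReader(con.getErrorStream(), StandardCharsets.UTF_8));
        }

        final StringBuilder sb = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
        br.close();
        content = sb.toString();
    } catch (IOException e) {
        e.printStackTrace();
    }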

Solution

  • This website uses Cloudflare anti-bot protection.
    How do you bypass Cloudflare bot protection?
    It depends. Sometimes it is very difficult or even impossible. What you need to do is simulate a real user with a browser.
    With the HtmlUnit browser you can get past it, but only in this case and only with a good IP address. (I used my own IP address and made only one request.)

    You need this Maven dependency:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.57.0</version>
    </dependency>
    

    Here is a Java example:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import java.io.IOException;
    import java.net.URL;
    import java.util.List;
    
    public class Maven {
    
        public static void main(String[] args) throws IOException {
    
            try (final WebClient webClient = new WebClient()) {
                webClient.getOptions().setJavaScriptEnabled(false);
                URL target = new URL("https://mvnrepository.com/artifact/log4j/apache-log4j-extras/1.2.17");
                final HtmlPage page = webClient.getPage(target);
                List<HtmlAnchor> elements = page.getByXPath("//a[contains(@class, 'vuln')]");
                elements.forEach(element -> System.out.println(element.getTextContent()));
            }
        }
    }
    
    

    OUTPUT:

    CVE-2022-23305
    CVE-2022-23302
    CVE-2021-4104
    CVE-2019-17571
    View 1 more ...
    4 vulnerabilities 
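
    The output above also contains non-CVE lines ("View 1 more ..." and "4 vulnerabilities"). If you only want the identifiers, one option is to filter the anchor texts with a regular expression. This is a minimal sketch, not part of the tested example: the class name `CveFilter` and the method `extract` are hypothetical, and the pattern assumes IDs of the form CVE-<4-digit year>-<4 or more digits>.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper: keeps only CVE identifiers from scraped text.
    class CveFilter {
        // Assumed ID shape: CVE-<4-digit year>-<4 or more digits>
        private static final Pattern CVE_ID = Pattern.compile("CVE-\\d{4}-\\d{4,}");

        // Returns the distinct CVE IDs found in the given texts, in order of first appearance.
        static List<String> extract(List<String> texts) {
            List<String> ids = new ArrayList<>();
            for (String text : texts) {
                Matcher m = CVE_ID.matcher(text);
                while (m.find()) {
                    if (!ids.contains(m.group())) {
                        ids.add(m.group());
                    }
                }
            }
            return ids;
        }

        public static void main(String[] args) {
            // The lines printed by the HtmlUnit example above
            List<String> anchorTexts = Arrays.asList(
                    "CVE-2022-23305", "CVE-2022-23302", "CVE-2021-4104",
                    "CVE-2019-17571", "View 1 more ...", "4 vulnerabilities");
            System.out.println(extract(anchorTexts));
        }
    }
    ```

    In the HtmlUnit example you would pass the text of each matched anchor (element.getTextContent()) into the list instead of the hard-coded strings.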
    

    I hope I have been able to help you.