java connection jsoup

Unstable download speed (connection) when scraping webpages using Jsoup


I am currently scraping webpages with a program I wrote in Java using Jsoup. The program needs to scrape about 450 URLs, which is not that many.

The problem is that I am getting very unstable download speed when scraping.

For example, the first 7 URLs get scraped within 2 seconds at a download speed of nearly 1MB/s, but then the download speed suddenly drops to 0.4KB/s or even 0KB/s, and the program takes 13 seconds to scrape a single URL. This kind of fluctuation occurs constantly and slows the program down severely.

This is not a problem with my internet connection, because it happens on both my work and home wireless networks. Even when the network speed indicator shows a download speed of 0KB/s, a webpage I open in my browser loads instantly, with the download speed jumping back up to 1MB/s. But this increase has no effect on my program, which is still very slow to scrape the URLs.

What could be the problem? Is there anything I need to configure to ensure a constant download speed for my scraping program?


Solution

  • This is a common problem, strongly tied to the external resources (the URLs you need to download).

    To solve it, you can create a pool of threads that download different resources simultaneously. Downloading in parallel gives you a better chance of reaching your bandwidth limit, even if a single download is very slow.

    The total download time will then decrease to a value closer to what your bandwidth allows.

    Here is a basic minimal example.

    // Listen on portNumber (defined elsewhere) and block until a client connects.
    ServerSocket serverSocket = new ServerSocket(portNumber);
    Socket clientSocket = serverSocket.accept();
    

    Remember to start the server first, and then the client.
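
    The thread-pool idea described above could be sketched roughly as follows. This is only a minimal illustration under some assumptions: the names ParallelFetcher, Fetcher and fetchAll are hypothetical (they are not part of Jsoup), the pool size is an arbitrary starting value, and a stand-in fetcher is used in main so that the sketch runs without network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not part of Jsoup): runs one fetch per URL on a fixed-size thread pool.
class ParallelFetcher {

    // Hypothetical callback type, so the fetch logic (e.g. a Jsoup call) may throw checked exceptions.
    interface Fetcher {
        String fetch(String url) throws Exception;
    }

    static Map<String, String> fetchAll(List<String> urls, Fetcher fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every URL up front; the pool runs at most `threads` fetches at once,
            // so one slow download no longer blocks all the others.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.fetch(url);
                futures.add(pool.submit(task));
            }
            // Collect results in input order; a failed URL is stored as null instead of aborting the run.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), null);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for downloads", e);
                }
            }
            return results;
        } finally {
            // Stop accepting new tasks so the worker threads can exit.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher (an assumption for this sketch) so it runs offline.
        List<String> urls = Arrays.asList("u1", "u2", "u3");
        Map<String, String> pages = fetchAll(urls, url -> "page for " + url, 2);
        pages.forEach((url, body) -> System.out.println(url + " -> " + body));
    }
}
```

    With Jsoup, the fetcher passed to fetchAll could be something like url -> Jsoup.connect(url).timeout(10000).get().title(), where the 10-second timeout is only an example value. For about 450 URLs, a pool of a handful of threads is a reasonable first guess to tune from.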


    Here are some links on thread pools that can serve as a starting point for implementing a thread pool to download many URLs at the same time: