spring asynchronous crawler4j

How to send crawler4j data to CrawlerManager?


I'm working on a project where a user can search websites for pictures that carry a unique identifier.

public class ImageCrawler extends WebCrawler {

private static final Pattern filters = Pattern.compile(
        ".*(\\.(css|js|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
                "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

private static final Pattern imgPatterns = Pattern.compile(".*(\\.(bmp|gif|jpe?g|png|tiff?))$");

public ImageCrawler() {
}

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }

    if (imgPatterns.matcher(href).matches()) {
        return true;
    }

    return false;
}

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();

    byte[] imageBytes = page.getContentData();
    String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
    try {
        SecurityContextHolder.getContext().setAuthentication(new UsernamePasswordAuthenticationToken(urlScan.getOwner(), null));
        DecodePictureResponse decodePictureResponse = decodePictureService.decodePicture(imageBase64);
        URLScanResult urlScanResult = new URLScanResult();
        urlScanResult.setPicture(pictureRepository.findByUuid(decodePictureResponse.getPictureDTO().getUuid()).get());
        urlScanResult.setIntegrity(decodePictureResponse.isIntegrity());
        urlScanResult.setPictureUrl(url);
        urlScanResult.setUrlScan(urlScan);
        urlScan.getResults().add(urlScanResult);
        urlScanRepository.save(urlScan);
    } catch (ResourceNotFoundException ex) {
        // Picture is not in our database
    }
}
}

The crawlers run independently. The ImageCrawlerManager class, which is a singleton, runs the crawlers.

public class ImageCrawlerManager {

private static ImageCrawlerManager instance = null;


private ImageCrawlerManager(){
}

public synchronized static ImageCrawlerManager getInstance()
{
    if (instance == null)
    {
        instance = new ImageCrawlerManager();
    }
    return instance;
}

@Transactional(propagation=Propagation.REQUIRED)
@PersistenceContext(type = PersistenceContextType.EXTENDED)
public void startCrawler(URLScan urlScan, DecodePictureService decodePictureService, URLScanRepository urlScanRepository, PictureRepository pictureRepository){

    try {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp");
        config.setIncludeBinaryContentInCrawling(true);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed(urlScan.getUrl());

        controller.start(ImageCrawler.class, 1);
        urlScan.setStatus(URLScanStatus.FINISHED);
        urlScanRepository.save(urlScan);
    } catch (Exception e) {
        e.printStackTrace();
        urlScan.setStatus(URLScanStatus.FAILED);
        urlScan.setFailedReason(e.getMessage());
        urlScanRepository.save(urlScan);
    }
}
}

How can I send the data of every image to the manager, so that it can decode this image, get the initiator of search and save results to database? With the code above I can run multiple crawlers and save their results to the database. Unfortunately, when I run two crawlers simultaneously, both search results are stored, but all of them are attached to the crawler that was started first.


Solution

  • You should inject your database service into your WebCrawler instances rather than use a singleton to manage the results of your web crawl.

    crawler4j supports a custom CrawlController.WebCrawlerFactory (see the crawler4j documentation for reference), which can be used with Spring to inject your database service into an ImageCrawler instance.

    Every crawler thread should be responsible for the whole process you described (e.g. by calling dedicated services for each step):

    decode this image, get the initiator of search and save results to database

    With this setup, your database is the only source of truth, and you do not have to synchronize crawler state between different instances or user sessions.
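To illustrate the factory idea, here is a minimal sketch that runs on the JDK alone. It is not the asker's code and does not use crawler4j: CrawlerFactory is a local stand-in that mirrors the shape of CrawlController.WebCrawlerFactory, ResultStore is a hypothetical placeholder for the repositories, and the owner names and URLs are invented. The point it shows is that each scan gets its own factory, so every crawler instance carries its own scan context rather than reading it from shared state.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class CrawlerFactoryDemo {

    /** Local stand-in mirroring crawler4j's CrawlController.WebCrawlerFactory<T>. */
    interface CrawlerFactory<T> {
        T newInstance() throws Exception;
    }

    /** Hypothetical placeholder for the database layer (the repositories in the question). */
    static class ResultStore {
        private final Map<String, List<String>> resultsByOwner = new ConcurrentHashMap<>();

        void save(String owner, String pictureUrl) {
            resultsByOwner
                    .computeIfAbsent(owner, k -> new CopyOnWriteArrayList<>())
                    .add(pictureUrl);
        }

        List<String> findByOwner(String owner) {
            List<String> found = resultsByOwner.get(owner);
            return found != null ? found : Collections.<String>emptyList();
        }
    }

    /** Simplified crawler: its scan context arrives through the constructor. */
    static class ImageCrawler {
        private final String scanOwner;
        private final ResultStore store;

        ImageCrawler(String scanOwner, ResultStore store) {
            this.scanOwner = scanOwner;
            this.store = store;
        }

        /** In the real crawler this is where decoding and persisting would happen. */
        void visit(String pictureUrl) {
            store.save(scanOwner, pictureUrl);
        }
    }

    /** One factory per scan: every crawler it creates is bound to that scan only. */
    static class ImageCrawlerFactory implements CrawlerFactory<ImageCrawler> {
        private final String scanOwner;
        private final ResultStore store;

        ImageCrawlerFactory(String scanOwner, ResultStore store) {
            this.scanOwner = scanOwner;
            this.store = store;
        }

        @Override
        public ImageCrawler newInstance() {
            return new ImageCrawler(scanOwner, store);
        }
    }

    /** Runs two scans concurrently, each with its own factory, against one shared store. */
    static ResultStore runTwoScans() throws InterruptedException {
        ResultStore store = new ResultStore();
        Thread first = crawl(new ImageCrawlerFactory("alice", store), "http://a.example/1.png");
        Thread second = crawl(new ImageCrawlerFactory("bob", store), "http://b.example/2.png");
        first.start();
        second.start();
        first.join();
        second.join();
        return store;
    }

    private static Thread crawl(CrawlerFactory<ImageCrawler> factory, String url) {
        return new Thread(() -> {
            try {
                factory.newInstance().visit(url);
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ResultStore store = runTwoScans();
        System.out.println("alice: " + store.findByOwner("alice"));
        System.out.println("bob: " + store.findByOwner("bob"));
    }
}
```

With crawler4j itself, the factory would implement CrawlController.WebCrawlerFactory<ImageCrawler> and be handed to controller.start(factory, 1) in place of ImageCrawler.class. As far as I know that overload is available in recent crawler4j releases, but check the version you are using. The factory's constructor arguments (the URLScan and the Spring-managed services) would then be supplied by whichever Spring bean starts the crawl.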