This is for pages that require a browser to render. I have a long list of URLs and I don't know which will work with httpProtocol and which require the Selenium protocol. Is there a way to handle this automatically?
Ideally, each URL would first be tried with httpProtocol, since it is faster, and switched to the Selenium protocol only if that URL requires it.
There’s no built-in way to do this directly in Apache StormCrawler. However, you can achieve it by using Metadata together with the DelegatorProtocol implementation.
Using the delegator, you can decide which protocol to use either through regular expressions or via metadata. If you already know which protocol a URL should use, you can inject the appropriate metadata at seed time.
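To make the routing idea concrete, here is a minimal plain-Java sketch of first-match routing by metadata or URL regex. It is only a model of the concept, not StormCrawler's DelegatorProtocol code or its configuration format; `ProtocolRouter`, `Rule`, the protocol labels, and the `app.example.com` pattern are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical model of first-match protocol routing; not StormCrawler code. */
final class ProtocolRouter {

    /** A rule matches when its URL pattern (if any) and all required metadata match. */
    static final class Rule {
        final String protocolName;
        final Pattern urlPattern;                 // null means "no URL filter"
        final Map<String, String> requiredMetadata;

        Rule(String protocolName, Pattern urlPattern, Map<String, String> requiredMetadata) {
            this.protocolName = protocolName;
            this.urlPattern = urlPattern;
            this.requiredMetadata = requiredMetadata;
        }

        boolean matches(String url, Map<String, String> metadata) {
            if (urlPattern != null && !urlPattern.matcher(url).find()) {
                return false;
            }
            for (Map.Entry<String, String> e : requiredMetadata.entrySet()) {
                if (!e.getValue().equals(metadata.get(e.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    ProtocolRouter add(Rule rule) {
        rules.add(rule);
        return this;
    }

    /** Returns the protocol label of the first rule that matches. */
    String select(String url, Map<String, String> metadata) {
        for (Rule rule : rules) {
            if (rule.matches(url, metadata)) {
                return rule.protocolName;
            }
        }
        throw new IllegalStateException("No rule matched " + url);
    }

    /** Example rule set: metadata flag first, then a URL regex, then a catch-all. */
    static ProtocolRouter example() {
        return new ProtocolRouter()
                .add(new Rule("selenium", null, Map.of("needs_headless", "true")))
                .add(new Rule("selenium", Pattern.compile("^https?://app\\.example\\.com/"), Map.of()))
                .add(new Rule("http", null, Map.of()));
    }

    public static void main(String[] args) {
        ProtocolRouter router = example();
        System.out.println(router.select("https://example.org/", Map.of()));
        System.out.println(router.select("https://example.org/", Map.of("needs_headless", "true")));
    }
}
```

In this sketch the catch-all rule comes last so that every URL gets some protocol. A URL whose metadata was injected at seed time is routed to the headless protocol without any fetch-time detection.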
If you need to determine the protocol dynamically, you’ll need to implement some heuristics that inspect the response fetched over plain HTTP, and create a small "shunt" implementation (similar to how StormCrawler handles Tika). The idea is to re-schedule the URL with a metadata flag such as "needs_headless". On the next fetch, the DelegatorProtocol will check that flag and route the request to the Selenium protocol.
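One possible heuristic, sketched in plain Java: treat a page as script-rendered when it contains script tags but very little visible text. The class name `HeadlessHeuristic`, the 200-character threshold, and the use of a plain `Map` in place of StormCrawler's metadata are assumptions made for illustration. A real shunt would live in a bolt and would need tuning against your own URLs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical heuristic for flagging pages that likely need a headless browser. */
final class HeadlessHeuristic {

    static final String FLAG = "needs_headless";      // flag name taken from the answer
    static final int MIN_TEXT_CHARS = 200;            // assumed threshold, tune as needed

    private static final Pattern SCRIPT_BLOCK = Pattern.compile("(?is)<script\\b.*?</script>");
    private static final Pattern STYLE_BLOCK = Pattern.compile("(?is)<style\\b.*?</style>");
    private static final Pattern SCRIPT_OPEN = Pattern.compile("(?i)<script\\b");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Rough count of visible characters once scripts, styles and tags are stripped. */
    static int visibleTextLength(String html) {
        String text = SCRIPT_BLOCK.matcher(html).replaceAll(" ");
        text = STYLE_BLOCK.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim().length();
    }

    /** Number of opening script tags in the document. */
    static int scriptCount(String html) {
        Matcher m = SCRIPT_OPEN.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** True when the page has scripts but almost no visible text. */
    static boolean looksScriptRendered(String html) {
        return scriptCount(html) > 0 && visibleTextLength(html) < MIN_TEXT_CHARS;
    }

    /**
     * Returns a copy of the metadata, with the flag set if the page should be
     * re-scheduled. A URL that already carries the flag is left alone, so a
     * page that stays empty even in a browser is not re-scheduled forever.
     */
    static Map<String, String> markIfNeeded(String html, Map<String, String> metadata) {
        Map<String, String> updated = new HashMap<>(metadata);
        if (!"true".equals(metadata.get(FLAG)) && looksScriptRendered(html)) {
            updated.put(FLAG, "true");
        }
        return updated;
    }

    public static void main(String[] args) {
        String shell = "<html><body><div id=\"root\"></div>"
                + "<script src=\"app.js\"></script></body></html>";
        System.out.println(markIfNeeded(shell, Map.of()));
    }
}
```

The check for an existing flag matters: without it, a page that remains nearly empty after headless rendering could be flagged and re-scheduled again on every pass.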
Also note that StormCrawler now supports Playwright, which is generally preferable to Selenium for this type of headless crawling.