We are using Elastic Enterprise Search's App Search Web Crawler. We have observed that it does not crawl and index all the contents of an HTML page.
We suspected that this could be due to the HTML response or the <body> tag being too big. That is not the case: our body content and HTML response sizes are well within the default limits.
Yet the crawler picks up only a small portion of the page. We assumed this could be due to broken/unclosed <div> tags, but that is also not the case; we validated our HTML response and there are no unclosed divs.
We also checked the crawler logs in Kibana; they report Success 200, but when we inspect the indexed page content, it is not even half crawled. Only about 20% of the content is picked up by the crawler.
I believe the web crawler uses Apache Tika behind the scenes. I parsed the HTML content locally with a small Java program (just a main method) that uses Apache Tika, and I faced no problems; I could extract all of the HTML content.
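For reference, here is roughly what that local check looked like. This is a minimal sketch: page.html is a placeholder for our actual page, and passing -1 lifts BodyContentHandler's default 100,000-character write limit so the handler itself cannot truncate a large page.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaHtmlCheck {
    public static void main(String[] args) throws Exception {
        // -1 disables the handler's default 100,000-character write limit,
        // so the handler cannot silently truncate a large page on its own.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = new FileInputStream("page.html")) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        String text = handler.toString();
        System.out.println("Extracted " + text.length() + " characters");
    }
}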
Why is this happening? What could be the reason the web crawler is not indexing the full page content? The crawler is new, so not many people are using it yet, and there are few forum threads to check for already-answered questions.
We eventually solved this by fixing our HTML content. It appears Elastic Enterprise Search's Web Crawler stops crawling or extracting page content when it encounters a tag or text it does not understand.
For example, in our case the trigger was a string literal inside a script that happened to contain an HTML comment opener:
<script>
var s = s.contains("<!--cq") ...
</script>
The crawler treated this <!--cq as the start of an HTML comment and looked for a matching closing marker, but there isn't one, since this is just a string inside a script. Unfortunately, data-elastic-exclude did not work either; we had to remove this condition from the script.
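To double-check that Tika itself is not the fragile part, a reduced test case like the following can be run (a sketch; the markup is a simplified stand-in for our page). Consistent with what we saw locally, Tika's HTML parser treats script content as character data, so the text after the script should still be extracted:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCommentMarkerCheck {
    public static void main(String[] args) throws Exception {
        // Simplified stand-in for the markup that tripped up the crawler:
        // a script whose string literal contains "<!--cq".
        String html = "<html><body>"
                + "<p>Before the script.</p>"
                + "<script>var s = s.contains(\"<!--cq\");</script>"
                + "<p>After the script.</p>"
                + "</body></html>";

        BodyContentHandler handler = new BodyContentHandler(-1);
        new HtmlParser().parse(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
                handler, new Metadata(), new ParseContext());

        // If the parser survives the fake comment marker, both paragraphs
        // appear in the extracted text.
        System.out.println(handler.toString());
    }
}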
It turns out the parser the Elastic Web Crawler uses is very fragile.