
How do I exclude everything but text/html from a Heritrix crawl?


The Heritrix Use Cases page includes a use case called "Only Store Successful HTML Pages".

My problem: I don't know how to implement it in my cxml file, especially this step: "Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*. ..." There is no ContentTypeRegExpFilter in the sample cxml files.


Solution

  • The use cases you cite are somewhat out of date and refer to Heritrix 1.x. In Heritrix 3, filters have been replaced by decide rules and the configuration framework is very different. Still, the basic concept is the same.

    The cxml file is essentially a Spring configuration file. You need to set the shouldProcessRule property on the ARCWriter bean to a ContentTypeMatchesRegexDecideRule.

    A possible ARCWriter configuration:

      <bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
        <property name="shouldProcessRule">
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
          </bean>
        </property>
        <!-- Other properties that need to be set ... -->
      </bean>
    

    This causes the processor to handle only those items that pass the DecideRule, which in turn passes only those whose content type (MIME type) matches the provided regular expression.

    Be careful about the 'decision' setting: are you ruling things in or out? (My example rules things in; anything not matching is ruled out.)

    As shouldProcessRule is inherited from Processor, this can be applied to any processor.
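
    For instance, if the crawl writes WARC files rather than ARC files, the same rule could be attached to the WARC writer instead. This is only a minimal sketch: it assumes the org.archive.modules.writer.WARCWriterProcessor class and a bean id of warcWriter, so check both against your own cxml before using it:

      <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
        <!-- Same decide rule as above, attached to a different processor -->
        <property name="shouldProcessRule">
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
          </bean>
        </property>
        <!-- Other WARC writer properties stay as they are in your profile ... -->
      </bean>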

    More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org covers Heritrix 1).