
Use of Heritrix's HtmlFormCredential and CredentialStore


I am attempting to add authentication to my Heritrix configuration. My .cxml file has the following:

<bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
  <property name="credentialStore">
    <ref bean="credentialStore" />
  </property>
</bean>
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="credentialStore">
    <ref bean="credentialStore" />
  </property>
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <bean class="org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule">
            <property name="decision" value="REJECT" />
          </bean>
          <bean class="org.archive.modules.deciderules.ResourceNoLongerThanDecideRule">
            <property name="contentLengthThreshold" value="54" />
            <property name="useHeaderLength" value="true" />
            <property name="decision" value="REJECT" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
<bean id="exampleCredential" class="org.archive.modules.credential.HtmlFormCredential">
  <property name="domain" value="example.com" />
  <property name="loginUri" value="https://example.com/user?destination=%2f" />
  <property name="formItems">
    <map>
      <!-- username/password -->
      <entry key="name" value="something@something.com"/>
      <entry key="pass" value="genericpassword"/>
      <!-- hidden inputs -->
      <entry key="form_build_id" value="form-asdf" />
      <entry key="form_id" value="user_login" />
      <!-- submit -->
      <entry key="op" value="submit"/>
    </map>
  </property>
</bean>
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="exampleCredential" value-ref="exampleCredential" />
    </map>
  </property>
</bean>

I also set the log level for FetchHTTP and PreconditionEnforcer to FINE, but nothing seems to happen. Neither module produces any log output, and the pages that are pulled down are clearly the unauthenticated view. I find it unclear how to use the CredentialStore, even though I've spent a good amount of time reading through the specifications, which are patchy at best when it comes to website authentication. If anyone knows how to set up authentication in Heritrix, please help.

Update: Logging didn't work because Eclipse didn't know about my HERITRIX_HOME variable, so it never read the logging configuration file. I also changed the domain property of the exampleCredential bean from:

<property name="domain" value="example.com" />

to:

<property name="domain" value="www.example.com" />

and now the login page is enqueued, but the logger prints the following for all queued files:

org.archive.crawler.prefetch.PreconditionEnforcer.innerProcessResult() PolitenessEnforcer doesn't understand uri's of type dns (ignoring)
org.archive.modules.deciderules.ResourceNoLongerThanDecideRule.evaluate() Error: Missing HttpMethod object in CrawlURI. dns:secure.www.example.com

and none of the files are downloaded or crawled. So although I made progress, it didn't lead anywhere, and there is not much logging information to go on.


Solution

  • I also asked this question on the Heritrix forums (http://tech.groups.yahoo.com/group/archive-crawler/message/8235), and Noah Levitt suggested adding the login page as a seed to my crawl. Everything now seems to work without much issue. My conclusion is that my config file was set up correctly, but the crawl was missing the login page as a seed.
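
For reference, a minimal sketch of what seeding the login page could look like in the .cxml. It assumes the seeds bean follows the layout of the default Heritrix 3 profile (a TextSeedModule reading from a ConfigString) and that the seed is the same URL as the credential's loginUri. The first seed, http://www.example.com/, is a hypothetical start page and is not taken from the original config:

<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigString">
      <!-- one seed URI per line -->
      <!-- first line: hypothetical start page (assumption) -->
      <!-- second line: the login page, same value as loginUri in exampleCredential -->
      <property name="value">
        <value>
http://www.example.com/
https://example.com/user?destination=%2f
        </value>
      </property>
    </bean>
  </property>
</bean>

If the profile sets its seeds some other way (for example through a property override or an external seeds file), the login URL would presumably go there instead. The credential and credentialStore beans from the question stay unchanged.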