selenium-chromedriver, stormcrawler

How do you set up StormCrawler to run with ChromeDriver instead of PhantomJS?


The tutorial here describes how to set up StormCrawler to run with PhantomJS, but PhantomJS doesn't seem capable of fetching and executing externally linked JavaScript (e.g., scripts referenced from outside the immediate page's context). ChromeDriver appears to handle this case, however. How can I set up StormCrawler to run with ChromeDriver instead of PhantomJS?


Solution

  • The basic steps you need to follow are:

    1. Install the latest versions of Chrome and ChromeDriver (the commands below are based on the tutorial here):
      # Install Google Chrome
      wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
      sudo apt install ./google-chrome-stable_current_amd64.deb
      
      # Install Chromedriver
      PLATFORM=linux64  # Adjust as necessary depending on your system
      VERSION=$(curl http://chromedriver.storage.googleapis.com/LATEST_RELEASE)
      curl -O http://chromedriver.storage.googleapis.com/$VERSION/chromedriver_$PLATFORM.zip
      unzip chromedriver_$PLATFORM.zip
      
      # Move the executable into your path (root is needed to write to /usr/bin)
      sudo cp chromedriver /usr/bin/
      
    2. Specify the following selenium settings in your crawler configuration file (based on a snippet from @JulienNioche here), including the address and port at which chromedriver will be running:
      http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      selenium.addresses: "http://localhost:9515"
      selenium.setScriptTimeout: 10000
      selenium.pageLoadTimeout: 1000
      selenium.implicitlyWait: 1000
      selenium.capabilities:
        goog:chromeOptions:
          args:
          - "--no-sandbox"
          - "--disable-dev-shm-usage"
          - "--headless"
          - "--disable-gpu"
      
    3. Rebuild your StormCrawler Maven package: mvn clean package (mvn clean install also works, since its lifecycle already includes the package phase)
      • Only necessary if you modified any of your source or configuration files, but it doesn't hurt to rebuild anyway
    4. Start chromedriver in the background (it defaults to port 9515): chromedriver &. Note that --headless is a Chrome flag rather than a chromedriver flag; it is already passed via goog:chromeOptions in step 2. A quick connectivity check is sketched right after this list.
    5. [Only if connecting to Elasticsearch] Set up your ES indices, if you haven't already done so
    6. Start your topology (first in local mode, as shown here, to test your setup; if it doesn't crash, you should be good to go in remote mode; a remote-submit example follows the es-crawler.flux file below):
      storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 600000
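
    To confirm that chromedriver is up and reachable at the configured address (step 4), you can query its standard WebDriver /status endpoint. A minimal sketch, assuming chromedriver is listening on its default port 9515:
      # Query chromedriver's WebDriver status endpoint;
      # a healthy instance reports "ready": true in the response.
      curl -s http://localhost:9515/status
      # Example of a healthy (abridged) response:
      # {"value":{"ready":true,"message":"ChromeDriver ready for new sessions."}}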
      

    If things still don't work after following these steps, there may be a problem in one of your configuration files, or a version incompatibility between one or more of the tools. In any case, I've provided below a set of example configurations that worked for me (as of the time of writing), which I hope will help in getting things running.
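
    As a first diagnostic, print the version of each tool in the chain; in particular, the major versions of Chrome and ChromeDriver must match. A quick sketch:
      # Print the versions of all the moving parts;
      # the Chrome and ChromeDriver major versions must be identical.
      google-chrome --version
      chromedriver --version
      java -version
      mvn --version
      storm version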


    Example Configurations (for stormcrawler-elasticsearch setup with chromedriver)

    Versions used at the time of writing this answer: StormCrawler 1.17 and Apache Storm 1.2.3 (as pinned in the pom.xml below).

    crawler-conf.yaml

    config:
      topology.workers: 3
      topology.message.timeout.secs: 3000
      topology.max.spout.pending: 100
      topology.debug: true
    
      fetcher.threads.number: 100
    
      # override the JVM parameters for the workers
      topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
    
      # mandatory when using Flux
      topology.kryo.register:
        - com.digitalpebble.stormcrawler.Metadata
    
      # lists the metadata to persist to storage
      # these are not transferred to the outlinks
      metadata.persist:
       - _redirTo
       - error.cause
       - error.source
       - isSitemap
       - isFeed
    
      http.agent.name: "Anonymous Coward"
      http.agent.version: "1.0"
      http.agent.description: "built with StormCrawler 1.17"
      http.agent.url: "http://someorganization.com/"
      http.agent.email: "someone@someorganization.com"
    
      # The maximum number of bytes for returned HTTP response bodies.
      # With the default of 65536 the fetched page would be trimmed to 64KB;
      # set -1 to disable the limit, as done here.
      http.content.limit: -1 # default 65536
    
      parsefilters.config.file: "parsefilters.json"
      urlfilters.config.file: "urlfilters.json"
    
      # revisit a page daily (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.default: 1440
    
      # revisit a page with a fetch error after 2 hours (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.fetch.error: 120
    
      # never revisit a page with an error (or set a value in minutes)
      fetchInterval.error: -1
    
      # configuration for the classes extending AbstractIndexerBolt
      # indexer.md.filter: "someKey=aValue"
      indexer.url.fieldname: "url"
      indexer.text.fieldname: "content"
      indexer.canonical.name: "canonical"
      indexer.md.mapping:
      - parse.title=title
      - parse.keywords=keywords
      - parse.description=description
      - domain=domain
    
      # Metrics consumers:
      topology.metrics.consumer.register:
         - class: "org.apache.storm.metric.LoggingMetricsConsumer"
           parallelism.hint: 1
    
      http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      selenium.addresses: "http://localhost:9515"
      selenium.setScriptTimeout: 10000
      selenium.pageLoadTimeout: 1000
      selenium.implicitlyWait: 1000
      selenium.capabilities:
        goog:chromeOptions:
          args:
          - "--nosandbox"
          - "--disable-dev-shm-usage"
          - "--headless"
          - "--disable-gpu"
    

    es-conf.yaml

    config:
      # ES indexer bolt
      es.indexer.addresses: "localhost"
      es.indexer.index.name: "content"
      # es.indexer.pipeline: "_PIPELINE_"
      es.indexer.create: false
      es.indexer.bulkActions: 100
      es.indexer.flushInterval: "2s"
      es.indexer.concurrentRequests: 1
    
      # ES metricsConsumer
      es.metrics.addresses: "http://localhost:9200"
      es.metrics.index.name: "metrics"
    
      # ES spout and persistence bolt
      es.status.addresses: "http://localhost:9200"
      es.status.index.name: "status"
      es.status.routing: true
      es.status.routing.fieldname: "key"
      es.status.bulkActions: 500
      es.status.flushInterval: "5s"
      es.status.concurrentRequests: 1
    
      # spout config #
    
      # time in secs for which the URLs will be considered for fetching after an ack or a fail
      spout.ttl.purgatory: 30
    
      # Min time (in msecs) to allow between 2 successive queries to ES
      spout.min.delay.queries: 2000
    
      # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
      spout.reset.fetchdate.after: 120
    
      es.status.max.buckets: 50
      es.status.max.urls.per.bucket: 2
      # field to group the URLs into buckets
      es.status.bucket.field: "key"
      # fields to sort the URLs within a bucket
      es.status.bucket.sort.field:
       - "nextFetchDate"
       - "url"
      # field to sort the buckets
      es.status.global.sort.field: "nextFetchDate"
    
      # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
      es.status.max.start.offset: 500
    
      # AggregationSpout : sampling improves the performance on large crawls
      es.status.sample: false
    
      # max allowed duration of a query in sec
      es.status.query.timeout: -1
    
      # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
      # uses it as nextFetchDate
      es.status.recentDate.increase: -1
      es.status.recentDate.min.gap: -1
    
      topology.metrics.consumer.register:
           - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
             parallelism.hint: 1
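
    Step 5 above mentions setting up the ES indices; a quick way to confirm that the indices named in this file actually exist, assuming Elasticsearch is running on localhost:9200 as configured above:
      # Each existing index returns HTTP 200; a 404 means it still needs to be created.
      for IDX in content status metrics; do
        curl -s -o /dev/null -w "$IDX: %{http_code}\n" "http://localhost:9200/$IDX"
      done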
    

    es-crawler.flux

    name: "crawler"
    
    includes:
        - resource: true
          file: "/crawler-default.yaml"
          override: false
    
        - resource: false
          file: "crawler-conf.yaml"
          override: true
    
        - resource: false
          file: "es-conf.yaml"
          override: true
    
    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
        parallelism: 10
    
      - id: "filespout"
        className: "com.digitalpebble.stormcrawler.spout.FileSpout"
        parallelism: 1
        constructorArgs:
          - "."
          - "seeds.txt"
          - true
    
    bolts:
      - id: "filter"
        className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
        parallelism: 3
      - id: "partitioner"
        className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
        parallelism: 3
      - id: "fetcher"
        className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
        parallelism: 3
      - id: "sitemap"
        className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
        parallelism: 3
      - id: "parse"
        className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
        parallelism: 12
      - id: "index"
        className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
        parallelism: 3
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 3
      - id: "status_metrics"
        className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
        parallelism: 3
    
    streams:
      - from: "spout"
        to: "partitioner"
        grouping:
          type: SHUFFLE
    
      - from: "spout"
        to: "status_metrics"
        grouping:
          type: SHUFFLE
    
      - from: "partitioner"
        to: "fetcher"
        grouping:
          type: FIELDS
          args: ["key"]
    
      - from: "fetcher"
        to: "sitemap"
        grouping:
          type: LOCAL_OR_SHUFFLE
    
      - from: "sitemap"
        to: "parse"
        grouping:
          type: LOCAL_OR_SHUFFLE
    
      - from: "parse"
        to: "index"
        grouping:
          type: LOCAL_OR_SHUFFLE
    
      - from: "fetcher"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
    
      - from: "sitemap"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
    
      - from: "parse"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
    
      - from: "index"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
    
      - from: "filespout"
        to: "filter"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
    
      - from: "filter"
        to: "status"
        grouping:
          streamId: "status"
          type: CUSTOM
          customClass:
            className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
            constructorArgs:
              - "byDomain"
    

    parsefilters.json

    {
      "com.digitalpebble.stormcrawler.parse.ParseFilters": [
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
          "name": "XPathFilter",
          "params": {
            "canonical": "//*[@rel=\"canonical\"]/@href",
            "parse.description": [
                "//*[@name=\"description\"]/@content",
                "//*[@name=\"Description\"]/@content"
             ],
            "parse.title": [
                "//TITLE",
                "//META[@name=\"title\"]/@content"
             ],
             "parse.keywords": "//META[@name=\"keywords\"]/@content"
          }
        },
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
          "name": "LinkParseFilter",
          "params": {
             "pattern": "//FRAME/@src"
           }
        },
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
          "name": "DomainParseFilter",
          "params": {
            "key": "domain",
            "byHost": false
           }
        },
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
          "name": "CommaSeparatedToMultivaluedMetadata",
          "params": {
            "keys": ["parse.keywords"]
           }
        }
      ]
    }
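
    The XPath expressions above can be sanity-checked against a real page with xmllint before running a full crawl. A rough sketch (note that xmllint lower-cases HTML element names, whereas the expressions above target the upper-cased DOM that StormCrawler builds, so adjust the case; an empty XPath set just means the page lacks that field):
      # Fetch a sample page and probe the same fields the XPathFilter extracts.
      curl -s https://example.com/ -o page.html
      xmllint --html --xpath '//*[@rel="canonical"]/@href' page.html 2>/dev/null
      xmllint --html --xpath '//meta[@name="description"]/@content' page.html 2>/dev/null
      xmllint --html --xpath '//title' page.html 2>/dev/null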
    

    pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    
        <modelVersion>4.0.0</modelVersion>
        <groupId>org.rcsb.crawler</groupId>
        <artifactId>stormcrawler</artifactId>
        <version>1.0-SNAPSHOT</version>
        <packaging>jar</packaging>
    
        <name>stormcrawler</name>
    
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <stormcrawler.version>1.17</stormcrawler.version>
        </properties>
    
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.2</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.codehaus.mojo</groupId>
                    <artifactId>exec-maven-plugin</artifactId>
                    <version>1.3.2</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>exec</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <executable>java</executable>
                        <includeProjectDependencies>true</includeProjectDependencies>
                        <includePluginDependencies>false</includePluginDependencies>
                        <classpathScope>compile</classpathScope>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>1.3.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <createDependencyReducedPom>false</createDependencyReducedPom>
                                <transformers>
                                    <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                    <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                        <mainClass>org.apache.storm.flux.Flux</mainClass>
                                        <manifestEntries>
                                            <Change></Change>
                                            <Build-Date></Build-Date>
                                        </manifestEntries>
                                    </transformer>
                                </transformers>
                                <!-- The filters below are necessary if you want to include the Tika
                                    module -->
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                    <filter>
                                        <!-- https://issues.apache.org/jira/browse/STORM-2428 -->
                                        <artifact>org.apache.storm:flux-core</artifact>
                                        <excludes>
                                            <exclude>org/apache/commons/**</exclude>
                                            <exclude>org/apache/http/**</exclude>
                                            <exclude>org/yaml/**</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    
        <dependencies>
            <dependency>
                <groupId>com.digitalpebble.stormcrawler</groupId>
                <artifactId>storm-crawler-core</artifactId>
                <version>${stormcrawler.version}</version>
            </dependency>
            <dependency>
                <groupId>com.digitalpebble.stormcrawler</groupId>
                <artifactId>storm-crawler-elasticsearch</artifactId>
                <version>${stormcrawler.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.storm</groupId>
                <artifactId>storm-core</artifactId>
                <version>1.2.3</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.storm</groupId>
                <artifactId>flux-core</artifactId>
                <version>1.2.3</version>
            </dependency>
        </dependencies>
    </project>