Tags: google-cloud-platform, mahout, mahout-recommender, google-cloud-dataproc

Apache Mahout on Dataproc?


Is Apache Mahout (https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) available on Google Dataproc?


Solution

  • Google Cloud Dataproc does not bundle Apache Mahout by default, but you can use Mahout with Dataproc in a couple of ways.

    Bundled in an uber jar

    You can bundle Mahout into your job's jar (using the Maven Shade or Assembly plugin, or the equivalent in your build tool of choice) and run it as a regular Hadoop MapReduce or Spark job.
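As a rough sketch of the uber jar route, the script below assembles a `gcloud dataproc jobs submit hadoop` command for Mahout's item-based `RecommenderJob` (the job described on the page linked in the question). The cluster name, region, bucket, and file paths are hypothetical placeholders, and it assumes the uber jar already contains Mahout's MapReduce classes. The helper `build_submit_cmd` is illustrative only and prints the command instead of running it, so it can be reviewed first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print (do not run) a gcloud command that submits Mahout's item-based
# RecommenderJob from an uber jar. Every argument is a placeholder.
build_submit_cmd() {
  local cluster="$1" region="$2" jar="$3" input="$4" output="$5"
  local cmd=(
    gcloud dataproc jobs submit hadoop
    "--cluster=${cluster}"
    "--region=${region}"
    # Main class inside the uber jar; the jar itself is passed via --jars.
    "--class=org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"
    "--jars=${jar}"
    # Everything after "--" is handed to the job, not to gcloud.
    --
    --input "${input}"
    --output "${output}"
    --similarityClassname SIMILARITY_COOCCURRENCE
  )
  echo "${cmd[*]}"
}

# Hypothetical values; replace them with your own cluster and bucket.
build_submit_cmd my-cluster us-central1 \
  gs://my-bucket/recommender-uber.jar \
  gs://my-bucket/input/ratings.csv \
  gs://my-bucket/output
```

Once the printed command looks right, it can be pasted into a shell where `gcloud` is configured. Older `gcloud` releases may not need `--region`.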

    As a client on the master node

    Mahout 0.11.0 is available as an Apache Bigtop package inside of Dataproc. If you run:

    sudo apt-get update
    sudo apt-get install mahout -y
    

    on the master node, either after SSHing in or from an initialization action, you should have the 'mahout' command with the proper classpath.
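For the initialization action route, a minimal sketch might look like the script below. It assumes the `dataproc-role` metadata attribute that Dataproc exposes to initialization actions (read through `/usr/share/google/get_metadata_value`) and that the script runs as root, so `sudo` is not needed. The helper name `is_master` is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path of the metadata helper present on Dataproc nodes.
METADATA_TOOL=/usr/share/google/get_metadata_value

# Succeeds only when the given role string names the master node.
is_master() {
  [[ "$1" == "Master" ]]
}

# Do nothing when the helper is missing (i.e. when not on a Dataproc node).
if [[ -x "${METADATA_TOOL}" ]]; then
  role="$("${METADATA_TOOL}" attributes/dataproc-role)"
  if is_master "${role}"; then
    # Same commands as above; initialization actions already run as root.
    apt-get update
    apt-get install -y mahout
  fi
fi
```

To use it, you would upload the script to a Cloud Storage bucket and reference it with `--initialization-actions` when creating the cluster.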

    Important note on Spark versioning

    Mahout 0.11.0 only supports Spark 1.3, but Dataproc (1.0) ships with Spark 1.6.1. You could download or bundle Mahout 0.12.0, which came out last week, but even that only claims to support Spark 1.5. When there is a better solution for Spark compatibility, we will create a Mahout initialization action at https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.