pyspark · amazon-sagemaker · apache-spark-ml

How to integrate spark.ml pipeline fitting and hyperparameter optimisation in AWS SageMaker?


Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as the compute backend, all within SageMaker using their Training Job API. To clarify:

  1. I have to use LightGBM; there is no alternative here.
  2. I need the Spark compute backend because training on the current dataset no longer fits in memory (a sketch of the Spark-side fit follows this list).
  3. I want to use the SageMaker Training Job setting so that I can use an SM Hyperparameter Optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
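
For context, here is roughly what the Spark-side fit looks like. This is a minimal sketch, assuming SynapseML (formerly MMLSpark) provides the LightGBM-on-Spark interface; the input path and column names are placeholders:

```python
# Minimal LightGBM-on-Spark fit via SynapseML (assumes the synapseml
# package is installed on the cluster; paths and columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

spark = SparkSession.builder.appName("lightgbm-training").getOrCreate()

df = spark.read.parquet("s3://my-bucket/train/")  # hypothetical dataset

assembler = VectorAssembler(inputCols=["f0", "f1", "f2"], outputCol="features")
lgbm = LightGBMClassifier(
    labelCol="label",
    featuresCol="features",
    numLeaves=31,       # the kind of hyperparameter an SM tuning job would vary
    learningRate=0.1,
)
model = Pipeline(stages=[assembler, lgbm]).fit(df)
```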

Now, I know the general approach to running custom training in SM: build a container in a certain way, push it to ECR, and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API (sketched below). In that case SM handles resource provisioning for you: it creates an instance, and so on. What I am confused about is that, to use a Spark compute backend, I would need an EMR cluster running, so the SDK would have to handle that as well, and I do not see how this is possible with the API above.
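
For reference, the standard flow I am referring to looks roughly like this. A sketch with the SageMaker Python SDK; the image URI, role, buckets, and metric regex are all placeholders:

```python
# Sketch: custom-container training job plus a Bayesian tuning job
# (SageMaker Python SDK v2; all resource names are placeholders).
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/lightgbm-spark:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",
)

# Bayesian search over LightGBM hyperparameters; SM reads the objective
# metric from the job logs via the regex below.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    metric_definitions=[{"Name": "validation:auc",
                         "Regex": "validation-auc=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "num_leaves": IntegerParameter(15, 255),
        "learning_rate": ContinuousParameter(0.01, 0.3),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train/"})
```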

There is also the SageMaker PySpark SDK. However, the SageMakerEstimator API from that package does not support on-the-fly cluster configuration either.

Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that the same job could later be used for hyperparameter tuning?

One way I see is to run an EMR cluster in the background and have a regular SM Estimator job connect to it and do the training, essentially running the Spark driver program inside the SM Estimator job.

Has anyone done anything similar in the past?

Thanks


Solution

Thanks for your questions. Here are answers:

Your idea of using the Training job as a client to launch an EMR cluster is good and should work (provided SM has the right permissions), and it will indeed allow you to use SM Model Tuning.
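
A minimal sketch of that client pattern, assuming the training container's entry point uses boto3 to launch the cluster and submit the Spark step (cluster sizing, roles, and S3 paths are all placeholders):

```python
# Hypothetical SM training entry point: launch an EMR cluster, run the
# Spark training step on it, and wait for it to finish. Assumes the
# job's execution role has EMR permissions; all names are placeholders.
import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="lightgbm-spark-training",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the step completes
    },
    LogUri="s3://my-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "train-lightgbm",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/code/train.py",
                     "--num-leaves", "31"],  # hyperparameters forwarded from SM
        },
    }],
)
cluster_id = response["JobFlowId"]

# Block until EMR terminates so the SM job's lifetime covers the Spark work.
while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    if state in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(60)
```

For Model Tuning to work with this pattern, the entry point would also need to retrieve the validation metric produced on EMR (e.g. from S3) and print it to the job log so the tuner's metric regex can capture it.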

In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.