scala apache-spark machine-learning playframework-2.0 predictionio

Play Framework with Spark MLlib vs PredictionIO


Good morning,

currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.

As we evolve, we'll be developing an intranet to address some of the company's needs, and for some time now I've been considering Scala (and Play Framework) as the way to go.

This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. That is when I started to think about Spark ML and came across PredictionIO.

As we are shifting our skills towards data science, which will benefit and teach us and the company most: Play Framework with Spark MLlib, or PredictionIO?

I'm not trying to open an opinion-based question, but rather to learn from your experience, architectures, and solutions.

Thank you


Solution

  • Both are good options:

    1. Use PredictionIO if you are new to ML. It is easy to start with, but it will limit you in the long run.
    2. Use Spark if you have confidence in your data science and data engineering team. Spark has an excellent, easy-to-use API along with an extensive ML library. That said, putting things into production requires some distributed Spark knowledge and experience, and it is tricky at times to make it efficient and reliable.

    Here are the options:

    1. Spark on Databricks cloud: expensive, but an easy way to use Spark, with no data engineering required.
    2. PredictionIO, if you are certain that its ML can solve all your business cases.
    3. Spark on Google Dataproc: an easy managed cluster for 60% less than AWS, with some engineering still required.

    In summary: PredictionIO for a quick fix, and Spark for long-term data science and engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
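
To give a feel for the Spark ML API mentioned above, here is a minimal sketch of the "cluster clients" use case from the question, using k-means from `org.apache.spark.ml`. The column names (`clientId`, `annualSpend`, `ordersPerMonth`), the sample rows, and the choice of k are assumptions made up for illustration, not anything from the asker's data. It runs Spark in local mode, so it says nothing about the production and cluster-tuning concerns raised in the answer.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClientClustering {

  // Assigns each client row to one of k clusters.
  // Assumes (hypothetically) numeric columns "annualSpend" and "ordersPerMonth".
  def clusterClients(clients: DataFrame, k: Int): DataFrame = {
    // Spark ML estimators expect a single vector column of features.
    val assembler = new VectorAssembler()
      .setInputCols(Array("annualSpend", "ordersPerMonth"))
      .setOutputCol("features")
    val withFeatures = assembler.transform(clients)

    val kmeans = new KMeans()
      .setK(k)
      .setSeed(42L) // fixed seed so repeated runs are comparable
      .setFeaturesCol("features")
      .setPredictionCol("cluster")

    // fit() learns the cluster centres; transform() adds the "cluster" column.
    kmeans.fit(withFeatures).transform(withFeatures)
  }

  def main(args: Array[String]): Unit = {
    // Local mode only; a real deployment would point at a cluster instead.
    val spark = SparkSession.builder()
      .appName("client-clustering-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for a warehouse extract.
    val clients = Seq(
      (1L, 1200.0, 3.0),
      (2L, 1100.0, 2.0),
      (3L, 9500.0, 20.0),
      (4L, 9900.0, 22.0)
    ).toDF("clientId", "annualSpend", "ordersPerMonth")

    clusterClients(clients, k = 2).select("clientId", "cluster").show()
    spark.stop()
  }
}
```

In practice you would probably scale the features first (for example with `StandardScaler`), since k-means is distance-based and a column like spend would otherwise dominate. A function like `clusterClients` could be called from a Play controller or a batch job, but how to wire that up is a design decision this sketch leaves open.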