Tags: apache-spark, pyspark, databricks, azure-databricks, computation-theory

What is the industry standard for number of clusters for a development team in Databricks?


I am part of a team of 5 developers who gather, transform, analyze, and run predictions on data in Azure Databricks (basically a combination of data science and data engineering). Until now we have been working with relatively small data, so the team of 5 could easily share a single cluster with 8 worker nodes in development. Even though we are 5 developers, usually at most 3 of us are in Databricks at the same time.

Recently we started working with "Big Data", so we need to make use of Databricks' Apache Spark parallelization to improve the run-times of our code. However, a problem quickly came to light: with more than one developer running parallel workloads on a single cluster, jobs queue up and slow us down. Because of this we have been considering increasing the number of clusters in our dev environment, so that multiple developers can simultaneously work on code that makes use of Spark's parallelism.

My question is this: what is the industry standard for the number of clusters in a development environment? Do teams usually have a cluster per developer? That sounds like it could easily become quite expensive.
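To make the cost concern concrete, here is a back-of-envelope comparison. All the numbers are purely illustrative assumptions (a made-up blended $1/node-hour rate, an 8-hour workday, and a guessed 50% active time when auto-termination shuts idle clusters down), not real Azure Databricks pricing:

```python
# Back-of-envelope cluster cost arithmetic. Every rate and utilization figure
# below is a placeholder assumption, not actual Databricks/Azure pricing.

RATE_PER_NODE_HOUR = 1.0  # assumed blended $/node-hour (VM + DBU), illustrative only
WORKDAY_HOURS = 8

def daily_cost(node_count, hours_on, rate=RATE_PER_NODE_HOUR):
    """Cost of keeping `node_count` nodes up for `hours_on` hours."""
    return node_count * hours_on * rate

# Option A: one shared cluster, 1 driver + 8 workers, up all day.
shared = daily_cost(1 + 8, WORKDAY_HOURS)

# Option B: a small cluster per developer, 1 driver + 2 workers each, 5 developers.
per_dev = 5 * daily_cost(1 + 2, WORKDAY_HOURS)

# With auto-termination, per-developer clusters only run while actively used;
# assume ~50% of the workday as an illustrative utilization figure.
per_dev_auto = per_dev * 0.5

print(f"shared cluster:            ${shared:.2f}/day")    # $72.00/day
print(f"per-dev clusters (always): ${per_dev:.2f}/day")   # $120.00/day
print(f"per-dev clusters (auto):   ${per_dev_auto:.2f}/day")  # $60.00/day
```

Under these assumed numbers, a cluster per developer is only clearly more expensive if the clusters idle all day; with auto-termination the two options land in the same ballpark.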


Solution

  • Usually I see the following pattern:

    P.S. Integration tests and similar workloads can easily be run as jobs, which are less expensive.
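    As a sketch of that P.S.: instead of running tests on an always-on all-purpose cluster, you can submit them as a one-off run on an ephemeral job cluster (billed at the cheaper jobs-compute rate, and terminated automatically when the run finishes). The payload below follows the shape of the Databricks Jobs API 2.1 `runs/submit` endpoint; the notebook path, node type, and Spark version are placeholder values you would replace with your own:

    ```python
    # Hedged sketch: building a Jobs API 2.1 `runs/submit` payload that runs an
    # integration-test notebook on an ephemeral job cluster instead of a shared
    # all-purpose cluster. Field values here are illustrative placeholders.
    import json

    def build_run_submit_payload(notebook_path, num_workers=2):
        """Payload for POST /api/2.1/jobs/runs/submit with a per-run cluster."""
        return {
            "run_name": "integration-tests",
            "tasks": [
                {
                    "task_key": "run_tests",
                    "notebook_task": {"notebook_path": notebook_path},
                    # `new_cluster` makes Databricks create a cluster for this
                    # run only and tear it down when the run completes.
                    "new_cluster": {
                        "spark_version": "13.3.x-scala2.12",   # placeholder runtime
                        "node_type_id": "Standard_DS3_v2",     # placeholder Azure VM type
                        "num_workers": num_workers,
                    },
                }
            ],
        }

    payload = build_run_submit_payload("/Repos/team/project/tests/integration")
    print(json.dumps(payload, indent=2))
    ```

    You would send this payload with an authenticated POST to your workspace's `/api/2.1/jobs/runs/submit` endpoint (or the equivalent `databricks` CLI command); because the cluster exists only for the duration of the run, nobody queues behind it on a shared cluster.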