visual-studio-code, pyspark, databricks, databricks-connect

Moving PySpark project development from the Databricks UI to VSCode using databricks-connect


I am inheriting a huge PySpark project, and instead of using the Databricks UI for development I would like to use VSCode via databricks-connect. Because of this, I am struggling to determine the best practices for the following:

Changing the whole code base to fit my preferred development strategy does not seem justifiable. Any pointers on how I can work around this?


Solution

  • Just want to mention that Databricks Connect is in maintenance mode and will be replaced by a new solution later this year.

    But really, to migrate to VSCode you don't need databricks-connect. There are a few options here:

    Regarding the use of the spark variable in your code: you can replace those references with SparkSession.getActiveSession() calls, which pull the active Spark session from the environment. That way you only need to instantiate a session explicitly in unit tests (I recommend the pytest-spark package to simplify this), and the rest of the code won't need SparkSession.builder.getOrCreate(), because on Databricks the session is created for you (if you use notebooks as the entry point). A sketch of this pattern follows below. Problems with dbutils are also solvable, as described in this answer.
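
    A minimal sketch of that pattern, assuming a hypothetical etl.py module and the spark_session fixture shipped with pytest-spark (the function, table, and column names are made up for illustration):

```python
# etl.py - hypothetical module; names are illustrative only
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def add_doubled_value(df: DataFrame) -> DataFrame:
    # Pure transformation: takes and returns DataFrames, never builds a session.
    return df.withColumn("value_doubled", F.col("value") * 2)


def load_source_table(table_name: str) -> DataFrame:
    # Pull whatever session is already active: on Databricks the platform
    # creates it before the notebook entry point runs; in local tests the
    # pytest-spark fixture creates it.
    spark = SparkSession.getActiveSession()
    return spark.table(table_name)
```

```python
# test_etl.py - the spark_session fixture is provided by pytest-spark and
# makes a local session active, so code relying on getActiveSession() works too.
from etl import add_doubled_value


def test_add_doubled_value(spark_session):
    df = spark_session.createDataFrame([(1,), (2,)], ["value"])
    result = add_doubled_value(df).orderBy("value").collect()
    assert [row.value_doubled for row in result] == [2, 4]
```

    With this structure the notebook entry point on Databricks just imports the module and calls its functions, while locally you run pytest against the same code without any databricks-connect setup.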