Tags: pyspark, vscode-extensions, azure-databricks, databricks-connect, databricks-vscode-extension

Unable to run certain methods using Databricks extension within Visual Studio Code (Databricks Connect V2)


Following the instructions in https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/dev-tasks/databricks-connect, when I try to run the example code provided (https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/tutorial), in particular the show() method, I get the following error in my VS Code terminal. The same error occurs when I run it in a Jupyter notebook.

Just wondering if anyone has come across this issue and resolved it?

Here are some key points worth mentioning:

  1. The Databricks extension for VS Code I'm using is v1.1.3.
  2. I'm using Python 3.10.4 in a virtual environment, which matches the Python version of my Databricks cluster.
  3. If I skip the show() method and instead run print(type(customers)) or customers.printSchema(), everything works and I get the expected output in my VS Code terminal.
  4. I'm using the 'Run Python File' option for the .py file and the 'Debug Cell' option for the .ipynb file, both of which, per the link above, use Databricks Connect.

pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNIMPLEMENTED
    details = "Method not found: spark.connect.SparkConnectService/ReattachExecute"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Method not found: spark.connect.SparkConnectService/ReattachExecute", grpc_status:12, created_time:"2023-10-02T22:47:34.7298799+00:00"}"
>

Here is the code (from the tutorial) that triggers the error:

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

# Explicit schema for the customers DataFrame
schema = StructType([
    StructField('CustomerID', IntegerType(), False),
    StructField('FirstName', StringType(), False),
    StructField('LastName', StringType(), False)
])

data = [
    [1000, 'Mathijs', 'Oosterhout-Rijntjes'],
    [1001, 'Joost', 'van Brunswijk'],
    [1002, 'Stan', 'Bokenkamp']
]

customers = spark.createDataFrame(data, schema)
customers.show()  # this call raises the SparkConnectGrpcException above
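
Per the tutorial, the expected output of show() is:

+----------+---------+-------------------+
|CustomerID|FirstName|           LastName|
+----------+---------+-------------------+
|      1000|  Mathijs|Oosterhout-Rijntjes|
|      1001|    Joost|      van Brunswijk|
|      1002|     Stan|          Bokenkamp|
+----------+---------+-------------------+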

Solution

  • The version of Databricks Connect must match the version of the cluster. The UNIMPLEMENTED "Method not found: spark.connect.SparkConnectService/ReattachExecute" error means the client is calling an RPC that the cluster's Spark Connect server doesn't implement, which is exactly the symptom of such a mismatch. It's actually mentioned in the documentation (see the version-check sketch after the quote):

    The Databricks Connect major and minor package version should match your Databricks Runtime version. Databricks recommends that you always use the most recent package of Databricks Connect that matches your Databricks Runtime version. For example, when you use a Databricks Runtime 14.0 cluster, you should also use the databricks-connect==14.0.* package.
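
As a quick sanity check, you can compare the locally installed client version with the Spark version the cluster reports. Below is a minimal sketch, assuming Databricks Connect V2 (the databricks-connect 13.x+ package) and an already-configured connection:

from importlib.metadata import version
from databricks.connect import DatabricksSession

# Version of the local databricks-connect package, e.g. '14.0.1'
print('databricks-connect client:', version('databricks-connect'))

# Spark version reported by the remote cluster, e.g. '3.5.0';
# map this back to the Databricks Runtime version of your cluster
spark = DatabricksSession.builder.getOrCreate()
print('remote Spark version:', spark.version)

If the two disagree, reinstall the client to match the cluster, e.g. pip install --upgrade "databricks-connect==14.0.*" for a Databricks Runtime 14.0 cluster, and make sure the plain pyspark package is not also installed in the same environment, since the two conflict.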