apache-spark · pyspark · spark-2.4.4

Functions missing from PySpark 2.4's pyspark.sql.functions in the IDE, but the code still works locally


I'm using PySpark 2.4 and noticed that the pyspark.sql.functions module appears to be missing some functions such as trim and col: PyCharm flags them as undefined. However, tasks I have written with these functions run correctly in my local PySpark 2.4 environment and produce the expected results. Why is that?
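
For reference, here is a quick runtime sanity check of my own (not from the Spark docs) showing the names really do exist once the module is imported:

# PyCharm marks col and trim as unresolved references,
# yet this runs fine under PySpark 2.4:
from pyspark.sql.functions import col, trim

print(callable(col), callable(trim))  # prints: True True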

Here is my environment setup:

from pyspark.sql import SparkSession

def create_env():
    spark = SparkSession.builder \
        .appName("HiveTest") \
        .master("local") \
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
        .config("spark.hadoop.hive.metastore.uris", "thrift://master:9083") \
        .config("spark.hadoop.hive.exec.scratchdir", "/user/hive/tmp") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    return spark
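
The excerpt below assumes the session comes from this helper:

spark = create_env()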

And here is an excerpt of my SparkSQL code:

from pyspark.sql.functions import col, length, lit, substring, trim, when

df = spark.table("ods.t_ctp20_department_d").select(
    trim(col("departmentid")).alias("branch_id"),
    trim(col("departmentid")).alias("branch_no"),
    trim(col("departmentname")).alias("branch_name"),
    when(trim(col("departmentid")) == 'FU', '00')
    .when(length(trim(col("departmentid"))) == 2, 'FU')
    .when(length(trim(col("departmentid"))) == 4, substring(trim(col("departmentid")), 1, 2))
    .when(length(trim(col("departmentid"))) == 6, substring(trim(col("departmentid")), 1, 4))
    .otherwise(substring(trim(col("departmentid")), 1, 6)).alias("up_branch_no"),
    lit('0').alias("branch_type"),
    lit('00').alias("data_source"),
    col("brokerid").alias("brokers_id"),
    lit(busi_date).alias("ds_date")  # busi_date is defined elsewhere in the script
)

To restate: even though PyCharm highlights trim and col as undefined, the code above executes successfully under PySpark 2.4 and produces the expected results.

I run the script either with "python3 xx.py" or through a remote interpreter in PyCharm. The remote interpreter is a virtual environment with only the pyspark 2.4 package installed.

When running the script through PyCharm, everything executes fine. However, PyCharm still reports the functions as undefined whenever I reference the PySpark 2.4 API in the editor.

I would like to understand the reason behind this. Is any additional configuration required in PyCharm when using PySpark 2.4? Thank you for your assistance!


Solution

  • This is because col, lit, and a number of other functions in pyspark.sql.functions are bound dynamically at import time rather than defined with explicit def statements. This pattern goes back to very early versions of Spark and appears to exist to handle version compatibility: the function names live in plain dictionaries inside pyspark/sql/functions.py, and a loop injects generated wrappers into the module namespace via globals(). PyCharm's static analysis cannot follow that injection, so the names are flagged as undefined even though they exist once the module is imported. If the warnings bother you, the third-party pyspark-stubs package provides static type stubs that let the IDE resolve these names; later PySpark releases define the functions explicitly, which is why they resolve cleanly in modern IDEs.
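
    Here is a simplified sketch of the pattern, paraphrased from memory rather than copied from the Spark 2.4 source; the real module uses a helper named _create_function and several dictionaries of function names:

    # Simplified sketch of how pyspark/sql/functions.py builds functions
    # dynamically in Spark 2.4. Names and docstrings are illustrative.

    def _create_function(name, doc=""):
        """Create a wrapper that forwards to the JVM SQL function `name`."""
        def _(col):
            # The real implementation calls into the JVM here, roughly:
            #   sc._jvm.functions.<name>(_to_java_column(col))
            raise NotImplementedError("illustrative stub")
        _.__name__ = name
        _.__doc__ = doc
        return _

    # Function names live in plain dicts keyed by name...
    _functions = {
        "col": "Returns a Column based on the given column name.",
        "lit": "Creates a Column of literal value.",
        "trim": "Trims spaces from both ends of a string column.",
    }

    # ...and are injected into the module namespace at import time.
    # Static analysis cannot see through globals()[...] = ..., so col,
    # lit, trim, etc. look undefined in the editor even though they
    # exist once the module is imported.
    for _name, _doc in _functions.items():
        globals()[_name] = _create_function(_name, _doc)

    You can confirm at runtime that the names are real:

    import pyspark.sql.functions as F
    print(callable(F.col), callable(F.trim))  # True True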