apache-spark pyspark

Read previous Spark APIs


When working with previous Spark versions, I am always confused about how to specify column names: should I use a string or a col object?

Example of regexp_replace from 3.1.2:

pyspark.sql.functions.regexp_replace(str, pattern, replacement)[source]

I was running a cluster with version 3.1.2 and both work:

df1.withColumn("modality",F.regexp_replace(F.col("name"),"i","")).display()
df1.withColumn("modality",F.regexp_replace("name","i","")).display()

From the documentation I would have assumed that only a string is allowed, but both work. How can I tell from the API documentation whether a col object is also allowed? (In the latest API docs this is pretty clear, but not in the previous ones.)


Solution

  • When you click the [source] link next to the function in the 3.1.2 docs, you get the source code of regexp_replace:

    def regexp_replace(str, pattern, replacement):
        r"""Replace all substrings of the specified string value that match regexp with rep.
    
        .. versionadded:: 1.5.0
    
        Examples
        --------
        >>> df = spark.createDataFrame([('100-200',)], ['str'])
        >>> df.select(regexp_replace('str', r'(\d+)', '--').alias('d')).collect()
        [Row(d='-----')]
        """
        sc = SparkContext._active_spark_context
        jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
        return Column(jc)
    

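    As you can see, the str argument is not used directly but is first passed through the _to_java_column function. The source code of _to_java_column shows that it accepts both column names (strings) and Column objects. (The excerpt below appears to come from a newer release than 3.1.2, judging by the type hints and PySparkTypeError, but the isinstance checks are the part that matters here.)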

    def _to_java_column(col: "ColumnOrName") -> "JavaObject":
        if isinstance(col, Column):
            jcol = col._jc
        elif isinstance(col, str):
            jcol = _create_column_from_name(col)
        else:
            raise PySparkTypeError(
                errorClass="NOT_COLUMN_OR_STR",
                messageParameters={"arg_name": "col", "arg_type": type(col).__name__},
            )
        return jcol
    

    When browsing the source page of functions, you see that _to_java_column is used almost everywhere, which means that for most functions (maybe even all, but I didn't check), either column names or Column objects can be used.
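
If you don't want to read the source page by eye, the same check can be roughly automated. The snippet below is only a sketch: source_mentions and function_mentions are hypothetical helper names made up for this answer, not part of PySpark, and the check is a plain text search, so it only tells you that _to_java_column is called somewhere in the function, not for which argument.

```python
import inspect
import re


def source_mentions(source: str, helper: str = "_to_java_column") -> bool:
    """Return True if `helper` is called somewhere in the given source text."""
    return re.search(rf"\b{re.escape(helper)}\s*\(", source) is not None


def function_mentions(func, helper: str = "_to_java_column") -> bool:
    """Same check, applied to the source of a live Python function."""
    return source_mentions(inspect.getsource(func), helper)


# Body of regexp_replace as quoted above from the 3.1.2 source (docstring omitted).
REGEXP_REPLACE_SRC = '''
def regexp_replace(str, pattern, replacement):
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    return Column(jc)
'''

print(source_mentions(REGEXP_REPLACE_SRC))  # True

# On a cluster with PySpark installed, the live function could be checked the
# same way (assumption: the helper is still called _to_java_column in your
# version; other releases may route through differently named helpers):
#
#   from pyspark.sql import functions as F
#   function_mentions(F.regexp_replace)
```

Note that in the 3.1.2 source quoted above only str goes through _to_java_column, so pattern and replacement still have to be plain strings there. A positive result therefore means "at least one argument accepts a name or a Column", and you still need to look at the source to see which one.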