When working with previous Spark versions, I am always confused when it comes to specifying column names: should I use a string or a col object?
Example of regexp_replace from 3.1.2:
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
I was running a cluster with version 3.1.2 and both work:
df1.withColumn("modality",F.regexp_replace(F.col("name"),"i","")).display()
df1.withColumn("modality",F.regexp_replace("name","i","")).display()
From the docs I would have assumed that only a string is allowed, but both work. How can I tell from the API docs whether a col object is also allowed (in the latest API this is pretty clear, but not in the previous ones)?
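For reference, a minimal setup that makes the two calls above reproducible (the sample data is made up; .show() replaces the Databricks-specific .display()):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample data; any string column works
df1 = spark.createDataFrame([("imaging",), ("vision",)], ["name"])

df1.withColumn("modality", F.regexp_replace(F.col("name"), "i", "")).show()
df1.withColumn("modality", F.regexp_replace("name", "i", "")).show()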
When you click on the source button of the 3.1.2 docs, you find the source code of regexp_replace:
def regexp_replace(str, pattern, replacement):
    r"""Replace all substrings of the specified string value that match regexp with rep.

    .. versionadded:: 1.5.0

    Examples
    --------
    >>> df = spark.createDataFrame([('100-200',)], ['str'])
    >>> df.select(regexp_replace('str', r'(\d+)', '--').alias('d')).collect()
    [Row(d='-----')]
    """
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    return Column(jc)
You see that the str argument is not used directly but is wrapped in the _to_java_column function. The source code of _to_java_column clearly shows that it works with both column names (strings) and Column objects:
def _to_java_column(col: "ColumnOrName") -> "JavaObject":
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, str):
        jcol = _create_column_from_name(col)
    else:
        raise PySparkTypeError(
            errorClass="NOT_COLUMN_OR_STR",
            messageParameters={"arg_name": "col", "arg_type": type(col).__name__},
        )
    return jcol
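As a quick sanity check (a minimal sketch, assuming an active SparkSession), you can verify that both call forms resolve to the same column expression:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Both forms are routed through _to_java_column, so they end up as the
# same underlying JVM column expression
c1 = F.regexp_replace("name", "i", "")
c2 = F.regexp_replace(F.col("name"), "i", "")
print(str(c1) == str(c2))  # True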
When browsing the source page of functions, you see that _to_java_column is omnipresent, which means that for most functions (or even all, though I didn't check) both column names and Column objects can be used.