[SOLVED] Different dropDuplicates signature in Databricks and official PySpark code

I noticed that on Databricks, calling help() on PySpark's dropDuplicates returns

dropDuplicates(self, *subset: Union[str, List[str]]) -> 'DataFrame'

which is different from the official version (without a star before the subset parameter)

dropDuplicates(self, subset: Union[str, List[str]]) -> 'DataFrame'

It's really weird, because even the documentation on the Databricks pages is consistent with the official one.

But in practice the implementation behaves as described in the in-code help.

Does anyone know what happened here? (Databricks cluster version: 15.4 LTS with pyspark 3.5.0)

Solution

I tried it on DBR 15.3 (pyspark 3.5.0), and the difference does seem to have appeared in 15.4 LTS. The behavior differs too: on 15.3, calling the method with star notation does not work, which is in line with the documentation.
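If you want to check which variant a given runtime ships without reading the help text, something like the sketch below may help. It uses only the standard inspect module. The two functions are hypothetical stand-ins that mirror the two signatures quoted in the question, not the real PySpark code.

```python
import inspect
from typing import Callable, List, Union


def accepts_star_args(func: Callable) -> bool:
    """Return True if func declares a *args-style (var-positional) parameter."""
    params = inspect.signature(func).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params)


# Hypothetical stand-in for the signature shown in the official documentation.
def drop_duplicates_list_style(self, subset: Union[str, List[str]]) -> str:
    return "list style"


# Hypothetical stand-in for the signature reported by help() on DBR 15.4 LTS.
def drop_duplicates_star_style(self, *subset: Union[str, List[str]]) -> str:
    return "star style"


list_is_star = accepts_star_args(drop_duplicates_list_style)
star_is_star = accepts_star_args(drop_duplicates_star_style)
print(list_is_star, star_is_star)
```

On a cluster, the same helper should work when pointed at the real method, e.g. accepts_star_args(DataFrame.dropDuplicates) after from pyspark.sql import DataFrame, though I have only tried the idea on plain functions like the ones above.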

The change is listed on this page, under the Apache Spark update SPARK-48482.

However, it is a bit odd that the star notation does not appear even in Databricks' own pyspark documentation. Databricks can differ from official Spark at times, because they ship some updates earlier than the official Spark releases.
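Since the signature can differ between runtimes, code that has to run on several of them could pick the calling style at run time. This is a minimal sketch under that assumption. call_with_columns is a hypothetical helper name, and the two stand-in functions only imitate the two signatures from the question.

```python
import inspect
from typing import Any, Callable, Sequence


def call_with_columns(method: Callable[..., Any], columns: Sequence[str]) -> Any:
    """Call method with columns as varargs if it declares *args, else as one list."""
    params = inspect.signature(method).parameters.values()
    if any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params):
        return method(*columns)
    return method(list(columns))


# Hypothetical stand-ins; each just reports how it received its arguments.
def list_style(subset=None):
    return ("list", subset)


def star_style(*subset):
    return ("star", subset)


list_result = call_with_columns(list_style, ["id", "name"])
star_result = call_with_columns(star_style, ["id", "name"])
print(list_result)
print(star_result)
```

With a real DataFrame df (not defined here), the call would look like call_with_columns(df.dropDuplicates, ["id", "name"]). The helper decides from the method's declared signature only, so it does not depend on knowing the runtime version.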