python, python-2.7, pyspark, apache-spark-sql, spss-modeler

Using df.withColumn() on multiple columns


I am working with Python and PySpark to extend SPSS Modeler.

I want to manipulate ~5000 columns, and therefore use the following construct:

for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))

This is very slow. Is there a more efficient way to do this for all target columns?

`targets` is a list of the column names to be used; `function(target)` is a placeholder for the per-column expressions I build (things like adding and dividing columns).

I would be happy if you could help me :)

pandayo


Solution

  • Try this:

    inputData = inputData.select(
        '*',
        *(function(target).alias(target + appendString) for target in targets)
    )

    Each `withColumn` call returns a new DataFrame and adds another projection to the query plan, so a loop over ~5000 columns builds ~5000 nested plan nodes, which makes analysis very slow. A single `select` with all the expressions builds just one projection.