I am working with Python and PySpark to extend the SPSS Modeler.
I want to manipulate ~5000 columns and therefore use the following construct:
for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))
This is very slow. Is there a more efficient way to do this for all target columns?
targets contains a list of column names to be used, and function(target) is a placeholder where I do stuff with different columns, like adding and dividing.
I would be happy if you could help me :)
pandayo
Try this:
# Build every new column in one projection instead of one withColumn call per target.
inputData = inputData.select(
    '*',
    *(function(target).alias(target+appendString) for target in targets)
)
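
For anyone who wants to try this end to end, here is a minimal sketch of the same idea wrapped in a helper. As far as I understand it, each withColumn call adds another projection to the query plan, so thousands of calls in a loop produce a very large plan, while a single select adds all columns in one step. The names add_derived_columns, make_expr, and demo, as well as the sample data and the "halve the column" transformation, are made up for illustration; make_expr stands in for function(target) from the question and is assumed to return a PySpark Column.

```python
def add_derived_columns(df, targets, suffix, make_expr):
    """Add one derived column per target using a single select().

    make_expr(name) is a hypothetical stand-in for function(target)
    in the question and must return a pyspark Column.
    """
    # Build all column expressions first; nothing is computed yet.
    new_cols = [make_expr(t).alias(t + suffix) for t in targets]
    # One projection instead of one withColumn() call per target.
    return df.select("*", *new_cols)


def demo():
    # Imported here so the helper above can be imported without Spark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    inputData = spark.createDataFrame([(1.0, 4.0), (2.0, 8.0)], ["a", "b"])

    # Assumed example transformation: halve each target column.
    result = add_derived_columns(
        inputData, ["a", "b"], "_half", lambda t: F.col(t) / 2
    )
    result.show()
    spark.stop()
```

Calling demo() needs a working local Spark installation; the resulting DataFrame keeps the original columns a and b and gains a_half and b_half. Passing the expression builder as an argument keeps the helper independent of what the transformation actually does.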