python-2.7, pyspark, apache-spark-sql, apache-spark-2.3

Create a new column in a PySpark DataFrame using existing columns


I am working with PySpark DataFrames and I would like to know how I can create and populate a new column using existing columns.

Let's say I have a DataFrame that looks like this:

+-----+---+---+
|   _1| _2| _3|
+-----+---+---+
|x1-y1|  3| z1|
|x2-y2|  2| z2|
|x3-y3|  1| z3|
+-----+---+---+

I am looking for a way to create a DataFrame which looks like this:

+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+

_4 is just _1 with the '-' removed, and _5 joins the values from _1 and _3 with '='.

Thanks!


Solution

  • You can use pyspark.sql.functions (regexp_replace, concat, lit) to achieve this.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    spark = SparkSession.builder.appName("test").getOrCreate()
    data = [('x1-y1', 3, 'z1'),
            ('x2-y2', 2, 'z2'),
            ('x3-y3', 1, 'z3')]
    test_df = spark.createDataFrame(data, schema=['_1', '_2', '_3'])
    
    # _4: _1 with the '-' removed
    test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
    # _5: _1 with '-' replaced by '=', then '=' and the value of _3 appended
    test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
    test_df.show()
    
    +-----+---+---+----+--------+
    |   _1| _2| _3|  _4|      _5|
    +-----+---+---+----+--------+
    |x1-y1|  3| z1|x1y1|x1=y1=z1|
    |x2-y2|  2| z2|x2y2|x2=y2=z2|
    |x3-y3|  1| z3|x3y3|x3=y3=z3|
    +-----+---+---+----+--------+
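
  • Alternatively, here is a sketch not from the original answer, assuming each _1 value contains exactly one '-': split _1 once with F.split and rebuild both columns with F.concat_ws, which avoids the second regexp_replace.

    # Split '_1' on '-' into an array column, e.g. 'x1-y1' -> ['x1', 'y1']
    parts = F.split(F.col('_1'), '-')
    test_df = (test_df
               .withColumn('_4', F.concat_ws('', parts[0], parts[1]))
               .withColumn('_5', F.concat_ws('=', parts[0], parts[1], F.col('_3'))))
    test_df.show()

    This produces the same _4 and _5 columns as above; concat_ws also has the side effect of skipping NULL inputs rather than propagating them, which may or may not be what you want.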