pyspark, aws-glue, aws-glue-spark, aws-glue3.0

Cast Issue with AWS Glue 3.0 - Pyspark


I'm using Glue 3.0 and running the following:

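from pyspark.sql import functions as f

# `spark` is the SparkSession provided by the Glue job (or the pyspark shell)
data = [("Java", "6241499.16943521594684385382059800664452")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
df.show()
df.select(f.col("_2").cast("decimal(15,2)")).show()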

I get the following result:

+----+--------------------+
|  _1|                  _2|
+----+--------------------+
|Java|6241499.169435215...|
+----+--------------------+

+----+
|  _2|
+----+
|null|
+----+

Locally, with pyspark==3.2.1, the string casts to decimal(15,2) without any issue, but the Glue job returns null for the same cast.
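
One detail that may be relevant, offered as an assumption rather than a confirmed cause: the literal contains more digits than the 38 that Spark's DecimalType can hold. The plain-Python sketch below counts the digits and computes the value the cast is expected to give, assuming half-up rounding to two places. It does not use Spark at all.

```python
from decimal import Decimal, ROUND_HALF_UP

raw = "6241499.16943521594684385382059800664452"

# Count the digits in the literal (integer part plus fractional part).
integer_part, _, fractional_part = raw.partition(".")
total_digits = len(integer_part) + len(fractional_part)

# Round to two decimal places, assuming half-up rounding like the cast.
expected = Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(total_digits)  # 39
print(expected)      # 6241499.17
```

If the digit count is indeed what triggers the null, shortening the fractional part before the cast should avoid it.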


Solution

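  • The problem appears to be specific to AWS Glue. Glue 3.0 runs Spark 3.1 rather than the 3.2.1 used locally, which may explain the difference. To work around it, I truncate the fractional part of the string before doing the cast: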

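    from pyspark.sql import functions as f
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType


    def prepareStringDecimal(str_):
        """
        Truncate the fractional part of a numeric string to 5 digits.
        :param str_: "1234.123456789"
        :return: "1234.12345"
        """
        arr = str(str_).split(".")
        if len(arr) > 1:
            # keep the integer part and the first 5 fractional digits
            return arr[0] + "." + arr[1][:5]
        else:
            # no decimal point: return the value unchanged
            return str_


    # convert function to UDF
    convertUDF = udf(lambda z: prepareStringDecimal(z), StringType())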
    
    data = [("Java", "6241499.16943521594684385382059800664452")]
    df = spark.sparkContext.parallelize(data).toDF()
    df.show()
    df.select(convertUDF(f.col("_2")).cast("decimal(15,2)")).show()
    

    Output

    +----+--------------------+
    |  _1|                  _2|
    +----+--------------------+
    |Java|6241499.169435215...|
    +----+--------------------+
    
    +-----------------------------------+
    |CAST(<lambda>(_2) AS DECIMAL(15,2))|
    +-----------------------------------+
    |                         6241499.17|
    +-----------------------------------+
    

    Note: the same truncation can be done with built-in Spark SQL functions instead of a Python UDF.
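
As a rough sketch of what the note above suggests, the truncation could be done with `regexp_extract` from `pyspark.sql.functions` rather than a UDF. The names `TRUNCATE_PATTERN` and `cast_truncated` are hypothetical, the 5-digit cutoff simply mirrors the UDF, and this has not been verified on a Glue 3.0 job.

```python
import re

# Keep an optional sign, the integer digits, and at most 5 fractional digits.
# Hypothetical constant; the 5-digit cutoff mirrors the UDF in the answer.
TRUNCATE_PATTERN = r"^(-?\d+(?:\.\d{1,5})?)"


def cast_truncated(df, column, decimal_type="decimal(15,2)"):
    """Truncate `column` with regexp_extract, then cast it to `decimal_type`."""
    # Imported here so the pattern can be checked without a Spark install.
    from pyspark.sql import functions as f

    truncated = f.regexp_extract(f.col(column), TRUNCATE_PATTERN, 1)
    return df.select(truncated.cast(decimal_type).alias(column))


# The pattern can be sanity-checked in plain Python before running on Glue.
sample = "6241499.16943521594684385382059800664452"
print(re.match(TRUNCATE_PATTERN, sample).group(1))  # 6241499.16943
```

With the DataFrame from the answer, the call would be `cast_truncated(df, "_2").show()`. Staying with built-in functions avoids the Python serialization overhead that a UDF adds for each row.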