Tags: python, dataframe, apache-spark, pyspark, rounding

How can I turn off rounding in Spark?


I have a dataframe and I'm doing this:

df = dataframe.withColumn("test", lit(0.4219759403))

I want to keep just the first four digits after the decimal point, without rounding.

When I cast to DecimalType with .cast(DataTypes.createDecimalType(20,4)), or even when I use the round function, the number is rounded to 0.4220.

The only way I have found to avoid rounding is the format_number() function, but it returns a string, and when I cast that string to DecimalType(20,4), the framework rounds the number again to 0.4220.

I need to convert this number to DecimalType(20,4) without rounding, and I expect to see 0.4219.
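
To make the difference concrete, here is a small plain-Python illustration using the standard decimal module. It is not Spark code and makes no claim about Spark's internals; it only shows the two behaviours on the same value: rounding to four places versus cutting off after four places.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

value = Decimal("0.4219759403")
four_places = Decimal("0.0001")  # target scale: four digits after the point

# Rounding to the nearest value at four places (what I get today).
rounded = value.quantize(four_places, rounding=ROUND_HALF_UP)

# Truncating after four places (what I want).
truncated = value.quantize(four_places, rounding=ROUND_DOWN)

print(rounded)    # 0.4220
print(truncated)  # 0.4219
```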


Solution

  • If your numbers can have more than one digit before the decimal point, substr is not suitable. Instead, you can use a regex that always extracts the integer part plus at most the first 4 decimal digits (if present).
    You can do this with regexp_extract. Note that the result is a string column.

    df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
    

    Example

    import pyspark.sql.functions as F
    
    dataframe = spark.createDataFrame([
        (0.4219759403, ),
        (0.4, ),
        (1.0, ),
        (0.5431293, ),
        (123.769859, )
    ], ['test'])
    df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
    df.show()
    
    +------------+--------+
    |        test| rounded|
    +------------+--------+
    |0.4219759403|  0.4219|
    |         0.4|     0.4|
    |         1.0|     1.0|
    |   0.5431293|  0.5431|
    |  123.769859|123.7698|
    +------------+--------+