pyspark, azure-databricks, dataverse

PySpark: read data into a DataFrame as decimal


Hello, I am trying to read Dataverse data into Databricks. Generally, getting the data over the API works fine.

But converting the data into a PySpark DataFrame throws errors if the Dataverse data includes DecimalType or MoneyType (both being decimal numbers).

I try to read this into the DataFrame as DecimalType() but get an error. Below is a made-up example of the data the API returns and my attempt to read it into a DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

df = spark.createDataFrame(data=sample_data, schema=schema)

This throws the following error:

DecimalType(6,2) can not accept object 1234.56 in type float.

Is there any way to make this work? As a workaround, I was able to read it as a StringType and cast it to Decimal after the data is in the DataFrame:

from pyspark.sql import functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

# Define the schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    #StructField("Salary", DecimalType(6, 2), True)
    StructField("Salary", StringType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data=sample_data, schema=schema)

df = df.withColumn("Salary", psf.col("Salary").cast(DecimalType(6, 2)))

That way works, but it seems like an unneeded step.

And on top of that, the Dataverse API does not return precision and scale for its fields. I would go with the maximum(?) of 18 precision and a scale of 10 or 4 (Decimal and Money respectively).
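For illustration, this is roughly what I mean by falling back to fixed defaults. A minimal sketch, assuming 18/10 for Decimal and 18/4 for Money (my guess, not values the API returns); the Discount column is just a made-up second field:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

# Assumed defaults when Dataverse metadata does not expose precision/scale
DATAVERSE_DECIMAL = DecimalType(18, 10)  # up to 18 digits, 10 after the decimal point
DATAVERSE_MONEY = DecimalType(18, 4)     # up to 18 digits, 4 after the decimal point

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DATAVERSE_MONEY, True),     # a Money field
    StructField("Discount", DATAVERSE_DECIMAL, True)  # a Decimal field (hypothetical)
])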


Solution

  • Yes, DecimalType(6, 2) cannot accept float literals (1234.56) directly, because pyspark.sql.types.DecimalType expects values of type decimal.Decimal, not float.

    In your workaround, you read the Salary field as StringType and then cast it to DecimalType, which works because Spark parses the string into a decimal internally.

    So you can try the workaround below to fix the error.

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType
    
    sample_data = [
        ("Alice", 34, 1234.56),
        ("Bob", 45, 2345.67),
        ("Cathy", 29, 3456.78)
    ]
    
    schema = StructType([
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Salary", DecimalType(6, 2), True)
    ])
    

    If your data comes from JSON or API responses and the numbers are still floats, convert them to Decimal first. The helper below converts the records dynamically (a more general, schema-driven variant is sketched at the end of this answer):

    from decimal import Decimal
    
    def convert_floats_to_decimal(records):
        # Convert via str() so the Decimal carries the intended value (e.g. 1234.56)
        # rather than the float's binary representation.
        return [
            (name, age, Decimal(str(salary)))
            for name, age, salary in records
        ]
    
    df = spark.createDataFrame(data=convert_floats_to_decimal(sample_data), schema=schema)
    display(df)
    

    Output:

    The displayed DataFrame contains the three rows with Salary shown as decimal(6,2): 1234.56, 2345.67 and 3456.78.
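    As mentioned above, here is a more general, schema-driven variant for the case where the records come from the Dataverse API as dicts and the set of decimal/money columns is not known up front. This is only a sketch under that assumption; the column map and the DecimalType(18, 4) default are illustrative, not something the API prescribes:

    from decimal import Decimal
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

    # Illustrative mapping of decimal/money columns to their target types
    decimal_columns = {"Salary": DecimalType(18, 4)}  # e.g. a Money field

    def convert_record(record, decimal_columns):
        # Replace float/str values in decimal columns with decimal.Decimal,
        # going through str() to avoid binary float artifacts.
        converted = dict(record)
        for name in decimal_columns:
            if converted.get(name) is not None:
                converted[name] = Decimal(str(converted[name]))
        return converted

    api_records = [
        {"Name": "Alice", "Age": 34, "Salary": 1234.56},
        {"Name": "Bob", "Age": 45, "Salary": 2345.67},
    ]

    schema = StructType(
        [StructField("Name", StringType(), True), StructField("Age", IntegerType(), True)]
        + [StructField(name, dtype, True) for name, dtype in decimal_columns.items()]
    )

    converted = [convert_record(r, decimal_columns) for r in api_records]
    rows = [tuple(rec.get(f.name) for f in schema.fields) for rec in converted]

    df = spark.createDataFrame(rows, schema=schema)
    display(df)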