Hello, I am trying to read Dataverse data into Databricks. Getting the data over the API generally works fine.
But converting the data into a PySpark DataFrame throws errors if the Dataverse data includes DecimalType or MoneyType columns (both being decimal numbers).
I try to read these into the DataFrame as DecimalType(), but get an error. Below is a made-up example of the data the API returns and my attempt to read it as a DataFrame.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

df = spark.createDataFrame(data=sample_data, schema=schema)
This throws the following error:
DecimalType(6,2) can not accept object 1234.56 in type float.
Is there any way to make it work? As a workaround I was able to read it as a StringType and cast it to DecimalType after it is in the table:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType
import pyspark.sql.functions as psf

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

# Define the schema, reading Salary as a string for now
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    # StructField("Salary", DecimalType(6, 2), True)
    StructField("Salary", StringType(), True)
])

# Create the DataFrame, then cast Salary to Decimal
df = spark.createDataFrame(data=sample_data, schema=schema)
df = df.withColumn("Salary", psf.col("Salary").cast(DecimalType(6, 2)))
That way works, but it seems like an unneeded step.
And on top of that, the Dataverse API does not return precision and scale for its fields. I would go with the maximum(?) of 18 precision and a scale of 10 or 4 (for Decimal and Money fields respectively).
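For example, this is roughly what I have in mind (the column names below are made up, not real Dataverse fields):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Assumed defaults: DecimalType(18, 10) for Dataverse Decimal columns,
# DecimalType(18, 4) for Money columns (hypothetical field names)
dataverse_schema = StructType([
    StructField("name", StringType(), True),
    StructField("some_decimal_field", DecimalType(18, 10), True),
    StructField("some_money_field", DecimalType(18, 4), True)
])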
Yes, DecimalType(6, 2) cannot accept float literals (1234.56) directly, because pyspark.sql.types.DecimalType expects values of type decimal.Decimal, not float.
In your workaround, you read the Salary field as StringType and then cast it to DecimalType, which works because Spark parses the string into a decimal internally.
You can try the workaround below to fix the error.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])
If your data comes from JSON or API responses and the values are still floats, convert them to decimal.Decimal first. We can create a small helper function that converts the float values dynamically:
from decimal import Decimal

def convert_floats_to_decimal(records):
    # Build the Decimal from the string form of the float so the value stays
    # exactly 1234.56 instead of picking up the float's binary representation
    return [
        (name, age, Decimal(str(salary)))
        for name, age, salary in records
    ]
df = spark.createDataFrame(data=convert_floats_to_decimal(sample_data), schema=schema)
display(df)
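If you want to double-check the resulting types, printing the schema should show the Salary column as decimal(6,2):

df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: integer (nullable = true)
#  |-- Salary: decimal(6,2) (nullable = true)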