I am trying to do a cost comparison between AWS Glue and Databricks hosted in an AWS environment. For the comparison, I have chosen m4.xlarge, which is the equivalent of 1 DPU in AWS Glue (4 vCPUs / 16 GB memory).
Assume I have a PySpark job that is expected to run for 1 hour daily for 30 days with 5 DPUs. My cost estimate as per the AWS calculator is as follows:
Glue cost estimate: 5 DPUs x 30.00 hours x 0.44 USD per DPU-hour = 66.00 USD (Apache Spark ETL job cost)
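In Python terms, the same arithmetic (figures taken straight from the AWS estimate above) is just:

```python
# AWS Glue: DPUs x hours x price per DPU-hour (figures from the estimate above)
glue_dpus = 5
glue_hours_per_month = 30           # 1 hour/day x 30 days
glue_price_per_dpu_hour = 0.44      # USD, Apache Spark ETL job rate

glue_monthly_cost = glue_dpus * glue_hours_per_month * glue_price_per_dpu_hour
print(f"Glue monthly estimate: {glue_monthly_cost:.2f} USD")  # 66.00 USD
```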
Databricks cost estimator: this gives a monthly estimate of 74 USD.
I am concerned about whether we have to pay any EC2 cost to AWS for the 6 nodes in addition to this 74 USD. This is due to the note added in the estimate: "This Pricing Calculator provides only an estimate of your Databricks cost. Your actual cost depends on your actual usage. Also, the estimated cost doesn't include cost for any required AWS services (e.g. EC2 instances)."
That would be approximately an additional 36 USD for this instance type/count, on top of the Databricks cost. Can someone please clarify, so we can decide whether to go with AWS Glue or Databricks? I know that in Databricks we can choose any instance type, but the question is whether I pay the EC2 cost separately. Thanks.
The answer is yes.
You have to pay for all the infrastructure used directly by Databricks.
As mentioned in the note you quoted: "This Pricing Calculator provides only an estimate of your Databricks cost. Your actual cost depends on your actual usage. Also, the estimated cost doesn't include the cost for any required AWS services (e.g. EC2 instances)."
Think of it as a software license on top of hardware costs that you would pay anyway, whether you use the software or not.
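To make that concrete with the numbers from your question, here is a minimal sketch. The ~0.20 USD/hour m4.xlarge on-demand rate is an assumption that matches your ~36 USD figure and should be checked against current EC2 pricing for your region:

```python
# Rough monthly comparison using the figures from the question above
nodes = 6
hours_per_month = 30                  # 1 hour/day x 30 days
ec2_price_per_hour = 0.20             # USD, assumed m4.xlarge on-demand rate (check your region)

ec2_monthly_cost = nodes * hours_per_month * ec2_price_per_hour   # ~36 USD, billed by AWS
databricks_dbu_estimate = 74.0        # USD, from the Databricks pricing calculator

total_databricks = databricks_dbu_estimate + ec2_monthly_cost
print(f"EC2 instances:        {ec2_monthly_cost:.2f} USD")        # 36.00 USD
print(f"Databricks + EC2:     {total_databricks:.2f} USD")        # 110.00 USD
print(f"Glue (for reference): 66.00 USD")
```

So with this instance type and schedule, the decision is essentially 66 USD for Glue versus roughly 110 USD for Databricks plus EC2, before any discounts, spot pricing, or other AWS services.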
This point was verified with the Databricks Solutions Architect who is working with our company on implementing the Databricks solution.