apache-spark hadoop yandex

Yandex Dataproc Architecture: Purpose of "Data" Nodes?


I've been exploring Spark on Google Dataproc, where the standard architecture comprises master and worker nodes. On Google Dataproc, the master nodes typically host the HDFS NameNode and YARN ResourceManager, while worker nodes run the HDFS DataNode and YARN NodeManager.

However, when I set up a cluster with Yandex Dataproc, the suggested architecture includes master, data, and compute nodes. I'm curious about the role and advantage of these "data" nodes. What components are hosted on them? Since data nodes also consume CPU and RAM, this design choice seems likely to increase costs. Unfortunately, I couldn't find a detailed explanation in the Yandex documentation.

Can anyone shed light on the rationale behind this architecture in Yandex Dataproc?


Solution

  • It's essentially the same architecture as GCP Dataproc.

    Master nodes run the NameNode and/or ResourceManager,

    Data nodes run the actual HDFS DataNodes.

    https://cloud.yandex.com/en/docs/data-proc/concepts/

    Compute nodes will have the highest associated cost, followed by the master nodes. Storing data blocks does not require much CPU or memory, so for data nodes network throughput should be prioritized instead.
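
    The split into three subcluster roles shows up directly when provisioning. Below is a sketch using the Terraform Yandex provider's `yandex_dataproc_cluster` resource; the preset names, disk sizes, and IDs are illustrative placeholders, the resource is trimmed (required fields such as subnet and service-account IDs are elided), and exact field names should be checked against the provider documentation:

    ```hcl
    # Sketch: one master subcluster (NameNode/ResourceManager), storage-oriented
    # data nodes, and CPU/RAM-heavy compute nodes. All values are placeholders.
    resource "yandex_dataproc_cluster" "example" {
      name = "dataproc-example"

      cluster_config {
        hadoop {
          services = ["HDFS", "YARN", "SPARK"]
        }

        subcluster_spec {
          name = "master"
          role = "MASTERNODE"               # NameNode / ResourceManager
          resources {
            resource_preset_id = "s2.small" # modest CPU/RAM is enough
            disk_size          = 40
          }
          hosts_count = 1
        }

        subcluster_spec {
          name = "data"
          role = "DATANODE"                 # HDFS DataNode: storage-oriented
          resources {
            resource_preset_id = "s2.small" # small preset, larger disks;
            disk_size          = 256        # prioritize network throughput
          }
          hosts_count = 2
        }

        subcluster_spec {
          name = "compute"
          role = "COMPUTENODE"              # YARN NodeManager only: sized
          resources {                       # for CPU/RAM, the costly part
            resource_preset_id = "s2.large"
            disk_size          = 40
          }
          hosts_count = 2
        }
      }
    }
    ```

    The point of the sketch is the asymmetry: the compute subcluster gets the expensive preset, while the data subcluster gets cheap CPU/RAM but large disks, which is why splitting data from compute can lower cost rather than raise it.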