azure amazon-emr azure-hdinsight

Persistent and transient EMR-equivalent clusters in Azure HDInsight


I would like to know whether an HDInsight cluster created on Azure is fully reliable to work on. For example, I don't want to find after a while that one of the nodes has been wiped or deleted and my data is lost.

I know EMR has two types of cluster, transient and persistent, but I still suspect that even a persistent cluster could lose a node's data at some point.

Does this happen with Azure HDInsight as well? I would like to hear from people who have experience with this.

Thanks


Solution

  • Azure HDInsight clusters are similar to persistent clusters in EMR.

    On-demand HDInsight Hadoop clusters are similar to transient clusters in EMR.

    AWS to Azure services comparison:

    AWS Service | Azure Service | Description
    EMR | Azure Data Explorer | Fully managed, low latency, distributed big data analytics platform to run complex queries across petabytes of data.
    EMR | Databricks | Apache Spark-based analytics platform.
    EMR | HDInsight | Managed Hadoop service. Deploy and manage Hadoop clusters in Azure.
    EMR | Data Lake Storage | Massively scalable, secure data lake functionality built on Azure Blob Storage.

    Azure HDInsight follows a strong separation of compute and storage. As such, the recommendation is to store your data in Azure Storage blobs, in Azure Data Lake Store, or in a combination of the two. Both provide an HDFS-compatible file system that persists data even if the cluster is deleted.

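    The benefit of this approach is that a cluster can be deleted when it is no longer needed without losing the data, because the data lives in the storage account rather than on the cluster nodes.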
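
    As a rough illustration of what storing data outside the cluster looks like in practice, jobs address files by a storage URI rather than by a path on a node's local disk. The sketch below builds such URIs in Python. The `wasbs://` and `abfss://` schemes are the ones used for Azure Blob Storage and Azure Data Lake Storage Gen2; the account name `mystorageacct`, the container name `data`, and the helper function names are hypothetical, and the use of Gen2 (rather than Gen1) is an assumption.

```python
from urllib.parse import urlparse


def blob_uri(container: str, account: str, path: str) -> str:
    """Build a wasbs:// URI for a file in an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_gen2_uri(filesystem: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a file in Azure Data Lake Storage Gen2 (assumed)."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def storage_account_of(uri: str) -> str:
    """Return the storage account name embedded in a wasbs:// or abfss:// URI."""
    host = urlparse(uri).hostname  # e.g. "mystorageacct.blob.core.windows.net"
    return host.split(".")[0]


if __name__ == "__main__":
    # Hypothetical names: the URI identifies the storage account, not a cluster.
    uri = blob_uri("data", "mystorageacct", "/logs/2020/part-0000.csv")
    print(uri)
    print(storage_account_of(uri))
```

    Because the URI names the storage account and not any cluster node, a new cluster attached to the same account can read the same files after the old cluster has been deleted.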

    For more details, refer to "Azure Storage overview in HDInsight" and "Use Azure storage with Azure HDInsight clusters".