If I create an HDInsight cluster on Azure, is it fully reliable to work on? For example, I don't want to find after a while that one of the nodes has been wiped or deleted and that I have lost my data.
I know EMR has two types of cluster, transient and persistent, but I still suspect that even a persistent cluster could lose a node's data at some point.
Does this happen with Azure HDInsight as well? I would like to hear from people who have experience with this.
Thanks
Azure HDInsight clusters are similar to persistent clusters in EMR.
On-demand HDInsight Hadoop clusters are similar to transient clusters in EMR.
AWS to Azure services comparison:
AWS Service | Azure Service | Description |
---|---|---|
EMR | Azure Data Explorer | Fully managed, low-latency, distributed big data analytics platform to run complex queries across petabytes of data. |
EMR | Databricks | Apache Spark-based analytics platform. |
EMR | HDInsight | Managed Hadoop service. Deploy and manage Hadoop clusters in Azure. |
EMR | Data Lake Storage | Massively scalable, secure data lake functionality built on Azure Blob Storage. |
Azure HDInsight follows a strong separation of compute and storage. As such, the recommendation is to store your data in Azure Storage blobs, Azure Data Lake Store, or a combination of the two. Both provide an HDFS-compatible file system that persists data even if the cluster is deleted.
The benefit of this approach is:
For more details, refer to "Azure Storage overview in HDInsight" and "Use Azure storage with Azure HDInsight clusters".
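To make the storage separation concrete, here is a minimal sketch of how paths on the external stores are addressed from a cluster. The `wasbs://` and `adl://` URI layouts are the standard ones for Azure Blob Storage and Azure Data Lake Store (Gen1). The helper names and the container, account, and file names are hypothetical and only for illustration.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build a URI for data in Azure Blob Storage.

    Layout: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adl_uri(account: str, path: str = "") -> str:
    """Build a URI for data in Azure Data Lake Store (Gen1).

    Layout: adl://<account>.azuredatalakestore.net/<path>
    """
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical names; substitute your own container and storage account.
    print(wasb_uri("mycontainer", "mystorageacct", "/example/data/sample.log"))
    print(adl_uri("mydatalake", "/clusters/mycluster/output"))
```

Because these URIs point at the storage account rather than at any node's local disk, a job that reads and writes through them should still find its data after a node is replaced or the cluster is deleted and recreated.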