amazon-web-services, amazon-ec2, cloudera, cloudera-cdh, cloudera-director

Architect a Cloudera CDH cluster on AWS: instances and storage


I have some doubts about a deployment of CDH on AWS. I have read the reference architecture doc and other material I found on the Cloudera Engineering Blog, but I need some more guidance.

1) Is CDH deployment limited to certain instance types, or can I deploy it on any AWS instance type?

2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster, I understand it's better to base the cluster on local-storage instances. For a cluster of about 2 PB, I think d2.8xlarge should be the best choice for the data nodes.

About the Master Nodes:
- If I want to deploy only 3 master nodes, is it better to have them as local-storage instances too, or as EBS-backed instances so I can react quickly to a possible master node failure?
- Are there any best practices for the master node instance type (EBS or local-storage)?

About the Data Nodes:
- If a data node fails, does CDH have an automated mechanism to spin up a new instance and connect it to the cluster so the cluster is restored without downtime, or do we have to build a script from scratch for this?

About the Edge Nodes:
- Are there any best practices for the instance type (EBS or local-storage)?

3) If I want to back up the cluster to S3:
- When I do a distcp from CDH to S3, can I move the data directly into Glacier instead of the standard S3 storage class?
- If the data is compressed (e.g. Snappy, gzip, etc.) and I distcp it to S3, is the space occupied on S3 the same, or does the distcp command decompress the data during the copy?

If I have a cluster based on EBS-backed instances, is it possible to snapshot the disks and bring a data node back with EBS volumes rebuilt from the snapshots?

4) If the data nodes are deployed as r4.8xlarge and I need more horsepower, is it possible to scale the cluster up from r4.8xlarge to r4.16xlarge on the fly, detaching and re-attaching the disks in a few minutes?

Thanks a lot for the clarifications; I hope my questions will also help other users.


Solution

    1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro-size instance would work for much of anything; a type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
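
    As a quick sanity check before settling on types, you can pull the vCPU, memory, and instance-storage figures from the EC2 API. A minimal boto3 sketch, with the region and candidate types as placeholders:

        import boto3

        # Placeholder region and candidate instance types; adjust to your shortlist.
        ec2 = boto3.client("ec2", region_name="us-east-1")

        resp = ec2.describe_instance_types(
            InstanceTypes=["m5.xlarge", "r4.8xlarge", "d2.8xlarge"]
        )
        for it in resp["InstanceTypes"]:
            name = it["InstanceType"]
            vcpus = it["VCpuInfo"]["DefaultVCpus"]
            mem_gib = it["MemoryInfo"]["SizeInMiB"] / 1024
            # InstanceStorageInfo is only present for types with local instance storage.
            local_gb = it.get("InstanceStorageInfo", {}).get("TotalSizeInGB", 0)
            print(f"{name}: {vcpus} vCPUs, {mem_gib:.0f} GiB RAM, {local_gb} GB instance storage")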

    2) You should stick with EBS for the root volume of your instances. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
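
    If it helps, here's a minimal boto3 sketch of launching an instance whose root device is an explicit EBS volume; the AMI ID, key pair, subnet, device name, and volume size are all placeholders to swap for your own:

        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")

        # Hypothetical launch request: the block device mapping pins the root
        # device to an EBS gp2 volume that is deleted when the instance terminates.
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",       # placeholder AMI
            InstanceType="r4.8xlarge",
            KeyName="my-key",             # placeholder key pair
            SubnetId="subnet-xxxxxxxx",   # placeholder subnet
            MinCount=1,
            MaxCount=1,
            BlockDeviceMappings=[
                {
                    "DeviceName": "/dev/sda1",   # root device name depends on the AMI
                    "Ebs": {
                        "VolumeSize": 100,
                        "VolumeType": "gp2",
                        "DeleteOnTermination": True,
                    },
                }
            ],
        )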

    CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
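
    To illustrate the roll-your-own option, a very rough sketch of a replacement loop follows. is_datanode_healthy and launch_replacement_datanode are hypothetical helpers you'd implement against the Cloudera Manager API (or Cloudera Director) and EC2; this is not something CDH provides out of the box:

        import time

        # Placeholder list of datanode hosts to watch.
        DATANODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]

        def is_datanode_healthy(host):
            # Hypothetical check: in practice, query the Cloudera Manager API
            # for the host's health summary. Stubbed to "healthy" here.
            return True

        def launch_replacement_datanode(failed_host):
            # Hypothetical helper: launch a pre-baked datanode AMI with boto3
            # (ec2.run_instances) and register it with the cluster, e.g. via
            # Cloudera Director or the Cloudera Manager API.
            print(f"would replace {failed_host}")

        def watch(poll_seconds=300):
            while True:
                for host in DATANODES:
                    if not is_datanode_healthy(host):
                        launch_replacement_datanode(host)
                time.sleep(poll_seconds)

        if __name__ == "__main__":
            watch()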

    3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time, or you can write to Glacier directly through its own API; it doesn't look like direct Glacier access can be done through the s3a connector, though, so distcp can't target Glacier itself.

    I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3, so compressed files occupy the same space after the copy.

    You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance. That said, this isn't necessarily a great way to back up data nodes compared with the distcp route, because each data node is unique and its data keeps changing while the cluster runs.
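
    A minimal sketch of the lifecycle-rule approach, assuming a placeholder bucket my-cdh-backups and a backups/ prefix that distcp writes into; objects transition to Glacier 30 days after they land:

        import boto3

        s3 = boto3.client("s3")

        # Placeholder bucket and prefix; adjust the transition delay to taste.
        s3.put_bucket_lifecycle_configuration(
            Bucket="my-cdh-backups",
            LifecycleConfiguration={
                "Rules": [
                    {
                        "ID": "archive-distcp-output",
                        "Filter": {"Prefix": "backups/"},
                        "Status": "Enabled",
                        "Transitions": [
                            {"Days": 30, "StorageClass": "GLACIER"}
                        ],
                    }
                ]
            },
        )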

    4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks, although you do have to stop the instance to change its type, so it isn't quite on-the-fly.
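
    For completeness, a boto3 sketch of that stop/resize/start cycle; the instance ID and target type are placeholders:

        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")
        instance_id = "i-0123456789abcdef0"   # placeholder instance ID

        # Stop the instance, change its type in place, then start it again.
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": "r4.16xlarge"},
        )

        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])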