hadoop, hdfs, hortonworks-data-platform, database-mirroring, apache-falcon

Falcon vs WANdisco Non-Stop Hadoop


Use case: I need to copy all the data from one HDFS cluster to another cluster with the same layout of masters and slaves, then decommission the old cluster and start running my jobs on the new one.

I have read that Apache Falcon and WANdisco Non-Stop Hadoop can both handle this kind of mirroring, but I'm not sure what other advantages each offers once it is part of my ecosystem. Which one is more advantageous will depend on the use case, but are there any differences in particular I should know about? Based on your experience with them, can you draw a comparison of Falcon vs WANdisco?


Solution

  • (Disclaimer: I work at WANdisco.)

    My view is that the products are complementary. Falcon does a lot of things besides data transfer, like setting up data workflow stages. WANdisco's products do active-active data replication (which means that data can be used equivalently from both the source and target clusters).

    In your use case, if you use Falcon then under the covers you're actually using DistCp to copy data to the new cluster. You would do an initial transfer to move the bulk of the data, and then at some point run a final incremental pass at cutover to pick up the deltas, after which you can let applications run on the new cluster.
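    A rough sketch of that two-phase DistCp migration from the command line (the NameNode endpoints and paths here are hypothetical, and this is hand-run DistCp rather than the exact invocation Falcon would generate). The script only prints the commands so it can run anywhere; in practice you would execute each `hadoop distcp` line directly:

    ```shell
    #!/bin/sh
    # Hypothetical NameNode endpoints; substitute your own clusters.
    SRC=hdfs://old-nn.example.com:8020
    DST=hdfs://new-nn.example.com:8020

    # Phase 1: initial bulk copy while the old cluster keeps serving jobs.
    echo "hadoop distcp $SRC/data $DST/data"

    # Phase 2 (at cutover): copy only the deltas. -update skips files that
    # are already identical on the target; -delete removes target files that
    # no longer exist at the source, so the target converges on the source.
    echo "hadoop distcp -update -delete $SRC/data $DST/data"
    ```

    The gap between phase 1 and phase 2 is the window in which the clusters diverge, which is why a final quiesce-and-sync step is needed before switching applications over.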

    If you did the data transfer with WANdisco's products, you could use both clusters at the same time as the replication engine coordinates the changes using a Paxos algorithm. That might make an incremental migration easier.

    Other scenarios where you'll notice a difference between continuous active-active replication and DistCp-based copying include backup, disaster recovery, and ingest into multiple data centers. Hope that helps.