hadoop, clojure, parallel-processing, cascalog

Clojure: parallel processing using multiple computers


I have 500 directories, and 1000 files (each about 3-4k lines) in each directory. I want to run the same Clojure program (already written) on each of these files. I have 4 octa-core servers. What is a good way to distribute the processes across these cores? Cascalog (Hadoop + Clojure)?

Basically, the program reads a file, uses a 3rd-party Java JAR to do computations, and inserts the results into a DB.

Note that:

1. being able to use 3rd-party libraries/JARs is mandatory
2. there is no querying of any sort


Solution

  • Because there is no "reduce" stage to your overall process as I understand it, it makes sense to put 125 of the directories on each server and then spend the rest of your time trying to make this program process them faster (a per-server sketch follows this answer), up to the point where you saturate the DB, of course.

    Most of the "big-data" tools available (Hadoop, Storm) focus on processes that need both very powerful map and reduce operations, perhaps with multiple stages of each. In your case, all you really need is a decent way to keep track of which jobs passed and which didn't. I'm as bad as anyone (and worse than many) at predicting development times, though in this case I'd say it's an even chance that rewriting your process on one of the map-reduce-esque tools would take longer than adding a monitoring process that tracks which jobs finished and which failed, so you can rerun the failed ones later (preferably automatically); a minimal tracking sketch is also included below.
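
Here is a minimal sketch of the per-server half, assuming the existing program can be called as a `process-file` function (read the file, run the 3rd-party JAR computation, insert into the DB); the namespace and names are placeholders, not the asker's actual code. It walks the directories assigned to one server and feeds every file to a fixed-size thread pool, one worker per core:

```clojure
(ns file-runner.core
  (:require [clojure.java.io :as io])
  (:import [java.util.concurrent Executors TimeUnit]))

(defn process-dirs
  "Walks the directories assigned to this server and runs process-file
   on every file, using a fixed pool of one worker per core."
  [process-file dirs]
  (let [n-threads (.availableProcessors (Runtime/getRuntime))
        pool      (Executors/newFixedThreadPool n-threads)
        files     (for [d dirs
                        f (file-seq (io/file d))
                        :when (.isFile ^java.io.File f)]
                    f)]
    ;; ExecutorService.execute takes a Runnable, which a Clojure fn already is
    (doseq [f files]
      (.execute pool (fn [] (process-file f))))
    (.shutdown pool)
    (.awaitTermination pool 7 TimeUnit/DAYS)))
```

A fixed pool is used rather than pmap because pmap's chunked, in-order realization can leave cores idle when per-file times vary; if the DB turns out to be the bottleneck, the pool size is the knob to turn down.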
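
For the bookkeeping, a sketch assuming an append-only log per server is enough (the log file names, and logging to files rather than a DB table, are assumptions): finished files are skipped on a rerun and failures are recorded for a later, ideally automatic, retry.

```clojure
(ns file-runner.tracking
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

;; Hypothetical log locations; a DB table would do just as well.
(def done-log   "done.log")
(def failed-log "failed.log")

(defn already-done
  "Returns the set of file paths that completed on a previous run."
  []
  (if (.exists (io/file done-log))
    (set (str/split-lines (slurp done-log)))
    #{}))

(defn record!
  "Appends one file path per line; locking keeps concurrent workers
   from interleaving their writes."
  [log-path ^java.io.File f]
  (locking log-path
    (spit log-path (str (.getPath f) "\n") :append true)))

(defn tracked
  "Wraps a per-file processing fn: files already in the done set are
   skipped, successes are logged, failures are logged for a retry."
  [process-file done]
  (fn [^java.io.File f]
    (when-not (done (.getPath f))
      (try
        (process-file f)
        (record! done-log f)
        (catch Exception _
          (record! failed-log f))))))
```

Wired together, one server's run would look like (process-dirs (tracked process-file (already-done)) assigned-dirs), and rerunning failures is a matter of feeding the paths in failed.log back through the same wrapped function.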