Tags: python, jupyter-notebook, gpu, cluster-computing, hpc

DIY HPC cluster to run Jupyter/Python notebooks


I recently migrated my Python / Jupyter work from a MacBook to a refurbished Gen8 HP rackmount server (192 GB DDR3, 2 × 8-core Xeon E5-2600), which I got off Amazon for $400. The extra CPU cores have dramatically improved model-fitting speed, particularly for the decision tree ensembles I use a lot. I am now thinking of buying additional servers from that era (early-to-mid 2010s, dual- or quad-socket Intel Xeon E5/E7 v1/v2) and wiring them up as a small HPC cluster in my apartment.
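
For concreteness, here is the kind of thing I am fitting; sklearn's n_jobs=-1 fans the tree building out across all available cores (the toy data below stands in for my real workload):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy data standing in for the real workload
    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

    # n_jobs=-1 builds trees on all available cores, which is why the
    # jump from a laptop to a 16-core Xeon box helps so much here
    clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    clf.fit(X, y)

Here's what I need help deciding: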

  1. Is this a bad idea? Am I better off buying a GPU (like a GTX 1080)? The reason I am reluctant to go the GPU route is that I rely heavily on sklearn (it's pretty much the only thing I know and use), and from what I understand, GPU model training is not currently part of the sklearn ecosystem. All my code is written in numpy/pandas/sklearn, so there would be a steep learning curve and backward-compatibility issues. Am I wrong about this?

  2. Assuming (1) is true and CPUs are indeed better for me in the short term, how do I build the cluster and run Jupyter notebooks on it? Is it as simple as buying an additional server, designating one of them as the head node, connecting the servers through Ethernet, installing CentOS / Rocks on both machines, and starting the Jupyter server with IPython Parallel?

  3. Assuming (2) is true, or at least partly true, what other hardware / software do I need? Do I need an Ethernet switch, or is one unnecessary if I am connecting only two machines? Or do I need a minimum of three machines to make the extra CPU cores worthwhile, and thus a switch after all? Do I need to install CentOS / Rocks, or are there better, more modern alternatives for the software layer? For context, I currently run openSUSE on the HP server, and I am pretty much a rookie when it comes to operating systems and networking.

  4. How homogeneous should my hardware be? Can I mix and match CPUs of different frequencies and memory of different speeds across the machines? For example, 1600 MHz DDR3 in one machine and 1333 MHz DDR3 in another? Or a 2.9 GHz E5-2600 v1 alongside a 2.6 GHz E5-2600 v2?

  5. Should I be worried about power? I.e., can I safely plug three rackmount servers into the same power strip in my apartment? There's one outlet where, if I plug in my hairdryer, the lights go out, so I should probably avoid that one :) Seriously, how do I run 2-3 multi-CPU machines under load without tripping the circuit breaker?

Thank you.


Solution

    1. NVIDIA's RAPIDS (rapids.ai) implements a fair bit of the sklearn API on GPUs, mostly via its cuML library. Whether it covers the parts you use, only you can say.
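
    If your estimators are covered, the port can be close to a drop-in swap. A rough sketch, assuming an NVIDIA GPU with the RAPIDS stack installed; cuML's random forest mirrors sklearn's fit/predict interface, though its supported parameters differ:

        import numpy as np
        from sklearn.datasets import make_classification

        # cuML mirrors sklearn's estimator interface but runs on the GPU;
        # requires an NVIDIA GPU and the RAPIDS libraries
        from cuml.ensemble import RandomForestClassifier

        X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
        X = X.astype(np.float32)  # cuML estimators generally expect float32
        y = y.astype(np.int32)

        # Same fit/predict calls as sklearn's RandomForestClassifier, but
        # parameter support differs, so check the cuML docs before porting
        clf = RandomForestClassifier(n_estimators=500)
        clf.fit(X, y)
        preds = clf.predict(X)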

    2. Using Jupyter notebooks for production is widely considered a mistake.
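
    For development, though, the setup you describe in (2) is roughly right: ipyparallel wants a controller plus engines (e.g. `ipcluster start` on the head node, with engines launched on the other machines), which you then drive from a notebook or script. A minimal sketch, assuming such a cluster is already running:

        import ipyparallel as ipp

        # Connect to the running cluster (reads the connection file ipcluster wrote)
        rc = ipp.Client()
        print("engines:", rc.ids)  # one id per engine, across all nodes

        def fit_one(seed):
            # Imports live inside the function so they run on the engines
            from sklearn.datasets import make_classification
            from sklearn.ensemble import RandomForestClassifier
            X, y = make_classification(n_samples=10_000, n_features=30, random_state=seed)
            clf = RandomForestClassifier(n_estimators=100, random_state=seed)
            return clf.fit(X, y).score(X, y)

        # A load-balanced view hands each task to whichever engine is free;
        # returning a score rather than the fitted model keeps transfers small
        view = rc.load_balanced_view()
        scores = view.map_sync(fit_one, range(32))
        print(scores)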

    3. You don't need a switch unless latency is a serious issue, and it rarely is. Two machines can be connected directly with a single Ethernet cable (modern NICs handle crossover automatically); with three or more, a cheap gigabit switch is the simplest option.

    4. Completely irrelevant. Mixing CPU frequencies and memory speeds across machines just means the slower nodes finish their share of the work a bit later; nothing breaks.

    5. For old hardware of the sort you are considering, expect VERY high power bills. Worse, with several not-so-new machines, the probability that some component has failed at any given time is high, so unless you seek a future in computer maintenance, this is not a great idea. A better plan: develop your code on your MacBook / existing server, then rent an AWS spot instance (or two or three) for a couple of days. Cheaper, no muss, no fuss, everything just works.
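
    As for the breaker question specifically, some rough numbers, assuming a typical US 15 A / 120 V residential circuit (~1800 W peak, ~1440 W under the usual 80% continuous-load rule of thumb) and 300-500 W per dual-socket server under full load: three such machines can draw 900-1500 W, at or near the limit of a single circuit. Spread them across outlets on different breakers, and verify actual draw with a plug-in power meter before leaving them running unattended.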