ROCKS Tames Beowulf Clusters

I set up a two-node Beowulf cluster at my house one time, but back then I had no clue what to do with it. It's tough for an administrator to make effective use of a cluster, since there's no usable parallelized password cracker that I know of. Now I'm in an environment where other people can make use of a cluster, so I volunteered to set one up using Rocks 3.2.0.

System administration and cluster administration are very different things. The tools are different, the problems are different (though many are related to system-level stuff that we can sink our teeth into), and the options for implementing a cluster are many and varied.

You can grab a vanilla distribution and build your tools of choice from source on every node in your cluster. Then when new software, or new user account information, or a new mount directive, or a new host entry, or whatever, needs to be distributed to every node in the cluster, you can figure out how to do it and then automate it so that next time it'll be a piece of cake. Eventually (or maybe that should be "occasionally"), you'll get to a point where you're not really sure if the software is all in sync on all of the nodes, or a small set of nodes is acting in a manner inconsistent with the rest of the nodes, and you can just blow the whole thing away and start this whole process over from scratch. Believe it or not, this appears to work wonderfully for many cluster administrators.
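To make the "is everything still in sync?" worry concrete, here is a minimal sketch of the kind of check a hand-rolled cluster ends up needing. It assumes you have already collected a package list from each node somehow; the node names, package strings, and function names are all made up for illustration and have nothing to do with Rocks itself.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint(packages: List[str]) -> str:
    """Hash a package list so two nodes can be compared cheaply.

    The list is sorted first, so ordering differences do not count as drift.
    """
    joined = "\n".join(sorted(packages))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_drifted_nodes(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the nodes whose package set differs from the most common one."""
    prints = {node: fingerprint(pkgs) for node, pkgs in inventory.items()}
    if not prints:
        return []
    majority, _count = Counter(prints.values()).most_common(1)[0]
    return sorted(node for node, fp in prints.items() if fp != majority)


if __name__ == "__main__":
    # Hypothetical inventory; in practice each list would come from the node.
    inventory = {
        "node01": ["gcc-3.2", "mpich-1.2"],
        "node02": ["mpich-1.2", "gcc-3.2"],
        "node03": ["gcc-3.2", "mpich-1.1"],
    }
    print(find_drifted_nodes(inventory))
```

Treating the majority as "correct" is a design shortcut, not a guarantee. If most of the cluster has drifted the same way, a check like this will happily flag the one healthy node.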

Of course, these are probably full-time cluster admins. For those of us who have an environment to maintain and administer outside the cluster room, it'd be nice to have all of this stuff "just work." This, my friends, is where Rocks can save your sanity and time!

What the heck is Rocks?

Rocks is one of a few available cluster toolkits. Cluster toolkits come in various shapes and sizes, all offering different mixes of control and convenience, which are somewhat inversely proportional to each other. Some are layers on top of a vanilla distribution; others are just packages of scripts to make some maintenance tasks a bit easier. Rocks is an entire cluster implementation, complete with all of the tools, software, and (gasp!) documentation you need to get a cluster off the ground in minutes.

No, really. Minutes.

Since clusters have different purposes, calling for different toolsets, the Rocks distribution is downloadable in the form of "rolls." A roll is basically another ISO image that adds a certain desirable toolset to your cluster automagically. So, for example, in addition to the Rocks "base" and "HPC" rolls (which are required in order to use the cluster as, well, a cluster) I knew from my research that I wanted to use the Maui Scheduler for my cluster scheduling interface, and Torque for my resource management (Torque is based on the wildly popular OpenPBS resource manager). I was able to do this by adding the Rocks "PBS/Maui" roll.

Getting the cluster going is pretty much dependent on getting the head node up and running. Pop in the Rocks base CD, fire it up, and you're presented with a menu with some obvious options. Choose the one that says you want to install a front-end node (that's the head node). At that point, you'll have a few questions to answer, which are simple if you've ever installed a Linux distribution. The installer then asks if you want to add any rolls to your installation. Keep adding rolls to your heart's content, and when you're done, just tell the installer you have no more rolls to add. At that point, it'll go about its merry way, and in just about no time you'll have a head node.

And what a head node it is! It contains a PXE server pre-configured to kickstart the rest of the nodes, in accordance with the rolls you added during your head node installation. In my case, I used the PBS/Maui roll, so all of my compute nodes were set up to be PBS compute nodes which report to a server on the head node. If you ever decide that a compute node is so badly out of whack that you want to start over, you just log into the head node and type shoot-node nodename. The node is rebooted and reinstalled, and on any decent hardware it's back in service in about 10 minutes.
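If you find yourself reinstalling several nodes at once, a thin wrapper around that command is easy enough to write. The sketch below only relies on the shoot-node nodename form mentioned above; the function names and the node names are hypothetical, and by default it just builds the command lines without running anything.

```python
import subprocess
from typing import List


def build_shoot_command(node: str) -> List[str]:
    """Build the argv for reinstalling one node, rejecting odd-looking names."""
    name = node.strip()
    if not name or any(ch.isspace() for ch in name):
        raise ValueError(f"not a usable node name: {node!r}")
    return ["shoot-node", name]


def reinstall(nodes: List[str], execute: bool = False) -> List[List[str]]:
    """Return the commands for each node; run them only if execute is True."""
    commands = [build_shoot_command(n) for n in nodes]
    if execute:
        for argv in commands:
            # check=True stops at the first node whose command fails.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical node names; dry run only.
    for argv in reinstall(["compute-0-3", "compute-0-7"]):
        print(" ".join(argv))
```

Keeping the dry run as the default is deliberate: a reinstall wipes the node, so it seems wise to look at the list before passing execute=True on the head node.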

Adding new nodes is a matter of logging into the head node, running insert-ethers, and booting the compute node you want to add from the network. That's it. The head node adds all of the information about the new node to a local database and kickstarts it.
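As a mental model of that bookkeeping, here is a toy registry that hands out a name the first time it sees a MAC address and returns the same name on every later boot. This is a sketch under assumptions, not how insert-ethers or the Rocks database actually work: the class, the compute-rack-rank naming pattern, and the MAC addresses are illustrative only.

```python
from typing import Dict


class NodeRegistry:
    """Toy stand-in for the head node's record of known compute nodes."""

    def __init__(self, prefix: str = "compute", rack: int = 0) -> None:
        self.prefix = prefix
        self.rack = rack
        self._by_mac: Dict[str, str] = {}

    def register(self, mac: str) -> str:
        """Return the node name for a MAC, assigning the next free one if new."""
        key = mac.strip().lower()
        if key not in self._by_mac:
            rank = len(self._by_mac)  # nodes are numbered in discovery order
            self._by_mac[key] = f"{self.prefix}-{self.rack}-{rank}"
        return self._by_mac[key]


if __name__ == "__main__":
    registry = NodeRegistry()
    # Hypothetical MAC addresses, in the order the nodes were powered on.
    for mac in ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "AA:BB:CC:00:00:01"]:
        print(mac, "->", registry.register(mac))
```

The point of the model is the idempotence: a node that reboots or gets reinstalled keeps its identity because the MAC address, not the boot order, is the key.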

Need to add user accounts? Just add 'em to the head node, and it'll spit 'em out to the rest of the cluster using a Rocks-specific multicast-based service called 411, which is used as a replacement for NIS and seems to work flawlessly thus far. (There are tools to whip it into submission, just in case, but I have yet to need them.) The same service handles host information, so that your entire cluster is kept in sync at all times. The service works so well that I wonder why nobody has tried to implement 411 outside the context of a cluster -- maybe because of the multicast dependency?
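Stripped of the networking, the job 411 does for you boils down to "make each node's copy of certain files match the head node's." The sketch below models only that end result with plain dictionaries. It says nothing about how 411 itself is implemented, and the function name and file contents are invented for the example.

```python
from typing import Dict, List


def sync_from_head(head_files: Dict[str, str], node_files: Dict[str, str]) -> List[str]:
    """Copy any missing or stale file from the head node's view to a node.

    Both arguments map a path to its contents. The node's dictionary is
    updated in place, and the list of paths that changed is returned.
    """
    updated = []
    for path, content in sorted(head_files.items()):
        if node_files.get(path) != content:
            node_files[path] = content
            updated.append(path)
    return updated


if __name__ == "__main__":
    # Hypothetical contents: a user was just added on the head node.
    head = {"/etc/passwd": "root\nalice\n", "/etc/hosts": "10.0.0.1 head\n"}
    node = {"/etc/passwd": "root\n", "/etc/hosts": "10.0.0.1 head\n"}
    print(sync_from_head(head, node))  # only the changed file is listed
    print(sync_from_head(head, node))  # a second pass finds nothing to do
```

A second pass that finds nothing to do is the property you want from any sync scheme, because it means running it too often costs almost nothing.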

In addition to adding compute nodes, the head node can also add different kinds of nodes. For example, I currently have 16 compute nodes, which will grow to roughly 140. Sometime before node 140 gets added to the cluster, I/O performance for large, long-running jobs will become a problem. Rocks, however, can bring up a dedicated Parallel Virtual File System (PVFS) I/O node for you, and make sure that your nodes actually use it. PVFS can completely alleviate the need for NFS within your cluster, and we all know NFS is an enormous source of performance issues, administrative overhead, and downtime.
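A bit of back-of-the-envelope arithmetic shows why growth from 16 to 140 nodes is a worry. The sketch assumes an idealized world where every node does I/O at the same time and bandwidth is split evenly. The 100 MB/s per-server figure is a made-up round number, not a measurement of my hardware or of PVFS.

```python
def per_node_share(server_mb_s: float, io_servers: int, busy_nodes: int) -> float:
    """Evenly split the combined throughput of the I/O servers across nodes."""
    if io_servers < 1 or busy_nodes < 1:
        raise ValueError("need at least one I/O server and one busy node")
    return server_mb_s * io_servers / busy_nodes


if __name__ == "__main__":
    assumed_server_mb_s = 100.0  # hypothetical throughput of one file server
    for servers, nodes in [(1, 16), (1, 140), (4, 140)]:
        share = per_node_share(assumed_server_mb_s, servers, nodes)
        print(f"{servers} server(s), {nodes} nodes: {share:.2f} MB/s per node")
```

Under those assumptions a single server gives each of 16 nodes 6.25 MB/s, but each of 140 nodes well under 1 MB/s. Spreading the load over several I/O servers, which is the idea behind a parallel file system, is what pulls the per-node number back up.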

If your cluster nodes don't have PXE or even a CD drive, no problem -- just check the Rocks documentation for alternative solutions suitable to your situation.

Support and docs

The Rocks documentation is great for getting an initial Rocks configuration up and running (and tested) -- and keep in mind this is coming from the guy who got so fed up with the state of open source documentation that he wrote an article about it. The Rocks team has done a good job of addressing the possible issues that can arise during a configuration, and how to avoid common pitfalls. They explain their take on the problem of cluster configuration up front, state their assumptions, and clearly explain how to get started.

The Rocks discussion mailing list is amazing. I've posted a few questions, and even caught one bug, and there has been friendly help and encouragement along the way. I generally avoid adding yet another mailing list to my collection, since most of them seem to be frequented by new users who don't understand why their hostname keeps showing up as "localhost," but this one was well worth it. I have yet to ask a question that went unanswered, and the problems I've had have been solved in record time.

In conclusion

If you don't know much about clusters, but you want to know more, it's sometimes easier to learn on a system that is already in some form of "known-working" state. Rocks makes getting to this point a breeze, after which you can start kicking tires and playing with the tools provided. From there, it's just a short hop to production. Rocks has done an excellent job of abstracting the problems associated with maintaining a cluster, and boiling them all down to a set of very usable, very efficient tools.