Using RAID with PVFS under ROCKS

I administer a newly deployed ROCKS compute cluster, and I use the Parallel Virtual Filesystem (PVFS), which comes with the ROCKS Linux distribution, to provide a parallel IO system. If you're not familiar with either one, check out my earlier ROCKS article, as well as my earlier article about PVFS.

My cluster is slightly older hardware -- dual PIIIs, and each PC has two hard drives. Initially, I thought having two drives was great news, because I could add all of the capacity of the second drive, along with unused capacity of the first drive to grant large amounts of scratch space to the cluster users, some of whom would be more than happy to have it. However, what I didn't realize was that PVFS can only use a single mount point for its data storage needs. I couldn't tell PVFS to use /dev/hda3 and /dev/hdb1. Then someone on the ROCKS list said I might consider using RAID, but that he hadn't tried it himself. I was game, and it works wonderfully. So here's how I did it.

Moving the metadata server off the head node

On a ROCKS cluster, the head node is called frontend-0-0. In testing, this is where my PVFS metadata server lived. However, the frontend also serves up home directories to the rest of the cluster, and handles intercommunications between the scheduler and workload management daemons across the cluster. It also gathers statistics on the cluster, pushes out administrative changes to the nodes, and runs a web server. That's more than enough load without coordinating all of the PVFS clients and matching their requests with the 16 PVFS IO nodes. On frontend-0-0, I just turned off the "mgr" and "pvfsd" services. I'll turn the "pvfsd" service back on after I configure the *new* metaserver, which will allow frontend-0-0 to be a pvfs client. It needs to be a client so that users can collect all of their data from there without logging into other nodes.

The new metadata server is pvfs-0-0, which (as the name implies) is also a PVFS IO node. It has two physical disks installed, as do all of the PVFS IO nodes, so I'll configure them in a RAID0 setup to maximize capacity, and then configure the metadata server.

Setting up RAID

First, I took a look at the state of things in diskland on what will become the new PVFS metaserver. It will also be an IO node.

[root@pvfs-0-0 init.d]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/hdg1 5.8G 2.2G 3.4G 40% /
/dev/hde1 57G 8.8G 45G 17% /state/partition2
/dev/hdg3 50G 33M 48G 1% /state/partition1
none 501M 0 501M 0% /dev/shm

I'm going to use /dev/hde1 and /dev/hdg3 in my RAID 0 volume, which I build with the LVM tools. Since I had never used them on these systems, I first needed to run vgscan, which creates the files that the later commands rely on. Then I immediately ran pvcreate, which creates physical volumes from the disk partitions you specify.

[root@pvfs-0-0 init.d]# vgscan
vgscan -- reading all physical volumes (this may take a while...)
vgscan -- "/etc/lvmtab" and "/etc/lvmtab.d" successfully created
vgscan -- WARNING: This program does not do a VGDA backup of your volume

[root@pvfs-0-0 init.d]# pvcreate /dev/hde1 /dev/hdg3
pvcreate -- physical volume "/dev/hde1" successfully created
pvcreate -- physical volume "/dev/hdg3" successfully created

Next I have to create a "volume group", which takes the physical volumes and essentially puts them under a unified label. In this case, I'm naming the volume group "pvmeta".


[root@pvfs-0-0 init.d]# vgcreate pvmeta /dev/hde1 /dev/hdg3
vgcreate -- INFO: using default physical extent size 32 MB
vgcreate -- INFO: maximum logical volume size is 2 Terabyte
vgcreate -- doing automatic backup of volume group "pvmeta"
vgcreate -- volume group "pvmeta" successfully created and activated

Now it's time to actually create a logical volume -- something that can be treated like a regular old disk partition, so I can go about the business of creating a filesystem, assigning a mount point, etc.


[root@pvfs-0-0 init.d]# lvcreate -L 100G pvmeta
lvcreate -- doing automatic backup of "pvmeta"
lvcreate -- logical volume "/dev/pvmeta/lvol1" successfully created
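
The sequence from vgscan through lvcreate can be collected into a single script. What follows is only a sketch under stated assumptions: the run helper, the make_volume function, and the variable names are made up for illustration and are not part of LVM or ROCKS. As written it only prints each command, so the list can be reviewed before anything touches a disk.

```shell
#!/bin/sh
# Dry-run sketch of the LVM steps above. Nothing here touches a disk:
# run() only prints. Change its body to "$@" to execute for real (as root).
# WARNING: pvcreate overwrites whatever is on the partitions it is given.

VG=pvmeta                      # volume group name used in the article
PARTS="/dev/hde1 /dev/hdg3"    # partitions to pool together
SIZE=100G                      # logical volume size

run() { echo "$@"; }           # hypothetical helper, not an LVM command

make_volume() {
    run vgscan
    run pvcreate $PARTS        # unquoted on purpose: one argument per partition
    run vgcreate "$VG" $PARTS
    run lvcreate -L "$SIZE" "$VG"
}

make_volume
```

One thing worth checking on your own system: lvcreate also accepts -i to stripe a logical volume across several physical volumes. The transcript above does not use it, and as far as I know the default allocation is linear, so the result may behave more like a concatenation of the two partitions than a striped RAID 0.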

With that out of the way, it's time to put a filesystem on the new volume so that PVFS can use it for data storage. This is done using the standard mkfs command, just as you would do with any other partition.

[root@pvfs-0-0 init.d]# mkfs -t ext3 /dev/pvmeta/lvol1
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13107200 inodes, 26214400 blocks
1310720 blocks (5.00%) reserved for the super user
First data block=0
800 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

All that's left now is to mount the new volume under the /pvfs-data directory and check that everything worked. One thing you'll notice is that df doesn't report the full 100GB of space you asked for when you created the volume. That isn't PVFS's doing: ext3 itself sets aside space for its inode tables and journal, and reserves 5% of the blocks for the superuser, as the mkfs output above shows. PVFS then keeps its own structure under this mount point so it can quickly find the data chunks that the local machine manages on behalf of clients spread throughout the cluster.

[root@pvfs-0-0 /]# mount /dev/pvmeta/lvol1 /pvfs-data
[root@pvfs-0-0 /]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/hdg1 5.8G 2.2G 3.4G 40% /
/dev/hde1 16T 16T 0 100% /state/partition2
/dev/hdg3 16T 16T 0 100% /state/partition1
none 501M 0 501M 0% /dev/shm
/dev/pvmeta/lvol1 99G 33M 94G 1% /pvfs-data
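
The gap between the 100G requested and the 94G that df reports as available can be roughly reconstructed from the mkfs output. The sketch below just does the arithmetic. The block counts are copied from that output; the 128-byte inode size is an assumption (it is not printed anywhere above), and smaller overheads such as bitmaps and group descriptors are ignored.

```shell
#!/bin/sh
# Figures copied from the mkfs output above.
BLOCK_SIZE=4096
TOTAL_BLOCKS=26214400
RESERVED_BLOCKS=1310720
JOURNAL_BLOCKS=8192
INODES=13107200
INODE_SIZE=128   # ASSUMPTION: bytes per inode; not shown in the mkfs output

BLOCKS_PER_MIB=$((1048576 / BLOCK_SIZE))

total_mib=$((TOTAL_BLOCKS / BLOCKS_PER_MIB))        # size of the logical volume
reserved_mib=$((RESERVED_BLOCKS / BLOCKS_PER_MIB))  # 5% held back for root
journal_mib=$((JOURNAL_BLOCKS / BLOCKS_PER_MIB))    # ext3 journal
inode_mib=$((INODES * INODE_SIZE / 1048576))        # inode tables
avail_mib=$((total_mib - reserved_mib - journal_mib - inode_mib))

echo "volume size:     $total_mib MiB"
echo "reserved blocks: $reserved_mib MiB"
echo "journal:         $journal_mib MiB"
echo "inode tables:    $inode_mib MiB"
echo "left for users:  $avail_mib MiB"
```

Under those assumptions the estimate comes to 95648 MiB, or about 93.4 GiB, which is in line with the 94G in the Avail column once df's rounding is taken into account. The 32 MiB journal also matches the 33M that df already shows as used on the empty filesystem.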

Using the ROCKS tools to propagate changes

This is easy enough to set up on a single box, but now what? I have 16 IO nodes! Certainly there's some way to automate this to some extent, no? Yes. ROCKS comes with a plethora of cluster management tools that make things really easy to manage across a number of nodes, no matter if they're compute nodes, PVFS nodes, or otherwise. The most powerful tool in the ROCKS administrator's arsenal is the cluster-fork command. Here are a few of the same commands I ran on pvfs-0-0, only this time, they're being run on pvfs-0-1 thru pvfs-0-15:

cluster-fork --nodes "pvfs-0-%d:1-15" mkfs -t ext3 /dev/pvmeta/lvol1
cluster-fork --nodes "pvfs-0-%d:1-15" mkdir /pvfs-data
cluster-fork --nodes "pvfs-0-%d:1-15" mount /dev/pvmeta/lvol1 /pvfs-data
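
To make the node specification concrete, here is a small sketch of how a string like "pvfs-0-%d:1-15" maps onto hostnames. The expand_nodes function is hypothetical and written only for illustration; it is not how cluster-fork is implemented, just a way to preview which nodes a spec names before running anything against them.

```shell
#!/bin/sh
# expand_nodes: hypothetical helper, NOT part of ROCKS.
# Takes "template:values", where values is a range (1-15), a comma-separated
# list (5,6,9,10), or a mix of both, and prints one hostname per line.
# The function body is a subshell, so changing IFS does not leak out.
expand_nodes() (
    spec=$1
    template=${spec%%:*}      # text before the colon, e.g. pvfs-0-%d
    values=${spec#*:}         # text after the colon, e.g. 1-15
    IFS=,
    for item in $values; do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;
            *)   first=$item; last=$item ;;
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            # The template is used as the printf format on purpose,
            # so that %d is replaced by the node number.
            printf "${template}\n" "$n"
            n=$((n + 1))
        done
    done
)

expand_nodes "pvfs-0-%d:1-15"
```

Calling expand_nodes "pvfs-0-%d:5,6,9,10" prints only those four hosts, which is a cheap sanity check before handing the same spec to a destructive command like mkfs.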

This is the same command I used to set up the compute nodes as proper clients to the newly configured PVFS metaserver. These clients have a pvfstab file left over from testing, so you'll notice that I use the overwrite redirection operator, >, to make sure that what I'm echoing is the only thing in the file.

cluster-fork --nodes "compute-0-%d:0-17" 'echo "pvfs-0-0.local:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0" > /etc/pvfstab'
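
The difference between overwriting and appending is easy to try out safely on a scratch file. This sketch uses a temporary file as a stand-in for /etc/pvfstab; the "stale entry" line is invented for the demonstration.

```shell
#!/bin/sh
tab=$(mktemp)    # scratch file standing in for /etc/pvfstab

# Pretend something is left over from testing (invented content).
echo "stale entry left over from testing" > "$tab"

# ">" truncates the file first, so the new line is the only one left.
echo "pvfs-0-0.local:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0" > "$tab"
after_overwrite=$(( $(wc -l < "$tab") ))
first_line=$(head -n 1 "$tab")

# ">>" appends instead, which would have kept the stale entry around.
echo "a second entry" >> "$tab"
after_append=$(( $(wc -l < "$tab") ))

echo "lines after >  : $after_overwrite"
echo "lines after >> : $after_append"
rm -f "$tab"
```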

Two quick things to remember about cluster-fork. One is that you can specify ranges using %d as a placeholder, followed by the range (i.e. %d:range). You can also supply a comma-separated list of numbers instead of a strict range. The second thing to remember is to be very careful about passing multiple commands to cluster-fork -- especially destructive ones. If you don't use quotes, and you run something like this:

cluster-fork --nodes "pvfs-0-%d:5,6,9,10" mount /dev/hdg3 /state/partition1; mount /dev/hde1 /state/partition2

You're going to wind up mounting /dev/hdg3 on the pvfs nodes you specify, and then mounting /dev/hde1 on the local machine you're typing on, because your local shell splits the line at the semicolon before cluster-fork ever sees it. When you're new to these tools it's easy to forget that -- but it's still bash, so stay alert!
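
The splitting can be seen without going anywhere near a real cluster. In this sketch, fake_fork is a made-up stand-in that only prints what it was handed, and a plain echo stands in for the second mount. It assumes, as the quoted pvfstab example earlier suggests, that cluster-fork passes a quoted string through to the nodes as one command line.

```shell
#!/bin/sh
# fake_fork: hypothetical stand-in for cluster-fork. It runs nothing and
# only reports the command text it received.
fake_fork() {
    shift 2                       # drop "--nodes" and the node spec
    echo "sent to nodes: $*"
}

demo() {
    # Unquoted: the local shell splits at ";" before fake_fork ever runs,
    # so the second command executes on the machine you are typing on.
    fake_fork --nodes "pvfs-0-%d:5,6,9,10" mount /dev/hdg3 /state/partition1; echo "ran locally: mount /dev/hde1 /state/partition2"

    # Quoted: both commands reach fake_fork as a single argument.
    fake_fork --nodes "pvfs-0-%d:5,6,9,10" 'mount /dev/hdg3 /state/partition1; mount /dev/hde1 /state/partition2'
}

demo
```

The first call reports only the first mount as sent to the nodes, with the second running locally; the quoted call reports both.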


In Conclusion

Using this method, I was able to grow the scratch space available to my cluster from around 500GB to about 1.5TB. So far, everything is humming along nicely on the production cluster. There are details I haven't covered here, like the fact that you have to restart all of the IOD daemons on the IO nodes and the pvfsd daemons on the clients. I left them out for two reasons. One, the last thing I do before launching anything into production is make sure I haven't done anything that will keep the cluster from rebooting -- by rebooting the cluster. Two, rebooting the cluster restarts everything that needs to be restarted in order for things to work.

I hope this article has piqued an interest in clustering with ROCKS, and using PVFS in your cluster.