Parallel I/O for Clusters Using PVFS

When you're trying to share files and perform I/O-intensive operations across 100+ nodes of a Beowulf cluster, the old model of a central NFS file server handling every client request begins to break down horribly. What many cluster admins do instead is limit NFS to distributing home directories across the cluster, and use some form of parallel I/O model for doing "real work."

Of course, cluster administrators and users alike crave the benefits of an NFS-like system; admins (in many cases) want logins only to the head node of the cluster, and users want to have all of their output in one place so they don't have to collect it from the various local disks of the individual compute nodes. However, it has been proven time and again that NFS is ill-suited for the kind of pounding it would take in a large, I/O-intensive cluster environment. The answer to this problem has come from researchers, open source projects, and the private sector, in the form of a parallel I/O model.

PVFS basics

The Parallel Virtual File System (PVFS) is one solution for creating a parallel I/O environment for your compute nodes to play in. The model is simple when you look at it from a high level: instead of an all-encompassing server that handles every part of every client request, the work is split among specialized servers. In a PVFS environment, a single metadata server keeps track of where existing data lives and which machines are available for performing actual I/O operations. The I/O daemons do the heavy lifting of performing the reads and writes, and they are the actual data stores.

While there is only one metadata server in use per PVFS filesystem, there can be many IODs (I/O daemons). While theory might suggest that the number of IODs could scale indefinitely, the reality is that at some point you'll see diminishing returns due to limitations in the network infrastructure, the metadata server's ability to handle all of the requests, or the client-side utilities, libraries, and network hardware. Though I have yet to go over 16 IODs on a PVFS filesystem, other cluster admins I've spoken to have indicated that IODs scale up to 32 per filesystem with no problems. Developers who work with PVFS say that 100 IODs have been run without issues in testing environments, though they note that at some point the metadata overhead of each additional IOD outweighs the gain in I/O performance. For the record, much work has been done to address these issues in PVFS2, the next generation of PVFS.

The observant among you will note that I've implied that there can be more than one PVFS filesystem per cluster. This is absolutely true. However, a metadata server can handle requests for only a single filesystem, and to my knowledge, IODs answer only to a single metadata server, which also limits them to one filesystem. Technically, you can create two PVFS filesystems using the same physical machines by running two instances of the metadata server on one machine, and running two instances of the IODs on the I/O nodes. Of course, this would affect the scalability, performance, and reliability of both filesystems, not to mention perhaps your own sanity and other mental faculties, but the daemons communicate with each other on configurable port numbers, so in non-production environments, this is perfectly feasible.
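As a sketch of what that double-duty setup might involve, here are two hypothetical IOD configuration files for one I/O node. The file locations and option names are assumptions modeled on the PVFS 1.x quickstart, so check your own installation's documentation:

```
# /etc/iod-fs1.conf -- IOD instance serving the first filesystem
port 7000
rootdir /pvfs-data-fs1

# /etc/iod-fs2.conf -- second IOD instance on the same I/O node
port 7001
rootdir /pvfs-data-fs2
```

Each metadata server instance would likewise need its own port, and clients would mount the two filesystems separately.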

Another detail you might wonder about is how storage capacity is determined. Storage capacity is a function of how much space is dedicated to the IODs on the I/O nodes. If you have a 50GB partition dedicated to the IODs on each of two IO nodes, you'll have something slightly less than 100GB available for use by the clients. Some space is taken up by directories set up by the IODs to help them find data more efficiently without using, say, a separate database component.
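The arithmetic is simple enough to sketch in a line of shell, using the two 50GB IOD partitions from the example above (the exact overhead varies, so this only gives the raw figure):

```shell
# Raw capacity is the sum of the space dedicated on each I/O node.
# Two IODs contributing 50GB each, as in the example above:
echo "50 50" | awk '{ for (i = 1; i <= NF; i++) sum += $i
                      print sum "GB raw (usable space will be slightly less)" }'
```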

A PVFS I/O operation up close

There is already great documentation on getting started and setting up a basic PVFS filesystem in the form of a PVFS Quickstart guide. There is also a PVFS-users mailing list, which is helpful without being high-traffic. While I won't duplicate the effort put into that guide, I will briefly discuss the basic mechanics of PVFS.

In this example, I have one metadata server, two IODs, and one client. The client, it should be noted, mounts a directory from the metadata server, since it is the authoritative source of information regarding what the filesystem looks like at any given time (nice, as it hides the details and allows you to scale IODs in a way that is transparent to the clients). I've mounted my PVFS partition to /mnt/pvfs, and I'm going to copy a file from my local disk to that directory.
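For reference, here's roughly what that client-side setup looks like. The daemon name, module name, and paths are assumptions based on the PVFS 1.x quickstart, so treat this as a sketch rather than a recipe:

```
# Load the PVFS kernel module and start the client-side daemon
insmod pvfs.o
/usr/sbin/pvfsd

# Mount the filesystem from the metadata server (frontend-0-0 here),
# which exports its metadata directory (/pvfs-meta in this sketch)
mount -t pvfs frontend-0-0:/pvfs-meta /mnt/pvfs

# Copying a file in stripes it across the IODs transparently
cp results.dat /mnt/pvfs/
```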

Let's look at a simple dump of this operation from the client side to see things from the client's point of view. My network uses the 192.168.22.0/24 address range. Node 2 (192.168.22.2) is the metadata server, node 252 is the local machine, and nodes 253-254 are the I/O nodes. For clarity in reading the dump, port 7000 is the port used by the I/O nodes, and port 3000 is used by the metadata server. I've stripped most of the data to focus on the conversation (and to be nice to my editors!)

[root@compute-0-0 root]# tcpdump -nn -i eth0 \( host pvfs-0-0 or host pvfs-0-1 \
or host frontend-0-0 \) and \( port 7000 or port 3000 \)
tcpdump: listening on eth0
192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.146923 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.147008 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.147242 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.149613 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.149738 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.153163 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.192986 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.194201 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.194250 192.168.22.252.32792 > 192.168.22.2.3000

To start, the client contacts the metadata server and requests a read operation on the /mnt/pvfs directory, to make sure that the file doesn't already exist. The metadata server replies with the contents of the directory (in this case the directory is empty, so that didn't take long). The client then requests a write operation, and the metadata server returns the list of IODs to contact to handle that request. From here, most of the traffic is between the client (252) and the IODs (253 and 254).
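Once the client has the IOD list, it stripes the file's data across them round-robin. A toy calculation shows which IOD a given byte offset lands on; the 64KB stripe size is the historical PVFS default (assumed here), and two IODs match this example:

```shell
stripe=65536   # assumed 64KB default stripe size
niods=2        # two I/O daemons, as in this example

# Round-robin striping: stripe number modulo the IOD count
for offset in 0 65536 131072 200000; do
  echo "byte offset $offset -> IOD $(( (offset / stripe) % niods ))"
done
```

So the first 64KB of the file goes to one IOD, the second 64KB to the other, and so on, which is why both 253 and 254 show up immediately in the dump above.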

16:16:00.194386 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.194433 192.168.22.252.32794 > 192.168.22.253.7000
16:16:00.195631 192.168.22.254.7000 > 192.168.22.252.32793
16:16:00.195664 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.195670 192.168.22.253.7000 > 192.168.22.252.32794
16:16:00.195701 192.168.22.252.32794 > 192.168.22.253.7000
16:16:00.195898 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.197730 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.197884 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.202981 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.203261 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.203346 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.203700 192.168.22.254.7000 > 192.168.22.252.32793
16:16:00.203730 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.205020 192.168.22.254.7000 > 192.168.22.252.32793
16:16:00.205217 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.205300 192.168.22.252.32794 > 192.168.22.253.7000
16:16:00.206463 192.168.22.254.7000 > 192.168.22.252.32793
16:16:00.206542 192.168.22.253.7000 > 192.168.22.252.32794
16:16:00.206734 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.209134 192.168.22.2.3000 > 192.168.22.252.32792
16:16:00.242995 192.168.22.252.32792 > 192.168.22.2.3000
16:16:00.243010 192.168.22.252.32793 > 192.168.22.254.7000
16:16:00.243022 192.168.22.252.32794 > 192.168.22.253.7000

This traffic consists almost entirely of data-moving operations from the client to the IODs. Dumps of the same operation taken on the IODs and the metadata server show that there is also communication between these two components, since the IODs have to report to the metadata server about the state of the directory, which enables the metadata server to provide a consistent view of it. Though it's still a bit fuzzy to me, as best I can tell, the traffic interspersed among the last half of this conversation is likely an effort to propagate this updated view of the directory to the client.

PVFS is just one solution for parallel I/O. Others include Lustre and IBM's GPFS, to name just two available for Linux. I hope this article has helped give you an understanding of how PVFS works.