Coming up with an IPC locking mechanism that is robust and works across more than one computer is a hard problem. Our needs for locking at work have been pretty light up until recently. We’ve relied on our ORM grabbing object IDs from a sequence in the database for the most part to avoid needing to worry about locking, or at least to push it down to being the database’s problem. Still, some code sections need locks around them.
The first implementation of a multi-machine locking mechanism involved a simple database table with a unique constraint to store the ID of the resource you’re locking. It works pretty well, but you have to remember to do all your lock-related transactions on a separate database handle so they can be committed independently of any other transactions you have going on, and it doesn’t support shared locks unless you look into stored procedures or some other nonsense. Our database is also very busy, and one slow or hung process can block access to that lock table for everyone else. Pretty soon, the database is being hammered by queries on that one table.
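As a rough illustration (in Python, with sqlite3 standing in for our real database and a made-up resource_locks table), the approach boils down to letting the unique constraint decide who gets the lock:

    import sqlite3

    # Stand-in for the real database; the resource_locks table and its
    # unique constraint are hypothetical names used only for illustration.
    db = sqlite3.connect("locks.db", isolation_level=None)  # autocommit
    db.execute("CREATE TABLE IF NOT EXISTS resource_locks (resource_id TEXT UNIQUE)")

    def acquire(resource_id):
        """Try to take the lock; the unique constraint makes the insert atomic."""
        try:
            db.execute("INSERT INTO resource_locks (resource_id) VALUES (?)",
                       (resource_id,))
            return True
        except sqlite3.IntegrityError:
            return False  # someone else already holds it

    def release(resource_id):
        db.execute("DELETE FROM resource_locks WHERE resource_id = ?", (resource_id,))

In real use the inserts and deletes have to happen on their own handle so they commit right away, which the autocommit connection above glosses over.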
The next pass was to use the existence of a directory on a shared NFS filesystem to represent the lock; mkdir(2) is atomic, even across NFS. Then we started dropping a file in the directory with information about who created it, to support cleaning up locks held by hung or crashed programs.
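In Python-flavored pseudocode, with an illustrative lock path and info-file format, that scheme looks roughly like this:

    import errno, json, os, socket, time

    LOCK_DIR = "/nfs/locks/my-resource.lock"   # hypothetical shared NFS path

    def acquire():
        """Take the lock by creating the directory; mkdir is atomic even over NFS."""
        try:
            os.mkdir(LOCK_DIR)
        except OSError as e:
            if e.errno == errno.EEXIST:
                return False               # someone else holds the lock
            raise                          # anything else is a real error
        # Record who holds the lock so a stale one can be cleaned up later.
        info = {"host": socket.gethostname(), "pid": os.getpid(), "time": time.time()}
        with open(os.path.join(LOCK_DIR, "info"), "w") as f:
            json.dump(info, f)
        return True

    def release():
        os.unlink(os.path.join(LOCK_DIR, "info"))
        os.rmdir(LOCK_DIR)                 # this is the step NFS can break, as below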
That introduced a bug because of the interaction with the NFS servers. Say process A has the lock. Process B wants to check on the staleness of the lock, so it opens the info file and starts to read it. Process A now wants to give up the lock, so it deletes its info file and tries to remove its lock directory, which actually fails. On our NFS systems, if process B has the file open when process A unlinks it, the NFS server creates a file named something like .nfslock1234blah in the same directory to keep track of it. Since the directory is not empty, process A can’t remove the lock directory, and the lock is now hung. We tried different wait times and renaming directories out of the way before deleting them; nothing could reliably get rid of the issue.
Oh yeah, this mechanism doesn’t support shared locks, either.
Our requirements have changed, and now we need shared locks. We’d also like to continue to use the NFS filesystem to communicate about the locks, because it’s well maintained by our sysadmins, available on all the machines in the cluster, and its behavior is well known. Another possibility was to create a locking daemon running somewhere on the cluster, but then the sysadmins would have to support it and make sure it was always running and saving state reliably during downtime.
The first thing we can do is make the locking mechanism a two-step process. A locker first declares its intention to lock by creating a uniquely named subdirectory in the lock directory as a reservation. To actually acquire the lock, it must create a symlink with a well-known name in the lock directory pointing to its reservation; symlink(2) should also be atomic. We’ll compose the reservation names from the hostname, process ID, and the current time. To get the lock, a process tries to symlink() the well-known name to its private reservation directory. To release the lock, it first removes the symlink (someone else can now claim it) and then removes the reservation directory.
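A minimal sketch of the exclusive-lock path, again in Python with made-up paths and helper names:

    import errno, os, socket, time

    LOCK_ROOT = "/nfs/locks/my-resource"          # hypothetical lock directory on NFS
    LOCK_LINK = os.path.join(LOCK_ROOT, "lock")   # the well-known symlink name

    def reservation_name():
        # Unique per host, process, and time, as described above.
        return "resv.%s.%d.%d" % (socket.gethostname(), os.getpid(), int(time.time()))

    def acquire_exclusive():
        resv = os.path.join(LOCK_ROOT, reservation_name())
        os.mkdir(resv)                    # step 1: declare our intention
        try:
            os.symlink(resv, LOCK_LINK)   # step 2: claim the lock atomically
            return resv
        except OSError as e:
            if e.errno == errno.EEXIST:
                os.rmdir(resv)            # someone else holds it; clean up and retry later
                return None
            raise

    def release_exclusive(resv):
        os.unlink(LOCK_LINK)              # others can claim the lock now
        os.rmdir(resv)                    # then remove our reservation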
First lock directory scheme
We can also easily add in shared locks by having a common directory where all the shared lockers drop in a uniquely named file or subdirectory. Locking is slightly different in that you get the lock if the symlink() succeeds, or if the symlink already exists and points to the shared directory. Unlocking a shared lock requires that you first try to remove the shared directory, and only remove the symlink if the rmdir() succeeded. This way, if other shared lockers have files in the shared directory, the rmdir will fail, but that process can still give up its hold on the lock while keeping the shared lock alive.
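The shared-lock release might look something like this sketch (the helper and path names are hypothetical):

    import errno, os

    def release_shared(marker_path, shared_dir, lock_link):
        """Give up one holder's share of the lock; names are illustrative."""
        os.unlink(marker_path)         # remove our own entry from the shared directory
        try:
            os.rmdir(shared_dir)       # only succeeds if we were the last sharer
        except OSError as e:
            if e.errno in (errno.ENOTEMPTY, errno.EEXIST):
                return                 # other sharers remain, so the lock stays held
            raise
        os.unlink(lock_link)           # we were the last one out: release the lock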
There’s still a race condition there… Process Y has the only shared lock and wants to give it up. It rmdir()s the shared directory, which succeeds. Meanwhile, process Z wants the shared lock. It mkdir()s the shared directory, sees that the symlink points to the shared directory, and assumes it has the lock. Now, process Y continues by removing the symlink. Process Z still thinks it has the lock, though someone else can now also claim it.
To fix that, shared lockers creating a new shared directory use a name that matches a pattern so others can find it, but is still unique when it gets created. The next process that wants the shared lock first looks for an existing shared directory and drops a file in there. In the race condition above, Z will create a new directory, but readlink() will point to some other directory name, and so it knows it does not yet have the lock.
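Putting that together, one attempt at a shared lock could look like the following sketch; the paths, the shared.* naming pattern, and the retry behavior are illustrative rather than the exact production code:

    import errno, os, socket, time

    def acquire_shared(lock_root, lock_link):
        """One attempt at a shared lock; returns our marker path or None to retry."""
        marker = "%s.%d.%d" % (socket.gethostname(), os.getpid(), int(time.time()))
        existing = [d for d in os.listdir(lock_root) if d.startswith("shared.")]
        if existing:
            shared_dir = os.path.join(lock_root, existing[0])
        else:
            # New shared directories always get a fresh unique name; old names are
            # never recreated, which is what makes the readlink() check below safe.
            shared_dir = os.path.join(lock_root, "shared." + marker)
            os.mkdir(shared_dir)
        marker_path = os.path.join(shared_dir, marker)
        try:
            open(marker_path, "w").close()        # drop our file in the shared directory
        except OSError as e:
            if e.errno == errno.ENOENT:
                return None                       # directory vanished under us; retry
            raise
        try:
            os.symlink(shared_dir, lock_link)     # lock was free: we now hold it shared
            return marker_path
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
        # Lock already held: we share it only if the symlink points at the very
        # directory we dropped our file into.  In the race above, readlink() still
        # names the deleted directory, so the late arrival knows it must retry.
        if os.readlink(lock_link) == shared_dir:
            return marker_path
        os.unlink(marker_path)                    # someone else's lock: back out, retry
        return None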
Final lock directory scheme
Here, PID 554 on host 2 is in the process of giving up the lock. The reservation file and directory have been removed, but the lock symlink still points to the deleted shared-lock directory. Meanwhile, two other processes are requesting a shared lock and three others are waiting on an exclusive lock.
Probably the most important thing to remember about avoiding race conditions is: don’t check for some condition and then alter the filesystem based on the result. Instead, try to make the change and then check the return value and the allowed failure modes. For example, don’t check for a directory’s existence and then create it if it doesn’t exist; another process can sneak in between the stat() and the mkdir(). Instead, just do the mkdir() and know that it can fail with EEXIST if the directory was already there, while another error like EPERM would be a fatal exception.
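For example, the difference between the two styles in Python (the path is illustrative):

    import errno, os

    # Racy: another process can create the directory between the check and the mkdir().
    if not os.path.exists("/nfs/locks/my-resource"):
        os.mkdir("/nfs/locks/my-resource")

    # Safe: just do the mkdir() and sort out the result afterwards.
    try:
        os.mkdir("/nfs/locks/my-resource")
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass        # fine, someone already created it
        else:
            raise       # EPERM, ENOENT, etc. are real errors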
This should cover all the bases. Be thorough about checking return codes from syscalls, add in some stale-lock cleanup code and zillions of test cases, and we should be good to go.