Core Technology: Interprocess communication
Prise the back off Linux and find out what really makes it tick.
Hello everyone, and welcome to class. My name is Dr Sinitsyn, and as Dr Brown has retired, I’ll be your new Core Technologies coach. It is my pleasure to stand in front of you, even if only virtually, and I hope you’ll enjoy it as well. We are going to continue to uncover the most fundamental, most fascinating and most obscure corners of Unix and networking technologies.
Our latest subject will be Interprocess Communication, or IPC. Do I see a hand raised in the far corner? You think you already learned sockets as an IPC mechanism in issue 6? Good! Sockets are indeed the way to go if the communicating processes run on different machines. However, there are also dedicated efficient means for local communications.
Unix comes with a vast variety of IPC mechanisms. W. Richard Stevens and Stephen Rago jointly authored an excellent book, Advanced Programming in the UNIX Environment, 3rd Edition, which covers all of them in detail, and I suggest you get it before doing any serious Linux programming. But for starters, I’ll tell you a story.
Basically, a shared memory segment is just a set of physical memory pages mapped two or more times, possibly to different processes.
Common data
At the beginning of the century, I was involved in the development of a banner network system. In those days, network advertising was less obtrusive and much less sophisticated than it is now. Basically, we only needed to target ads by the visitor’s city, local time and date (say, weekends only), and a few other things. To meet these goals, we had two Unix daemons: scheduler and barmand. The job of scheduler was to fill large in-memory bitmap tables, and barmand used them to determine a subset of banners of potential interest to the visitor (at least, we hoped so).
We are not concerned with business logic now, but re-read the previous sentence and think for a moment: how could barmand (one process) read the memory of scheduler (another process)? Memory protection forms the basis of many reliability and security features that we enjoy in Linux, so how could a small thing named barmand circumvent it?
The short answer is: it didn’t. Unix has a way to peek into selected chunks of another process’s memory, and even to modify data there. Granted, it is accomplished in a tightly controlled manner and is subject to permission checks. The mechanism is called System V shared memory, and it is arguably the simplest IPC method. It also has minimal overhead: after you map “foreign” memory into your process address space, no further actions on the OS’s side are required until you decide to unmap it.
To map a chunk of memory, you need to refer to it somehow. The solution is to use a System V IPC key, which is a unique integer associated with a shared memory segment. We also need some way to pass it to all processes that share the memory. This is easy if the processes involved are in a parent–child relationship, but may involve external means, such as configuration files, for completely unrelated processes.
Usually, you don’t concern yourself with all these details. The standard C library provides the ftok() function, which accepts a path to an existing file (maybe a process executable) and some non-zero value labelled proj_id (which can be hardcoded) to produce a System V IPC key that fulfils all these requirements. After that, you associate the key with a shared memory segment via the shmget() system call. Finally, you attach (map) the segment with shmat(). You can detach segments no longer needed with shmdt().
Consider (or, even better, type in and compile) the following code. Let’s call it shmwrite.c and omit error handling for brevity:
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#define PROJ_ID 903
#define SEG_SIZE 4096
int main(int argc, char **argv)
{
key_t key;
int shmid;
char *seg = NULL;
if (argc < 2)
exit(1);
key = ftok("/etc/hostname", PROJ_ID);
shmid = shmget(key, SEG_SIZE, IPC_CREAT | 0666);
seg = (char *)shmat(shmid, NULL, 0);
strncpy(seg, argv[1], SEG_SIZE - 1);
seg[SEG_SIZE - 1] = '\0'; /* make sure the string is terminated */
shmdt(seg);
return 0;
}
Here, we attach 4K (one page) of memory and write a string passed as a command line argument at its beginning. The key is created from /etc/hostname, with PROJ_ID arbitrarily set to 903. Also pay attention to the shmget() call. We specify a desired segment size, and that we want the segment created (IPC_CREAT). More interestingly, we set access permissions in much the same way as we do for ordinary files. Here, the segment will be world-readable and writable. shmat() returns a pointer to the memory segment in the current process address space. If you want it at a specific address, pass that address as the second argument instead of NULL. Detaching the segment is not strictly necessary here, as Linux does this automatically when the program exits. Keeping things clean after yourself is always a good habit, however.
The code to print a string stored in this shared memory segment is very similar:
key = ftok("/etc/hostname", PROJ_ID);
shmid = shmget(key, SEG_SIZE, 0600);
seg = (char *)shmat(shmid, NULL, SHM_RDONLY);
printf("%s\n", seg);
shmdt(seg);
This should go into main() of shmread.c. Note that we request read-only access to an already existing segment here.
Try both these programs in action; for instance, run ./shmwrite “Hello, IPC world!”. Then execute ./shmread and see the same message printed. Note that unlike “ordinary” memory you allocate with malloc(), shared memory survives program termination. To “free” it, use the shmctl() system call or the ipcrm command (see below). Either way, the segment will remain available until the last process detaches it. Brave souls can now play with permissions and see how they affect the behaviour.
Let’s make things a bit more interesting. Take some large file (maybe one of your logs) and send it to shmwrite line-by-line, in a fashion similar to this:
while read LINE; do
./shmwrite “$LINE”
done < /var/log/file
In another shell (or perhaps a tmux window, see LV013), run shmread in a similar loop:
while true; do
./shmread
done
You may have expected this to behave like a poor man’s terminal-to-terminal copy utility, but what does it really do? Try it yourself, and check the answer below.
Unix sockets are at the base of D-Bus, which is vital for the modern Linux desktop. This context menu is also a result of IPC.
One at a time
The code should “mostly work”; however, strings may occasionally appear cut off or mangled. This is a typical example of a “race condition”: both programs compete for a single memory region. Say, shmwrite may overwrite the string that shmread is currently printing. To fix this, we need to serialise memory access.
There is a dedicated System V IPC synchronisation primitive, and it’s called a “semaphore”. Akin to a real semaphore that controls railroad traffic, an IPC semaphore serialises process access to a resource, but only as long as all processes obey the semaphore’s signals. If even one train ignores a semaphore, there will be a crash. In Unix, the situation is the same (albeit with hopefully less dramatic consequences).
A semaphore is operated on very similarly to a shared memory segment (and to other System V IPC primitives that we don’t cover here). You use an ftok()-generated key to obtain a semaphore identifier with semget(), then you call semop() to perform semaphore operations. But what are those?
At its most basic level, a semaphore is simply an array of zero-initialised counters. You can increment or decrement them, or check whether a counter stores a non-zero value. What’s the trick, you ask? A semaphore can never become negative: if you try to decrement a counter too far, semop() will block until some other process increments it enough. There is also a “wait-for-zero” operation, which blocks semop() until the counter reaches zero. You can request that semop() performs more than one operation at a time, and they will happen atomically: either all operations succeed or none of them do, and no process can interleave between semop() checking a semaphore’s value and changing it. This is very different from naive implementations using a shared integer variable.
For our case, we need a semaphore with two counters: the read lock and the write lock. When shmwrite wants to change the shared memory contents, it increments the write lock, and decrements it back when done. shmread waits for the write lock to become zero and increments the read lock; shmwrite, in turn, waits for the read lock to reach zero. This makes running shmwrite and shmread mutually exclusive. Multiple concurrent readers and writers are permitted, though, which may not be what you want (but is OK in our case).
The synchronisation code for shmread and shmwrite is almost identical and looks like this:
...
#include <sys/sem.h>
/* This is for shmread.c */
struct sembuf rlock[2] = {
	{1, 0, 0},
	{0, 1, SEM_UNDO}
};
struct sembuf runlock = {
	0, -1, 0
};
int main(int argc, char **argv)
{
...
semid = semget(key, 2, IPC_CREAT | 0666);
semop(semid, &rlock[0], 2);
printf("%s\n", seg);
semop(semid, &runlock, 1);
...
}
Note that we re-use the shared memory key for the semaphore. Again, we start with semget(), which creates the semaphore if it doesn’t exist and sets up permissions. The second argument is the number of semaphores (counters) we want. Here, we need two: the read lock (number 0) and the write lock (1). semop() takes a semaphore id and a struct sembuf[] array describing operations to perform; its third argument is the array size. The first member of struct sembuf, sem_num, refers to the semaphore number (zero-based). The next one, sem_op, is basically a counter increment (or decrement, if it’s negative), or the “wait-for-zero” operation if it is zero. Please spend a second understanding how the lock and unlock operations are expressed in these terms. Operations aren’t undone automatically on process termination unless you include SEM_UNDO in sem_flg (the third field). It is a really bad idea to exit while holding a semaphore locked: other processes may spend ages waiting for it to unlock, which in this scenario may never happen.
With this fix in place, you should no longer see broken strings. The reader can still lose some text, however, as it has no way to signal to the writer that it is done with the current line. In a nutshell, semaphores are similar to pthread mutexes, except that they work system-wide across processes, not just between threads sharing a single address space. We discussed the System V flavour here; there are also POSIX semaphores, which are somewhat simpler to use.
A typical Linux system will have many shared memory regions (mostly private, as the zero key suggests), and a few semaphores as well.
Sockets revisited
Shared memory enjoys the benefit of being lightweight, but sometimes you need a higher-level abstraction. Sockets come in handy here, and although two Linux processes can certainly communicate via TCP or UDP sockets (presumably bound to a loopback device), there is a better alternative. Switch to a terminal window, and do ls -l /run/dbus. On my system, this yields:
total 0
srw-rw-rw- 1 root root 0 Mar 7 17:04 system_bus_socket
There is a single file, and the s character at the start of the permissions string indicates a socket. It’s a Unix domain socket, designed especially for local, non-networked process communication. Basically, Unix domain sockets just copy data from a buffer in one process to another. Network sockets, on the other hand, pass it to the network stack for protocol parsing, checksumming, firewalling and all the other funky things you don’t need for local data.
This particular socket belongs to D-Bus, the very important piece of plumbing that ties all the components of a modern desktop Linux system together. Note that it also has permissions associated with it, but given the role it plays, anyone can connect to it.
Unix domain sockets are very similar to the TCP and UDP sockets we discussed back in LV007. The only notable difference is that Unix sockets belong to AF_UNIX, not AF_INET, and you specify a filesystem path rather than an IP address for them. To draw the parallels, we’ll take the UDP example code from LV007’s Core Technologies and adapt it slightly. Only the relevant parts are shown below to save space, but the complete original code can be found at www.linuxvoice.com/mag_code/lv07/coretech007.tar.
#define SOCKET_PATH "/tmp/coretech"
...
struct sockaddr_un server;
sock = socket (AF_UNIX, SOCK_DGRAM, 0);
server.sun_family = AF_UNIX;
strncpy(server.sun_path, SOCKET_PATH, sizeof server.sun_path);
Let’s see what’s going on here. First, server is now of type struct sockaddr_un (“un” for Unix), not sockaddr_in as before. Next, the family is set to AF_UNIX both in server and in the socket() call, and we also set the socket path (the sun_path member) to /tmp/coretech. Special files like sockets are usually kept either in /tmp for short-lived processes or in /run for system-level daemon services. A similar change is made to the client code, where you should also disable broadcasting, but everything else stays pretty much the same.
If you run this program, you’ll see random numbers flowing through the console. Maybe it’s not too impressive now, but Unix sockets can also do some magic that standard networking sockets just can’t (we’ll see it in a moment). You can also check that the program really creates /tmp/coretech, and that this special file is left behind when it exits. Didn’t I say that cleaning up after yourself is a good habit? Anyway, it’s an inconvenience, so Linux provides abstract Unix sockets. These exist purely in memory and don’t leave any traces in the filesystem. To make a Unix socket abstract, just set the first byte of its name to a NUL value (0), like this:
#define SOCKET_NAME "@/tmp/coretech"
...
strncpy(server.sun_path, SOCKET_NAME, sizeof server.sun_path);
server.sun_path[0] = '\0';
Do rm /tmp/coretech, and run the program again. You’ll see numbers flowing as before, but there will be no socket file. Abstract socket names are just string identifiers, and can look however you want. Following the filesystem path model is a common convention, however. The @ prefix is also chosen arbitrarily, as we overwrite it with NUL in the last line.
To list abstract sockets on your system, use netstat:
$ netstat -nx
...
unix 3 [ ] STREAM CONNECTED 17104 /run/dbus/system_bus_socket
unix 3 [ ] STREAM CONNECTED 20463 @/tmp/dbus-OQLzhYGMTI
unix 2 [ ] DGRAM 42977 @/tmp/coretech
...
Everything that starts with @ is an abstract socket. Note that this displays non-abstract Unix sockets as well.
Offloading work
Sockets, regardless of their type, are just a means to convey data. However, Unix sockets are a bit more capable. As they work only locally, they can be sure that both communicating sides are local Unix processes. This means they are able to pass more complex objects than just raw bytes.
Currently, these objects can be Unix credentials (which we won’t discuss) or file descriptors. In either case, they are sent and received as “ancillary” (or control) messages. These messages are not part of the data payload, and you use the sendmsg() and recvmsg() functions to send and receive them as predefined C structures. Messages may come in batches, so there is a set of macros designed to quickly decode and traverse the contents of the control messages buffer.
Moving the file descriptor to another process is a simple way to offload a job, like handling an incoming connection. It’s actually quite common in Linux, as fork() preserves open file descriptors. With Unix sockets, however, you can hand out a file descriptor to a completely unrelated process, as long as it is willing to accept it.
Take a popular mailserver, Postfix, as an example. It needs to cut off spambots quickly without incurring significant additional cost for legitimate clients. To facilitate this, the Postfix server usually runs the postscreen process, which examines incoming connections and hands them off to real SMTP processes if they pass security checks. All of this happens transparently to the connecting user, who shouldn’t notice the servicing process change.
Postfix is a large and complex program, but the code to pass file descriptors is quite simple. You can find it in src/util/unix_send_fd.c and src/util/unix_recv_fd.c. Below is a cut-down, simplified version of the unix_send_fd() function:
int unix_send_fd(int fd, int sendfd)
{
struct msghdr msg;
union {
struct cmsghdr just_for_alignment;
char control[CMSG_SPACE(sizeof(sendfd))];
} control_un;
struct cmsghdr *cmptr;
memset((void *) &msg, 0, sizeof(msg));
msg.msg_control = control_un.control;
msg.msg_controllen = sizeof(control_un.control);
cmptr = CMSG_FIRSTHDR(&msg);
cmptr->cmsg_len = CMSG_LEN(sizeof(sendfd));
cmptr->cmsg_level = SOL_SOCKET;
cmptr->cmsg_type = SCM_RIGHTS;
*(int *) CMSG_DATA(cmptr) = sendfd;
...
if (sendmsg(fd, &msg, 0) >= 0)
return (0);
return (-1);
}
fd is the Unix socket descriptor, and sendfd is the file descriptor that Postfix wants to pass. The CMSG_SPACE() macro returns the number of bytes required for an ancillary message with a given payload size. struct cmsghdr describes the control message and is often combined with a buffer for proper alignment. struct msghdr wraps one or more control messages and is a type that sendmsg() and recvmsg() operate on. Usually, you manipulate it with CMSG_*() macros: CMSG_FIRSTHDR(), which returns a pointer to the first message, and CMSG_NEXTHDR(), which advances to the next one.
Here, a single message of type SCM_RIGHTS is created. It indicates to Linux that the payload is an array of file descriptors, although unix_send_fd() sends only one descriptor at a time. The cmsg_len field contains the data length, including any necessary alignment, and again we use a helper macro, CMSG_LEN(), to do the maths for us. Finally, we get a pointer to the data buffer with CMSG_DATA() and copy sendfd (a single int value) there. Later, sendmsg() sends the data in msg, and another process receives it with recvmsg(). From this point on, both processes can use their file descriptors to refer to the same resource, albeit the fd values can differ.
The real unix_send_fd() and unix_recv_fd() functions in Postfix are a bit more elaborate, as they account for differences between Unix variants, but hopefully you’ve got the idea.
More to try
The IPC primitives we cover here are arguably the most popular ones in Linux. But historically, Unix provided many more mechanisms, and they are still available.
First, there are named pipes, or FIFO channels. Like non-abstract Unix sockets, they look like special files (you can create one with the mkfifo command), and they are good for piping output between unrelated processes. There are also message queues, which may come in handy if your process communication fits a messaging pattern.
Command of the month: ipcs, ipcmk and ipcrm
This issue, we speak about IPC primitives, so it’s quite natural to declare as command of the month the ones you use to work with them.
Actually, it’s not one command but three, coming as part of the util-linux package. ipcs lists the message queues, shared memory segments and semaphores you have access to. If you call it as root, you’ll get everything in the system. If you run it now, you’ll probably see a decent list of memory segments and a few semaphores: these IPC mechanisms are used extensively on all Linux systems. If you feel this is not enough, you can create new primitives with ipcmk. This tool creates the requested resource and prints its ID. You can set options like the shared memory segment size or the number of semaphores and, optionally, permissions. When you decide you no longer need a resource, use ipcrm to remove it. Both keys and IDs (as returned by ipcmk) are accepted.
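A sample session might look like this (the ID you get back will differ, and the grep pattern for extracting it is a convenience, not part of the tools):

```shell
# Create a 4KiB shared memory segment; ipcmk prints its ID
id=$(ipcmk -M 4096 | grep -o '[0-9]*$')

# The new segment should now show up in the listing
ipcs -m | grep "$id"

# Remove it by ID when done
ipcrm -m "$id"
```

ipcmk -S creates semaphore sets the same way, taking the number of counters instead of a size.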
Further reading
Advanced Programming in the UNIX Environment is not the only resource available. The system calls and library functions we mention here have dedicated man pages. Moreover, IPC issues are covered well in the “miscellaneous” (seventh) section of man. There is a chapter dedicated to Unix sockets (man 7 unix) and the newer POSIX semaphore API (man 7 sem_overview). Ancillary messages are covered in man 3 cmsg. It’s rare to have overview-style man pages, but these are lucky exceptions.
Valentine Sinitsyn develops high-loaded services and teaches students completely unrelated subjects. He also has a KDE developer account that he’s never really used.