Yet another messaging system?

or

Why Zephyr sucks, and a rationale for the switch to Gale

The Gale system has recently come under fire for ``reinventing the wheel'' and as an example of ``egotistical programming'' (rewriting something just to have ownership). Several people have challenged the necessity of making a complete break from the Zephyr system, and compared the upgrade process to Microsoft's infamous Office upgrade tactics.

After all, Zephyr mostly works for most people; why not concentrate on fixing its flaws, rather than starting from scratch? The source code is freely available, and a group at MIT is actively developing the system; they would probably welcome improvements and bug fixes. Large schools like MIT and CMU find Zephyr adequate for campus-wide messaging and notification, so can't we make it meet our own relatively modest needs?

These are good points, and I did not lightly make the decision to abandon Zephyr. In particular, I've taken a fairly close look at the Zephyr system itself (and I credit Heath Hunnicutt for actually diving into the Zephyr codebase and attempting to repair it) to determine the feasibility of fixing its problems. I believe strongly in re-use, and would rather not write any code I don't have to; Gale does make substantial use of several freely-available code libraries.

However, I do not believe in maintaining legacy code at the expense of having a modular, robust, flexible and useful system. In many cases, the long-term maintenance effort to keep the old system alive causes more pain than the short-term hassles of switching to a new system. This is particularly true where a new system is demonstrably more extensible and reconfigurable in such a way that it can likely meet future needs without requiring another such catastrophic replacement itself.

I believe the transition from Zephyr to Gale meets these criteria, and I intend to demonstrate as much in this document. I make the case that the long-term benefits of using Gale outweigh the short-term difficulties of abandoning Zephyr, and that Zephyr's flaws are sufficiently ingrained that they cannot be easily corrected without such a complete re-engineering effort.

Why we like Zephyr

Before we launch into a discussion of the systemic weaknesses of Zephyr, it seems only fair to acknowledge its not-inconsiderable strengths.

First and foremost, Zephyr discovered a very nice ``sweet spot'' for human communication: a message-oriented, background-operation, (almost) always active message service. Its mode of operation fits very nicely between other, more traditional communication protocols, such as e-mail, ``talk'', ``write'', IRC, and similar systems. Zephyr has become the local chat and quick private message tool of choice on the systems I use, and it wouldn't be exaggerating too much to say that a fair-sized community has grown up around its public instance mechanism.

Gale borrows its fundamental modes of user interaction from Zephyr, and offers few improvements in that regard. We must credit Zephyr for its messaging paradigm; Gale merely offers technical superiority.

Secondly, Zephyr does mostly work, and a rich set of utilities and tools has been developed for it. It has sophisticated graphical and console-based clients, notification services for mail, syslog, and other events, user-location services, and a plethora of individual users' hacks built around it. In short, it has the breadth of support and coverage to be expected from a mature legacy system.

Okay, let's drop the other shoe already

However, Zephyr has also proved to be a major pain in the butt, both for its users and for the system administrators who must operate it. Before launching into a discussion of why I feel its flaws are inherent and incorrigible, I will enumerate the visible symptoms of its problems, the reasons it's difficult to simply use as-is.

Stability. The Zephyr system is constantly failing in small and large ways. The server consumes ever-increasing amounts of memory until it starts thrashing and brings message service (and anything else running on the same machine) to a standstill. When the server is restarted, everyone must then restart their windowgram client (zwgc) processes, or else they won't receive any more messages (but they receive no other indications of failure). Even when the server is kept running, it will often ``forget'' about users; unless they notice that they aren't receiving messages, they won't know to restart their zwgc, and they become disconnected from the system. The user location service is flaky, and ``zombie login entries'' often remain visible for months after the user has actually logged out.

When multiple servers are running, the situation worsens. A significant fraction of the messages from one server's clients to another's will simply get dropped without warning. When the servers are restarted, they often perform the ``braindump'' process incorrectly, leading to garbled data and, once more, the requirement for users to restart their zwgc processes. Multiple servers have an even greater tendency to forget about users, and the problem becomes even more insidious because they will still receive messages from some servers, and will probably not be aware that messages to them from other servers are disappearing into the void.

These problems come and go; they appear to be related to server load, network traffic, amount of Zephyr traffic, and the phase of the moon. But Zephyr is in no sense a reliable system; users complain about it, sysadmins must deal with it, and system resets require the cooperation of everyone logged in to reset their own client.

Security. Zephyr has Kerberos-based security. I will give it the benefit of the doubt and assume that, in a Kerberos environment, the security is convenient, strong, and ubiquitous, and I will only mention in passing the weaknesses in Kerberos itself.

Even then, Kerberos is not particularly widespread, and can't be installed lightly. One wouldn't install Kerberos merely to support Zephyr. Kerberos requires replacement of most of the security-related system services, a complex setup procedure, the dedication of a keyserver machine, and other radical changes. Furthermore, Kerberos doesn't interoperate well between realms, so it needs to be installed and maintained on an organization-wide scale. In many organizations, this is not going to happen.

And Zephyr without Kerberos is absolutely and totally insecure. Not only can users anywhere on the Internet read any message (including so-called ``private'', personal messages), forge messages with impunity and untraceability, and generally bypass the ACLs and other protection schemes, but they can even stop and restart individual users' clients. Denial of service attacks are trivial. These attacks don't require sophisticated hacking; you can do most of them by simply running the Zephyr system yourself and configuring it appropriately. Anyone who uses Zephyr without Kerberos relies on the goodwill (or apathy) of the entire Internet to maintain their privacy and integrity.

Some people don't care about this, but some do. (Enough so that Zephyr was built to use Kerberos, for example.)

Scalability. Large organizations such as MIT and CMU successfully use Zephyr on a campus-wide scale (though I do not know how many resources they devote to this task), so Zephyr does have some scalability. You can run many Zephyr servers to accomplish a sort of load-balancing. These schools do have the advantage of central administration for all the machines so connected.

Zephyr does not, however, scale well across heterogeneous systems. Environments like Caltech's, which have many small, independently administered clusters of systems, each with their own user population and configuration, make the widespread use of Zephyr difficult. Each such cluster can run its own, independent Zephyr system, but then users can't communicate from one system to the next. If they are all linked together, then username conflicts become problematic; since the servers are all under different control, with different attitudes towards downtime and upgrades, the reliability problems are exacerbated.

Zephyr also does not work well across slow, relatively lossy network links, such as the Internet as a whole. While a single client could probably operate over a PPP link, servers distributed geographically -- even if connected by relatively high-speed T1 lines -- cause traffic lossage and synchronization problems. Zephyr would certainly not scale up to the Internet as a whole, which puts it at a disadvantage when compared to protocols like ``talk'' or IRC which allow one to communicate with people on arbitrary remote hosts.

How Zephyr Works

So far, we still have not addressed the fundamental criticism stated in the introduction: Why not devote your time to improving Zephyr, rather than rewriting it entirely?

To answer this question, I will describe how Zephyr works, show the fundamental causes of its problems, and demonstrate that its basic design prevents meaningful solutions without an overhaul (and loss of compatibility) at least as difficult and inconvenient as the complete rewrite. This section will, of necessity, have a rather technical focus and assume a working knowledge of networking technologies, distributed application design, and Zephyr itself (from a user's point of view, at least).

I begin with a brief overview of the Zephyr system.

Zephyr messages (Zephyrgrams) contain (among other, unimportant things) the following fields, all of which contain textual data: the message class, the message instance, its selected recipient user, and the message body. Users select which messages to receive with a set of class, instance, recipient triples (subscriptions).

The user's windowgram client (zwgc) is responsible for receiving and displaying messages the user subscribes to; the user runs a utility program (zwrite) to send a message. Each host runs a host manager (zhm) process; within a Zephyr system one or more hosts run Zephyr server (zephyrd) processes. These programs communicate with each other via acked UDP.

The zhm keeps track of the location of a zephyrd process currently in use; if that zephyrd becomes too slow or fails entirely the zhm will find and select a different one. When the user starts a zwgc, it sends a message to the zhm, which forwards it to the currently selected server; this message describes the location of the zwgc and the user's subscriptions. The server records these, and forwards the message to the other servers, which do likewise. The servers enforce any ACLs configured for different message classes, and prevent users from subscribing to each other's private messages -- though, without Kerberos, you can trivially forge your username to bypass these checks.

When the user sends a message, the zwrite process transmits the Zephyrgram through the zhm to its server. The server compares the message's class, instance, and recipient fields against its stored list of subscriptions; if it finds any matches, it transmits a copy of the Zephyrgram directly to the appropriate zwgc processes for the users who subscribe to that message. The zwgc processes then format and display the messages for the users to read.
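The matching step just described can be sketched in a few lines of Python. This is a minimal illustration rather than Zephyr's actual code: the names Zephyrgram and SubscriptionTable and the client labels are hypothetical, and matching is assumed to be exact equality on all three fields, which ignores any wildcard handling the real server may perform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Zephyrgram:
    """Only the four fields the text names; all are plain text."""
    cls: str
    instance: str
    recipient: str
    body: str


class SubscriptionTable:
    """Hypothetical stand-in for a server's stored subscription list."""

    def __init__(self):
        # (class, instance, recipient) -> set of subscribed client labels
        self._subs = defaultdict(set)

    def subscribe(self, client, cls, instance, recipient):
        self._subs[(cls, instance, recipient)].add(client)

    def unsubscribe_all(self, client):
        # What the server does when a zwgc announces that it is terminating.
        for clients in self._subs.values():
            clients.discard(client)

    def match(self, gram):
        # Assumption: exact match on the triple, no wildcards.
        key = (gram.cls, gram.instance, gram.recipient)
        return sorted(self._subs.get(key, ()))
```

Each client label returned by match would then receive its own copy of the Zephyrgram, sent directly from the server.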

When the zwgc terminates, it sends a message to the server (via the zhm) telling it that it is doing so; the server removes its subscriptions from the list and forwards the notice to the other servers, which do likewise.

What's wrong with this picture

This architecture leads directly to the problems described with Zephyr. Readers familiar with the pragmatic aspects of distributed systems design may already realize why, but I will describe the design flaws and their ramifications in detail.

First, note that in the presence of perfectly running systems and networks, orderly logouts and shutdowns, infinite server uptimes and completely bug-free software, Zephyr's design does work reliably, albeit not necessarily efficiently or securely. However, we do not live in such a world, and any large-scale system must be able to handle small failures gracefully and without requiring excessive attention (such as global resets) to recover from them.

In particular, the Zephyr architecture is extremely failure-prone and inadequate in several respects.

Network protocol choice. The use of acked UDP means that failure can only be detected by timeout, and the responsibility for connection negotiation, retransmission, and termination lies with the Zephyr code itself rather than with the tried-and-true operating system mechanisms in TCP for doing these things. For example, if the server crashes, is restarted, or has a corrupt subscription database that causes it to lose track of one of the clients, the client has no way to know about this; it simply stops receiving message packets. Similarly, if a zwgc crashes or is killed drastically without having a chance to notify the server that it has gone away, the server doesn't know to remove its entries from its tables; the login entry may persist forever.
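To see why a timeout is the only failure signal available, consider a minimal sketch of an acked-UDP sender in Python. The function name, the b"ACK" reply, and the timeout and retry values are assumptions made for illustration; they are not Zephyr's wire format or its actual retransmission policy.

```python
import socket


def send_with_ack(sock, dest, payload, timeout=0.2, retries=3):
    """Send payload over UDP, retransmitting until an ACK arrives.

    Returns True if an ACK was received and False if every attempt
    timed out. The application carries the whole burden here: UDP
    itself never reports that the peer is gone.
    """
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(payload, dest)
        try:
            reply, _addr = sock.recvfrom(1024)
        except OSError:
            # socket.timeout is a subclass of OSError. A timeout and
            # any other receive error both mean "no ACK this round".
            continue
        if reply == b"ACK":
            return True
    return False
```

A return value of False tells the sender only that no acknowledgment arrived in time. A crashed peer, a lost packet, and a slow network all look the same, and a receiver that merely waits for incoming packets gets no signal at all when the sender forgets about it.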

However, Zephyr could probably (with some effort) be retrofitted to use TCP. It would mean an incompatible protocol, but this problem is not beyond correction.

Subscription database replication. Note that for proper operation, each server must maintain an up-to-date replica of the list of the location and subscriptions for every client connected to Zephyr (not just the ones that talk to it). When changes happen (users log in and log out), the servers must notify each other to replicate the changes.

Keeping a replicated set of copies of a database consistent in the presence of constant, asynchronous updates from multiple sources and occasional failures of servers and networks is a notoriously difficult problem. When a new server comes up, it receives a ``braindump'' from another, neighboring server to get the current list of subscriptions. However, this process is error-prone; any sort of failure -- or even temporary performance degradation -- in the network or the servers can cause them to lose synchronization. Once this happens, users will find that some people can send them messages, but others can't. This sort of hard-to-detect, intermittent, location-variable problem is difficult to diagnose or correct.
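The failure mode can be reproduced with a toy simulation in Python. Everything here is hypothetical (the ReplicaServer class, the lost_to parameter that stands in for a dropped server-to-server update), and it models only the bookkeeping, not Zephyr's real replication code.

```python
class ReplicaServer:
    """Toy server holding a full replica of every client's subscriptions."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.subs = {}  # client -> set of (class, instance, recipient)

    def record(self, client, triples):
        self.subs[client] = set(triples)

    def client_login(self, client, triples, lost_to=()):
        # The server the client contacted records the subscriptions,
        # then forwards them to every peer. Names listed in lost_to
        # simulate updates that never arrive.
        self.record(client, triples)
        for peer in self.peers:
            if peer.name not in lost_to:
                peer.record(client, triples)

    def deliver(self, triple):
        # Clients this server would send a matching message to.
        return sorted(c for c, subs in self.subs.items() if triple in subs)


a, b = ReplicaServer("a"), ReplicaServer("b")
a.peers, b.peers = [b], [a]
personal = ("message", "personal", "alice")
a.client_login("alice", [personal], lost_to={"b"})
```

After the single lost update, a.deliver(personal) still finds alice while b.deliver(personal) finds nobody: senders whose host manager uses server a reach her, senders using server b do not, and neither side sees an error.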

This problem is further exacerbated by interaction with the use of acked UDP; when a client dies, each server must decide individually how and when to time it out and remove it from its database. But even if a more reliable network protocol were used, the way Zephyr does things would remain inherently unstable.

What's more, if all the servers go down (this is easy if a cluster only runs a single server), the subscription list is totally lost. All the clients will quietly cease receiving messages until their owners restart them. Since the server has a number of memory leaks, this makes life difficult for the system administrators, who would like to restart them periodically but must alert (and annoy) their user population to restart their zwgc processes every time they do so.

These problems cannot be easily fixed without a complete redesign.

Server complexity. The servers are responsible for a number of Zephyr features. They must maintain the subscription list and distribute notices to users. They must keep track of user locations, and allow users to search the user location database on request. They manage security, enforcing ACLs and authenticating connections (in the presence of Kerberos; otherwise they fake it).

However, they are also critical to the functioning of Zephyr, and are required to operate properly without failure to keep their databases up-to-date and synchronized. This mixture of complexity and dependency can't help but lead to a system that's unreliable as a whole, and indeed, the Zephyr servers have memory leaks, lose track of subscriptions, fail to retransmit packets, and generally muck up the system. This goes beyond any single bug (which might be fixed) and impacts the maintainability and reliability of the system as a whole.

These problems cannot be easily fixed without a complete redesign.

Security. Since Zephyr relies on the servers for security, this means the Zephyr system requires a complex system like Kerberos that can provide key distribution and connection-authentication services from servers to clients. It also means that users cannot easily set up their own access control systems. It would be very difficult to make Zephyr secure without requiring Kerberos.

Even worse, the Zephyr architecture requires that every server be equipped with the same security configuration (ACLs and so forth) and operate in the same security domain (Kerberos realm), with a shared user database. This means that secure Zephyr requires the operator to trust every Zephyr server, which means every server must operate under centralized control. This might work with MIT's Athena system, but is not likely to work at Caltech, or on the Internet in general. If even one server operates under external control, it is a potential security breach.

These problems cannot be easily fixed without a complete redesign.

Scalability. Even ignoring the issues of security control discussed above, the architecture has some basic difficulties scaling beyond a single, well-connected organization.

The requirement that each server maintain a list of every client, and be able to send messages directly to them, is fundamentally an unscalable requirement; it means that each server's memory size and response time will worsen in direct proportion to the total number of clients on the system. This is what load-balancing is supposed to alleviate, but in Zephyr it cannot. The requirement of keeping the client lists in sync worsens the problem further.

Since each server sends messages directly to the client zwgc processes, the system behaves poorly over long-distance network links. Ten servers on one side of a link sending packets to a hundred clients on the other side will make poor use of network bandwidth, even if they have good retransmission algorithms (and Zephyr does not). A single TCP connection carrying all the messages would do much better, but that would require servers to coordinate messages among themselves and arrange themselves in a spanning tree, which is fundamentally in opposition to the current protocol design.
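The two scaling arguments above come down to simple counting, sketched here in Python. These are idealized figures under stated assumptions: the function names are hypothetical, clients are assumed to be spread evenly across servers, and acknowledgments and retransmissions are ignored.

```python
def entries_per_server(total_clients, servers, replicated):
    """Subscription-list entries each server must hold."""
    if replicated:
        # Every server tracks every client, so adding servers
        # does nothing to shrink the per-server list.
        return total_clients
    # Assumed alternative: each server tracks only its own clients
    # (ceiling division for an uneven split).
    return -(-total_clients // servers)


def link_crossings(remote_subscribers, direct):
    """Copies of one message that cross a long-distance link."""
    if direct:
        # One copy per subscriber on the far side.
        return remote_subscribers
    # A single server-to-server connection carries one copy,
    # which is fanned out on the far side.
    return 1 if remote_subscribers else 0
```

With a thousand clients and ten servers, the replicated design holds 1000 entries per server against 100 for the local-only design; a message with a hundred subscribers across the link crosses it 100 times with direct delivery and once over a shared connection.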

These problems cannot be easily fixed without a complete redesign.

Why Gale will do better

Gale is the complete redesign that solves these problems; moreover, its architecture makes it likely that it can continue to serve as a messaging system without radical restructuring even as requirements change and features get added.

Server simplicity. The Gale servers do exactly one thing: they distribute messages according to subscriptions (with a far simpler subscription mechanism than Zephyr). This multicast distribution is the one thing the servers really have to do; everything else can be accomplished by using the server architecture as a transport. I haven't changed the server code in several weeks, and I do not intend any major changes to it in the near future, even though I will add a great deal of functionality to Gale.

In general, changes to the Gale system, including messaging conventions, security techniques, and user interface, can happen without changing the server system at all. This means that, unlike Zephyr, changes to the Gale system can be made quietly and incrementally without requiring system replacement or even shutdown.

Message distribution protocol. Gale clients connect to the local server via a TCP link; only that server needs to keep track of that client's subscriptions. The servers exchange messages among themselves via a spanning tree of TCP connections, thus making good use of network bandwidth and establishing a high degree of scalability.
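The distribution scheme can be illustrated with a small Python simulation. It is a sketch under loud assumptions, not Gale's implementation: the Server class is hypothetical, a subscription is reduced to a single category string, links are ordinary object references rather than TCP connections, and every message is flooded to every server whether or not it has interested clients.

```python
from collections import defaultdict


class Server:
    """Toy server that knows only its own clients' subscriptions."""

    def __init__(self, name):
        self.name = name
        self.links = []                 # neighbouring servers in the tree
        self.local = defaultdict(set)   # category -> local clients
        self.delivered = []             # (client, body) pairs, for inspection

    def subscribe(self, client, category):
        self.local[category].add(client)

    def publish(self, category, body, came_from=None):
        # Deliver to local subscribers.
        for client in sorted(self.local.get(category, ())):
            self.delivered.append((client, body))
        # Forward along every tree link except the one the message
        # arrived on. Because the links form a tree, each server sees
        # the message exactly once and each link carries one copy.
        for peer in self.links:
            if peer is not came_from:
                peer.publish(category, body, came_from=self)


def link(a, b):
    """Join two servers with a bidirectional tree edge."""
    a.links.append(b)
    b.links.append(a)
```

Linking three servers in a chain and publishing at one end delivers the message to subscribers on all three, while no server ever needs to learn about clients attached elsewhere.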

The use of TCP means that the Gale system uses the well-tested and efficient reliable transport mechanisms built into the operating system rather than Zephyr's roll-your-own acked UDP. Clients and servers can reliably detect when the other end breaks the connection (either deliberately or through failure) and reconnect, failing over to a different server as necessary. Servers don't need to keep any databases in sync between themselves, and you can safely shut down and restart all the servers; the clients will merely retry, reconnect, and re-establish their subscriptions.
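The retry-and-resubscribe behavior might look like the following Python sketch. The function name, the server list, and the line-based "SUB" wire format are all assumptions for illustration; the text does not describe Gale's actual protocol.

```python
import socket


def connect_with_failover(servers, subscriptions, timeout=1.0):
    """Connect to the first reachable server and resend subscriptions.

    servers is a list of (host, port) pairs tried in order.
    Returns the connected socket, or None if no server answered.
    """
    for address in servers:
        try:
            conn = socket.create_connection(address, timeout=timeout)
        except OSError:
            # Refused or timed out: fail over to the next server.
            continue
        # Hypothetical wire format: one "SUB <category>" line each.
        for category in subscriptions:
            conn.sendall(f"SUB {category}\n".encode())
        return conn
    return None
```

Because the client holds its own subscription list and replays it on every connection, a restarted server starts with an empty table and is repopulated by its clients. On an established connection, a recv call that returns an empty bytes object indicates that the peer closed it, which is the cue to call this function again.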

Security. Gale uses end-to-end strong cryptography based on public-key encryption in the clients. This removes the burden of security and trust from the servers, allowing decentralized administration while maintaining privacy and integrity for personal messages.

Gale also allows many different ``domains'' to coexist, each with its own set of users and its own security policies. Users can send messages from one domain to another; if they transfer public key files, they can also send encrypted messages with no problems.

Summary

Commercial software companies spend a great deal of effort to maintain backwards compatibility; in many cases, e.g. Microsoft operating systems, this is widely believed to be at the expense of quality and usability. Re-use is a powerful tool, but one should not always maintain legacy code at the expense of quality and future maintenance.

I believe the Gale system is a reasonable design that can, with incremental changes and additions, meet the need for a messaging system in the foreseeable future. In this document I have presented the case for the (admittedly disruptive) switch from Zephyr. If unanswered concerns, questions, or arguments remain, please contact me.

Thank you.

