Thoughts on Network Management at the University of Minnesota

This document originally appeared as "Thoughts on Network Management at the University of Minnesota," ConneXions, Vol 6 No 12, December 1992, page 16. While the specific technology and numbers are dated, the design principles have stood the test of time.

University Networking Services, +1 612 625 8888, 130 Lind Hall, unet@unet.umn.edu

<URL:ftp://mail.unet.umn.edu/unet/wiring/net-management;type=a> <URL:gopher://mail.unet.umn.edu/0/home/ftp/unet/wiring/net-management> July 1992

I recently attended a vendor presentation that described a new network management product. The product used advanced database technology and could store dozens of attributes on thousands of objects. Using this data, the product could automatically display many different views of the data in full color. It was customizable, extensible and in every way a state-of-the-art system. Yet I left the presentation feeling vaguely uncomfortable with what I had seen.

Consider this parallel: in the late 1970s, most U.S. corporations believed that inventory was good and more inventory was better. They constructed huge warehouses and elaborate inventory tracking systems. Then came the 1980s and the realization that this inventory, rather than helping its owners, was actively hurting them. There were some major ancillary reasons for this: the inventory tied up working capital, its very maintenance and tracking drew resources away from other projects, and so on. Still, the real damage was that it masked fundamental problems with the way that the corporation was organized. Lowering the level of inventory uncovered these problems, sometimes painfully. Correcting the problems resulted in major improvements in corporate operations and ultimately increased profits.

A network management system performs many functions:

it tracks what equipment is where
it tracks how the equipment is organized into a network
it looks for malfunctioning equipment
it provides information required to diagnose problems

So, if this parallel is to hold, what problems are uncovered by looking at information as the inventory of a network management system?

Ancillary: I estimate that it would cost over $500,000 to create a database that described our existing network in the level of detail that this network management system supports.
Ancillary: I estimate that it would cost over $200,000 each year to maintain the database.
Real Issue: None of this investment by itself delivers any service to the users. None of this investment affects the mean time between failures, although it might have a tiny impact on the mean time to repair. (The reason for the tiny impact is simple: it doesn't matter if it takes one minute or one day to diagnose a problem if six weeks are needed to order and install the replacement unit.)
Real Issue: Whenever incorrect data are encountered, they can significantly delay problem diagnosis. The incorrect data divert attention away from the real cause or toward a spurious one.

New Directions

I suggest that these problems can be addressed by following these principles:

Assertion: Data, left to itself, will deteriorate.

Principle: Continually use the data (i.e., draw conclusions from it) and cross-check those conclusions to identify incorrect data.

Assertion: If a datum is entered more than once, one instance will be wrong.

Principle: Distinguish between entering data -- that should only be done once -- and referring to the data, which will be done often.

Assertion: Creating a model of the network constitutes entering the data more than once.

Principle: Have the actual network be its own model.

A network management system can be constructed using these principles that has four types of data: existence, sampled, entered, and derived.

Existence data are data implied by the existing tangible network. An example of such data is "Cisco IGS/L router, serial number 0002LP, located in the SW corner of room 4 of Pillsbury Hall." You obtain data of this type by traveling to the physical location(s) and directly observing the data. By definition, this type of data cannot be incorrect or out-of-date in any way.

Sampled data are obtained periodically (usually automatically) and stored for later reference. Such data are never completely up to date, since the source may have changed by the time they are transferred and stored. Often sampled data are somewhat static and the stored versions still may be useful for other purposes. Barring bugs in the implementation or transmission errors, these data are never incorrect. It is often useful to retain multiple versions of these data for comparison or analyses. Examples of this type of data are router configurations, routing tables, ARP tables, etc.

Entered data are (usually manually) entered and stored for later use. Typically, these data contain only the minimum information required to tie the other types of data together and are thus small in comparison. Examples of this type of data are DNS information, lists of routers, etc.

(There are no hard boundaries between sampled and entered data. For example, a router configuration could be considered entered data.)

Derived data are created from other data. One example of this type of data is comparing sampled routing tables to entered lists of assigned network numbers and looking for discrepancies. Another example is expanding an entered list of devices by looking up addressing, maintainer, and other information and collecting that expanded information into one place.
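The first example above can be sketched in a few lines. This is only an illustration, not the program used at the University: the function name, the CIDR-style network strings, and the sample addresses (taken from the reserved documentation ranges) are all assumptions made for the sketch.

```python
import ipaddress


def find_route_discrepancies(sampled_routes, assigned_networks):
    """Compare sampled routes against the entered list of assigned networks.

    Both arguments are iterables of network strings such as "192.0.2.0/24".
    Returns a dict holding two sorted lists of network strings.
    """
    sampled = {ipaddress.ip_network(n) for n in sampled_routes}
    assigned = {ipaddress.ip_network(n) for n in assigned_networks}
    return {
        # Seen in a sampled routing table but missing from the entered list.
        "routed_not_assigned": [str(n) for n in sorted(sampled - assigned)],
        # Entered as assigned but not seen in any sampled routing table.
        "assigned_not_routed": [str(n) for n in sorted(assigned - sampled)],
    }


# Illustrative data only (documentation address ranges, not real networks).
report = find_route_discrepancies(
    sampled_routes=["192.0.2.0/24", "198.51.100.0/24"],
    assigned_networks=["192.0.2.0/24", "203.0.113.0/24"],
)
print(report)
```

Either list being non-empty points at a datum worth checking: the entered list may be stale, or the network may be carrying something it should not.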

In Operation

Our current network management system is based on the above principles. While not completely implemented, enough is in place to demonstrate this system is feasible.

We are currently making extensive use of sampled data. Each night, all routers are queried and the data saved. Each week, all Shiva FastPaths are queried.

Our entered data break down as follows:

	1 MByte	Domain Name Server (DNS)
	200 KBytes	monitoring program configuration
	200 KBytes	other (network number lists, contact information, etc.)

It is interesting that this amount of data can comfortably fit in current palmtop computers such as the HP95LX, with room left over for programs to access the data efficiently.

We use the DNS to record the host name, IP address(es), and MX information. In addition, we optionally record the host and operating system types (very generic), the physical location if available, and the device's maintainer if different from that of the department. As comments, we record directives to a program that generates the bulk of the monitoring program configuration.

Our network contains about 10,000 hosts. The entered data thus averages about 140 bytes per host. (Much of this large size is the result of inefficient coding in the DNS.)
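The per-host figure follows directly from the sizes listed earlier. A small sketch of the arithmetic, assuming 1 KByte is counted as 1000 bytes (the variable and function names are illustrative):

```python
# Entered-data sizes from the breakdown above.
# Assumption: 1 KByte = 1000 bytes and 1 MByte = 1000 KBytes.
KBYTE = 1000
ENTERED_DATA_BYTES = {
    "dns": 1000 * KBYTE,
    "monitoring_config": 200 * KBYTE,
    "other": 200 * KBYTE,
}
HOST_COUNT = 10_000


def bytes_per_host(entered_bytes, host_count):
    """Average number of entered bytes per host."""
    return sum(entered_bytes.values()) / host_count


print(bytes_per_host(ENTERED_DATA_BYTES, HOST_COUNT))
```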

We are now using derived data to manage the AppleTalk network. We are also about to expand this type of data.

But Wait, There's More

If we were to stop here, our system would be a failure. Like a flying buttress with no cathedral to lean on, it would fall over and we would have chaos.

The design of our network management system follows from the design of our actual network. To return to the original thesis: if the network itself is muddled, the network management system will reflect that mess by becoming complex. While one can improve the network management system to make the messy network survive, it cannot fix the disorder itself.

The purpose of a network is to move data reliably from one node to another. A network failure happens when the network does not fulfill its purpose. Failures are measured from the users' perspective. Failures are traditionally characterized by two parameters:

The Mean Time Between Failures (MTBF) measures how often a failure is observed.
The Mean Time To Repair (MTTR) measures how long it takes to rectify the failure.

We have selected the following values:

MTBF of 1 year
MTTR of 2 hours

These values were selected because we believe that they correspond to our users' desires. This is somewhat of an educated guess during our current network transition phase at the University. If asked, we believe that most of our users would say that they do not rely on the network and that it is a luxury. As such, it wouldn't matter if it went down or for how long it was down. Yet, we have not had a major network failure in recent memory. We believe that if we were to have such a failure, our users -- and we -- would quickly find out that they do not consider the network a luxury. A large-scale network failure probably will have the same consequences as a large-scale power failure (i.e., bring operations to a halt).

The following sections will describe the techniques we use to meet these design goals.

Define the Service That Is Being Offered

If you can't define your service, you can't measure its reliability. Our service is the "reliable," end-to-end delivery of packets from the originating node to the recipient. The hardware level interface is Ethernet or IEEE 802.3, at either an AUI or 10BaseT interface. The software level interface is any of TCP/IP, DECNET, AppleTalk Phase II or Novell IPX, with plans to add ISO/OSI in the future.

Reliable is in quotes in the above definition because it is not used to mean "100%," but adequate reliability as required by each protocol. For TCP/IP, for example, even a 2% or so failure rate at the packet level does not interfere with delivering reliable service to the user.

Keeping Failures From Happening (MTBF of One Year)

Failures can occur anywhere in the network. There are two ways to reduce the number of failures: either reduce the number of network components or reduce the failure rate of each component.

Given the typical size of buildings, space between buildings, and the number of network nodes, there isn't too much that can be done about the number of components. A single, physical network can only reach a tiny fraction of the existing nodes. Thus, multiple physical networks must be joined into a larger network by active components.

The next step is to reduce the effective failure rate of each component. This reduction is obtained by selecting reliable components and minimizing stress on each component.

For active components, we look at the observed failure rate and only choose components that are reliable. We minimize stress on these components by careful installation, minimal disturbance (e.g., locked rooms), limiting environmental stress (e.g., air conditioning), and using UPS power.

For passive components, we install them carefully, in full accordance with network specifications, and use conservative network designs. For example, we keep network segments short and put only a few devices (preferably two) on each network segment (e.g., one host per twisted pair hub port).

Network control is another area of concern. Where possible, the active devices (e.g., routers) are configured only to accept control information from each other or a secure command area. The computers that hold the network management data are configured to be secure and all changes in network management data must be made from those systems. (Network information is available on a read-only basis to many other sites.)

We will close this section by reviewing a typical cross-section of the network between a user and the server that they are using.

user's machine (Macintosh)
LocalTalk network
Shiva FastPath
Ethernet network
bridge
Ethernet link segment
10BaseT hub
Ethernet link segment
Cisco IGS/L router
fibre Ethernet link segment
Cisco AGS+ router
FDDI ring
Cisco AGS+ router
fibre Ethernet link segment
Cisco IGS/L router
Ethernet network
server host (mainframe computer)

There are 15 network elements between the user's computer and the server. (We are not responsible for the reliability of any of the end nodes.) If the user is to observe no more than one failure per year on this network, the failure rate e of each component must satisfy:

	1/year >= 1 - (1 - e)^15

or about once in fifteen years (for small e, the right-hand side is approximately 15e). Now, fifteen years is about 130,000 hours, and active equipment is usually rated at no more than 50,000 hours MTBF. Thus, if the passive elements (cabling) are considerably more reliable than necessary, we just might make our goal.
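The estimate can be checked with a short calculation. The sketch below assumes independent components with constant failure rates connected in series, so that the rates simply add; that model and the function names are assumptions of the sketch rather than something stated in the text. Counting seven active devices in the path (FastPath, bridge, hub, and four routers) is likewise this sketch's reading of the list above.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def required_component_mtbf_hours(target_failures_per_year, component_count):
    """MTBF each component needs if all share the failure budget equally."""
    return HOURS_PER_YEAR * component_count / target_failures_per_year


def series_mtbf_hours(component_mtbfs_hours):
    """MTBF of a chain that fails whenever any one component fails."""
    total_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_rate


needed = required_component_mtbf_hours(1.0, 15)
all_at_rating = series_mtbf_hours([50_000] * 15)
active_only = series_mtbf_hours([50_000] * 7)

print(f"needed per component: {needed:.0f} hours")
print(f"15 components at 50,000 hours: {all_at_rating:.0f} hours")
print(f"7 active devices, perfect cabling: {active_only / HOURS_PER_YEAR:.2f} years")
```

Under these assumptions each component needs roughly 131,000 hours. Fifteen components rated at 50,000 hours give a path MTBF of only about 3,300 hours, while seven active devices with flawless cabling give about 7,100 hours, or roughly 0.8 years, which is why the goal is reachable only if the passive elements contribute almost nothing to the failure rate.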

Fixing Failures Quickly (MTTR of Two Hours)

No matter how carefully a network is designed and installed, it will still fail. Therefore, it is important that any failures are repaired quickly. How quickly? With 10,000 nodes, an MTBF goal of one year, and an MTTR goal of two hours, the network has an annual failure budget of 20,000 node-hours. This is a large enough value to allow us leeway to solve a few tough problems -- so long as most problems are solved quickly.

One thing we can't afford is an intermittent failure. Such failures are difficult to diagnose and resolve. It is not unusual for one to take a week to solve. If only ten nodes are affected, a single failure would eat up over eight percent of our failure budget.
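The budget arithmetic is simple enough to write down. A minimal sketch, with illustrative function names, that reproduces the figures quoted above from the stated goals (one failure per node per year, two hours to repair):

```python
HOURS_PER_WEEK = 7 * 24


def annual_failure_budget(node_count, failures_per_node_per_year, mttr_hours):
    """Node-hours of downtime per year allowed by the MTBF and MTTR goals."""
    return node_count * failures_per_node_per_year * mttr_hours


def budget_fraction(nodes_affected, outage_hours, budget_node_hours):
    """Share of the annual budget consumed by a single outage."""
    return nodes_affected * outage_hours / budget_node_hours


budget = annual_failure_budget(10_000, 1, 2)
share = budget_fraction(10, HOURS_PER_WEEK, budget)
print(budget, f"{share:.1%}")
```

A week-long outage on ten nodes costs 1,680 node-hours, which is the "over eight percent" mentioned in the text.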

The way to avoid intermittent failures is in network design. One characteristic of an out-of-specification network is its failure mode. Such a network will appear to work perfectly under low load conditions. Yet, once network load exceeds a threshold value, the network fails. As this threshold can be exceeded for periods of only milliseconds at a time, the resulting effect is one of sporadic, inexplicable failures. This threshold value varies depending upon the exact way in which the network fails to meet specifications. The only way to avoid this class of problem is by strict adherence to network standards.

Still, even network standards make assumptions, and these assumptions may not be true. Amplifiers and receivers age and lose power, cable joints may be imperfect, cable lengths and transceiver spacing (for thick and thin net) may not be correct, interference may be present, network interfaces may not follow specifications, and so forth. For these reasons, we design as follows:

If the network medium cannot be directly monitored (e.g., thick and thin net), stay well under network standards. For example, keep thin net segments to 100 m.

If the network medium can be directly monitored, specifications may be met exactly or even exceeded slightly. This will only occur on 10BaseT and fibre Ethernet link segments where we have SNMP-compliant equipment on both ends of the link. With this configuration, we can (and do) directly track the error rate of the link and can tell if we have current problems or project future problems on the link.

With intermittent failures minimized, regular failures can still cause problems. Given that the failure has happened, we:

Keep the scope of any failure as local as possible. Our initial goal is that any failure should affect at most a single building. This constraint helps in two ways:
1. Fewer people are affected, therefore our failure budget is used up more slowly.
2. Limiting the scope also limits the possible causes, thus allowing us to locate the failure more quickly.
Use only equipment that is "too simple to fail" (e.g., a cable) or smart enough that it can be interrogated with SNMP. Other active equipment types (e.g., repeaters and bridges that don't speak SNMP) are not used. This constraint:
1. Helps turn intermittent failures (which are difficult to diagnose) into hard failures.
2. Often allows us to pinpoint the cause of the failure remotely.
We try to "never backtrack." The network is designed to grow by adding or upgrading equipment. This means that, from time to time, we find ourselves specifying equipment that is "too big" for the current need and thus we are tempted to cut corners. We resist, as we know that we will later spend more time and energy undoing the "temporary" equipment than was saved.
We use standard "cookbook" network design. This constraint has many benefits:
1. There is no need of a detailed component-by-component network map. The cookbook is small enough that everyone can know it by heart.
2. We learn these configurations thoroughly, and can quickly transfer learning from one person or installation to another.
3. We minimize the amount of inventory that we must stock and that a repair person must take into the field. This means that we are unlikely to be out of stock on a component. Such stock outage would imply a long down time.

Summary

Large, complex network management systems are outdated. Having one is a sure sign that your network is not designed properly. Instead, your network management system should be "lightweight" and have a minimum of data that you must enter. Other information is derived from this data and from the operating network.

If your existing network requires a complex network management system, a necessary part of any solution is to redesign your network according to a small set of design principles. These principles follow from the service definition and the MTBF and MTTR figures.

The network design now being implemented at the University of Minnesota shows that it is possible to achieve a lightweight network design.