Dealing with SLAs and Visitors from ‘Outlier’ Space

Those in the information technology (IT) field often encounter service-level agreements (SLAs), which define the performance a customer can expect from a particular process or service, such as a help desk. Often, these agreements are established by negotiation rather than by a more scientific approach that would be used in a Six Sigma process (where SLAs would be termed specification limits).

There is an approach for setting SLAs that can help avoid most common mistakes and that can be applied in many situations without committing to full-blown Six Sigma training and deployment. The simplified illustration included here does not assume the reader is statistically trained, and it is configured with a smaller set of data than would be expected in actual practice. In the interest of simplicity and clarity, certain technicalities and qualifications that could apply in rare circumstances have not been addressed.

Basic Terminology and Concepts

Every process exhibits variability — sometimes things get done quickly, and sometimes they do not. The time it takes to drive to work, response time to display a web page or the length of the teller line at the bank are all everyday things that vary. Understanding variability is essential to establishing SLAs that are meaningful to the customer and achievable by the provider.

The amount of variability inherent in a process can be measured by the standard deviation (if the elements of data are “normal” — an issue to be examined later). There is no need to go into the underlying math to understand the concept and power of this measure. At its most basic level, standard deviation is simply a value that tells how much of the variation is contained within a certain range of process performance, as specifically enumerated in Table 1.

 Table 1: Understanding variability.
 Number of Standard Deviations    Percent of Outcomes Included
 1                                68.26%
 2                                95.45%
 3                                99.73%
 4                                99.99%
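
For readers who want to see where the percentages in Table 1 come from, the following is a minimal sketch that computes them for a normal distribution with Python's standard library. The function names are illustrative and not part of the original article, and the results agree with the table to within rounding.

```python
import math


def percent_within(k: float) -> float:
    """Percent of a normal distribution within +/- k standard deviations of the mean."""
    return 100.0 * math.erf(k / math.sqrt(2.0))


def expected_outside_per_thousand(k: float) -> float:
    """Expected number of outcomes per thousand falling outside +/- k standard deviations."""
    return 1000.0 * (1.0 - percent_within(k) / 100.0)


for k in (1, 2, 3, 4):
    print(f"{k} standard deviation(s): {percent_within(k):.2f}%")

print(f"Outside 3 standard deviations: "
      f"{expected_outside_per_thousand(3):.1f} per thousand")
```

The last line gives the "fewer than three outcomes per thousand" figure associated with three standard deviations.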

Three standard deviations (also called 3σ) means that fewer than three outcomes per thousand opportunities, or executions of the process, are expected (statistically speaking) to fall outside the numerical range defined by plus or minus 3σ. Here is what this means in a particular process:

Assume a business is running an IT help desk, and the business has a set of historical data on how many days it takes to close cases. The business wants to establish an SLA for days-to-close-cases with its customer. To get a first approximation, the business might use a basic tool such as “descriptive statistics” to find the range of values in its data set — the average, or mean, and the standard deviation.

In the particular data set being used, the minimum value is zero days, the maximum is 9 days, the average is 1.69 days and the standard deviation is 1.74 days. With one important qualification (to be addressed in a moment), this information tells us that 68.26 percent of the time a case will be closed in not more than 3.4 days (1.69 + 1.74). Similarly, 99.73 percent of the time a case will close within 6.9 days (1.69 + 3 x 1.74). So, at first glance, the business might be inclined to set its SLA to guarantee that 99.7 percent of cases will close within 6.9 days.
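
The arithmetic above can be sketched in a few lines of Python. This is an illustration only: the helper names `describe` and `upper_limit` are hypothetical, the small sample passed to `describe` is made up because the article's raw data is not reproduced here, and the mean and standard deviation passed to `upper_limit` are the summary values quoted in the text.

```python
import statistics


def describe(days):
    """Basic descriptive statistics for a list of days-to-close values."""
    return {
        "min": min(days),
        "max": max(days),
        "mean": statistics.mean(days),
        "stdev": statistics.stdev(days),  # sample standard deviation
    }


def upper_limit(mean, stdev, k):
    """Candidate SLA threshold: the mean plus k standard deviations."""
    return mean + k * stdev


# Hypothetical sample, only to show the shape of the summary.
print(describe([0, 1, 1, 2, 3]))

# Summary values quoted in the article: mean 1.69 days, standard deviation 1.74 days.
print(round(upper_limit(1.69, 1.74, 1), 2))  # 3.43
print(round(upper_limit(1.69, 1.74, 3), 2))  # 6.91
```

Strictly speaking, the percentages in Table 1 describe the two-sided range around the mean; the article reads them as an upper bound to keep the illustration simple.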

The qualification: The foregoing logic gives correct results only if the data is normally distributed. That is, when the data is charted, it produces the familiar bell-shaped curve, with about one-half of the data points above the mean and one-half below the mean.

Looking for Normally Distributed Data

The data in Figure 1 clearly shows that this condition is not satisfied. The data is skewed so that there is a long “tail” to the right. The visual impression is confirmed by a statistical test (the Anderson-Darling Normality Test, not shown), which indicates there is less than a 0.5 percent chance that this data is, in fact, normal.
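
Short of running a formal test, a rough screen for this kind of skew can be sketched with the standard library. The example below is not the Anderson-Darling test. It computes the sample skewness and the share of points above the mean, which should be near 0 and near one-half respectively for symmetric data. The function name and the sample values are hypothetical.

```python
import statistics


def skew_screen(values):
    """Return (skewness, share of values above the mean).

    Assumes the values are not all identical (standard deviation > 0).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    skewness = sum(((v - mean) / sd) ** 3 for v in values) / n
    share_above = sum(1 for v in values if v > mean) / n
    return skewness, share_above


# Hypothetical days-to-close sample with a long right tail.
sample = [0, 1, 1, 1, 2, 2, 2, 3, 3, 9]
skewness, share_above = skew_screen(sample)
print(f"skewness = {skewness:.2f}, share above mean = {share_above:.0%}")
```

A clearly positive skewness together with well under half of the points above the mean suggests a right-hand tail of the kind described here.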

Visual examination reveals that there is one data point at about 9 days. This is a lot different than any of the other data, so the business must wonder why. This sort of data, commonly called an outlier, is probably the result of what is known as an assignable cause.

Upon investigation, the business finds that the case was actually completed in two days but was not closed because the system allows a case to be closed only by the person to whom it was assigned. In this instance, that person was out sick for a week, so the case did not close until she returned.
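
The article spots the outlier visually. One common way to flag such candidates automatically is Tukey's interquartile-range rule, sketched below. This is an assumption about how one might do it, not the method used in the article, and the sample data is hypothetical. A flagged value is only a prompt to look for an assignable cause, not a reason to delete the point.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical days-to-close sample containing one 9-day case.
sample = [0, 1, 1, 1, 2, 2, 2, 3, 3, 9]
print(flag_outliers(sample))  # [9]
```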

 Figure 1: Original data.

The business corrects the data and then takes another look at the overall distribution (Figure 2). It is still not normal. There seem to be two peaks in the data — one at about one day and another at about six days. This pattern is known as a bimodal distribution and is often an indication that the data actually represents more than one process, rather than the single process the business might have assumed initially.

 Figure 2: Outlier removed.

When investigating this bimodal pattern, the business takes a careful look at the subset of the data that took six days to close. It discovers that all of the cases relate to data corruption problems, which are not handled by the help desk. Cases of this type are given to the data management group and are handled by a process that is very different from other calls. Hence, the business decides to exclude these from the data set used to establish the SLA for all other types of calls, which gives the business the data set shown in Figure 3.
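
Separating the data by case type, as described above, can be sketched as follows. The record layout (a case type paired with days to close), the type labels, and the sample values are all hypothetical. The point is only that each process gets its own mean and standard deviation.

```python
import statistics
from collections import defaultdict


def summarize_by_type(cases):
    """Group (case_type, days) records and summarize each group separately."""
    groups = defaultdict(list)
    for case_type, days in cases:
        groups[case_type].append(days)
    return {
        case_type: {
            "count": len(days),
            "mean": statistics.fmean(days),
            # The sample standard deviation needs at least two points.
            "stdev": statistics.stdev(days) if len(days) > 1 else 0.0,
        }
        for case_type, days in groups.items()
    }


# Hypothetical records: (case type, days to close).
cases = [
    ("general", 1), ("general", 2), ("general", 1), ("general", 0),
    ("data_corruption", 6), ("data_corruption", 6),
]
for case_type, summary in summarize_by_type(cases).items():
    print(case_type, summary)
```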

 Figure 3: Corruption problems excluded.

Although strictly speaking this data is still not normal, the range and distribution are now quite tight, and the business can use standard deviation to set its SLA without the risk of drawing the wrong conclusions.

Creating Reasonable SLAs

Consequently, the business proposes to set the SLA for days-to-close-cases not related to data corruption at 2.3 days (1.3 mean + 0.96 standard deviation). That is far different from the initial impression of 3.4 days (1.69 + 1.74).

In this example, in the end, the business created two different SLAs — one for data corruption cases and one for other types. This simple approach, which can be applied to many types of historical data, will lead to SLAs that are more meaningful and useful than those produced by negotiation.

This article was originally published in CTQ Media’s iSixSigma Magazine, http://www.isixsigma-magazine.com/.