ST-Log ISSUE 35b / SEPTEMBER 1989 / PAGE 50

SOFTWARE ENGINEERING

THE SOFTWARE METRIC SYSTEM

BY KARL E. WIEGERS

How quickly can you write computer programs? Do you program just as fast in one language as another? Are you getting faster? How good are the programs you write? Is there a way to tell how good is "good"? Are you getting better? How large are your programs? How complex?

If you have answers for all (or even any) of these questions, you should be writing this series, not reading it. These questions all pertain to measurement: size, speed, quality. These are characteristics of software production, as they are characteristics of any creative or manufacturing process. However, people are still grappling with good ways to "measure" software. As an introduction to the topic of software quality-assurance, let's spend a few pages thinking about software-measurement techniques, also known as "software metrics".

Why measure?

Recall from earlier installments in this series that the major challenge facing us is the ability to produce high-quality software rapidly enough to meet an ever-increasing demand. The key concepts are quality, reliability and productivity. We've talked about methodologies for improving all of these factors, including structured analysis and system specification, structured system design, structured programming, structured testing and computer-aided software engineering (CASE).

But a critical question remains: How can I tell if the quality and productivity of my software development efforts are improving? This is where the metrics come in. We need some way to assess software quality, such as the number of errors per quantity of code. Productivity usually reflects the quantity of a product created per unit effort. Reliability is indicated by the number and severity of failures per unit time. This still leaves some questions, such as how to measure the quantity of software created in a project, what constitutes an error, and how to quantify the effort that goes into a software project.

There are several benefits to being able to measure our software production in these ways. A chronic problem is that software systems are delivered to the customer much later than expected. Good project management demands reliable methods for estimating the size, complexity and work effort needed for a software system. We can only base such estimates on our previous experience, so collecting software metrics can help us better estimate the completion time, cost and resources required for the next project. This is extremely important for business planning.

From a technological viewpoint, it's important to know which of the various methodologies aimed at improving productivity are providing the best payback. If the CASE tool you're using doesn't seem to be increasing your productivity as much as you'd like, you might want to spend your money on some other kind of programming assistance. This assessment is based squarely upon meaningful methods for measuring productivity. Software metrics can also help assess the magnitude of our maintenance burden by tracking the cost, size and impact of changes in existing systems.

PHOTO: ELLEN SHUSTER

The problem

It's much more difficult to measure the manufacture of software than it is for hardware (by which I mean almost anything besides computer programs). Let's use the term "widget" to refer to any kind of non-software product. What are some differences between widgets and software?

A widget is some kind of object, but what is a "software," or even a "program"? You can't count the software equivalent of widgets directly. Also, it's difficult to apply to software the statistical quality-control measurements people do with widgets. Defects in widgets are usually attributable to flaws in materials or production. There are ways to spot the defective widgets before they are sold and to decide how to fix a poor process. But all copies of a computer program are identical (unless there's a hardware flaw, like a bad floppy disk). The defects in software are intrinsic in the product, and they're usually hard to detect.

Given all these problems, let's see some methods people have devised for measuring different aspects of the software development process.

How much?

Lines of Code. We'll begin by discussing how to measure the size of a software system. The number of lines of code (LOC) has long been used for this purpose. LOC has the advantage of being an objective measure that is fairly easy to determine, but there are several shortcomings.

One question is simply the definition of a "line of code." Should you count comment lines in the source file? What if you have several logical statements on the same physical line in the file (a poor programming style, but you may remember it from older BASICs)? How should you deal with continuation statements, permitted by some languages, in which one logical line is split over several physical source lines? Should statements like variable type declarations in C or Fortran count the same as actual logic or computation statements? What about lines of text in a help display?

Perhaps the simplest measure for LOC is to count each nonblank, non-comment logical source statement in the file as a single line of code. Continued statements count as one LOC, no matter how many physical lines they occupy in the source file.
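
To make that rule concrete, here's a minimal sketch of such a counter, written in C for C source files (my choice; the rule itself is language-neutral). It treats each statement-terminating semicolon outside comments and string literals as one logical line, which automatically handles continued statements.

/* loc.c -- a rough logical-LOC counter for C source files: it counts
   semicolons that fall outside comments and string or character
   literals, so a statement spread over several physical lines still
   counts once.  It's only an approximation: declarations count like
   any other statement, each semicolon in a for(...) header is
   counted, and preprocessor lines (which usually have no semicolon)
   aren't counted at all. */

#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp;
    int c, prev = 0;
    long loc = 0;
    enum { CODE, COMMENT, STRING, CHARLIT } state = CODE;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: loc file.c\n");
        return 1;
    }

    while ((c = getc(fp)) != EOF) {
        switch (state) {
        case CODE:
            if (prev == '/' && c == '*') { state = COMMENT; c = 0; }
            else if (c == '"')  state = STRING;
            else if (c == '\'') state = CHARLIT;
            else if (c == ';')  loc++;
            break;
        case COMMENT:
            if (prev == '*' && c == '/') { state = CODE; c = 0; }
            break;
        case STRING:
            if (c == '"' && prev != '\\') state = CODE;
            break;
        case CHARLIT:
            if (c == '\'' && prev != '\\') state = CODE;
            break;
        }
        prev = c;
    }
    fclose(fp);
    printf("%ld logical lines of code\n", loc);
    return 0;
}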

For a given programming language, using lines of code to compare the relative sizes of program modules is usually okay. However, if one module is longer than another, you can't tell whether the longer one is more complex than the other or just less efficiently coded.

We run into more problems when we try to compare modules written in different languages. An assembly language program generally takes several times as many source statements to accomplish a task as does a higher-level language, which is why we use the latter whenever we can. To be sure, the assembly statements are usually much shorter, but we haven't considered the complexity of each line of code yet. In fact, this is another shortcoming of the LOC metric: All source statements are not created equal.

Some new languages don't even involve lines of code in the conventional sense. Fourth-generation languages (4GLs) often involve some kind of query language by which users can access databases without having to write all their own routines for handling files, presenting information on the screen, performing calculations, and so on. If you develop an application using a 4GL, you may end up with few actual "lines of code" in a system that performs some complex tasks. It's not meaningful to compare LOC in a case like this with a system written using a 3GL, like C, Pascal or BASIC. Obviously, LOC as a software size metric has some limitations.

Number of Modules. Another way to estimate the size of a software system is by counting the number of modules it contains. Recall that we define a module as a named, callable block of code. Again, however, this metric doesn't consider the complexity of different modules in a system, or of modules in different systems. Nor does it consider pre-existing modules that were reused in this application.

The modules I write in high-level languages average about 30-50 lines of code (according to the definition I gave a few paragraphs back). In contrast, I recently heard about a huge software system written in PL/I, totalling about one million lines of code and about 1,000 modules. This works out to an average of around one thousand LOC/module. (Naturally, I have no idea how those folks defined "line of code.") Simply counting modules in these two cases isn't a good comparator of system size.

But to a first approximation, especially if the same software developer is involved in each case, we can compare the size of software systems by counting the number of modules, number of reused modules, the average lines of code per module, the number of lines of help text and the number of elemental pieces of data involved in the system. This last notion gets back to the system data dictionary we've discussed in earlier articles. The October 1988 issue of ST-LOG might be worth reviewing if you're fuzzy on data dictionaries.

Function Points. Some years ago, A. J. Albrecht at IBM came up with another scheme for estimating the size and complexity of a software system, one that attempts to circumvent the language dependency. This method is based on the fact that every program performs an assortment of specific functions. Some of these will be simple, and some more complex. Albrecht's method counts the "function points" contained in a software system. Albrecht's original papers would be hard to find, but you can read about the function-point method in Roger Pressman's book, Software Engineering: A Practitioner's Approach, 2nd Ed., McGraw-Hill, 1987 (pp. 91–94). Since the function-point metric seems to have broad applicability, let's talk about the method a little bit.

To tally the function points in your software system, you go through four steps. First, count the number of instances of five different components: external (user) inputs; external outputs; external inquiries; logical internal files; and external interface files. Table 1 defines these five component types. Next, classify every instance in each of these classes as to its relative complexity: low, average or high. Albrecht's papers contain guidelines for the complexity classification based on the number of individual data elements and the number of file types referenced in each component. We won't worry about the details now.

The third step involves weighting the instances of each component type for complexity. Table 2 shows the weighting factors used. Let's suppose your system contains five external inputs of low complexity, three of average and one of high complexity. Applying the weighting factors from the first row of Table 2 gives this equation:

(5*3) + (3*4) + (1*6) = 33

This means that your system contains 33 function points from external input-type components. Perform the same calculation for the other four component types, applying the appropriate weighting factors from Table 2, to come up with the "unadjusted function points" for your system.
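
If you keep the counts in a small table, the arithmetic is easy to automate. Here is a minimal sketch in C (my choice of language, not the article's) using the weights from Table 2; the instance counts are purely hypothetical, and you would substitute the tallies for your own system.

/* ufp.c -- unadjusted function points, using the weights in Table 2.
   counts[i][j] holds how many instances of component type i fall
   into complexity class j; the values shown are hypothetical. */

#include <stdio.h>

int main(void)
{
    const char *type[5] = { "External Inputs", "External Outputs",
                            "External Inquiries", "Logical Internal Files",
                            "External Interface Files" };
    const int weight[5][3] = {          /* low, average, high (Table 2) */
        {3, 4, 6}, {4, 5, 7}, {3, 4, 6}, {7, 10, 15}, {5, 7, 10} };
    int counts[5][3] = {                /* hypothetical instance counts */
        {5, 3, 1}, {2, 2, 0}, {4, 1, 0}, {1, 2, 0}, {0, 1, 0} };
    int i, j, subtotal, ufp = 0;

    for (i = 0; i < 5; i++) {
        subtotal = 0;
        for (j = 0; j < 3; j++)
            subtotal += counts[i][j] * weight[i][j];
        printf("%-26s %3d\n", type[i], subtotal);
        ufp += subtotal;
    }
    printf("Unadjusted function points: %d\n", ufp);
    return 0;
}

The first line of output reproduces the 33 function points computed above for the external inputs.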

The final step is to compute a fudge factor, which you'll apply to the unadjusted function-point count to determine the final number of function points. This adjustment is based on the 14 factors listed in Table 3. You assign a numeric value from 0 (no effect in your system) through 5 (strong effect in your system) to each of these 14 factors. Then add up the numeric values for all 14 factors, divide by 100 and add the result to 0.65. This is your adjustment factor, which will range from 0.65 through 1.35. An "average" system is imagined to have an adjustment factor of 1.00, corresponding to an average value of 2.5 for each of the 14 items. In other words, our fudge factor can change the raw function-point tally by plus or minus 35%.

Multiply this adjustment factor by the unadjusted function-point count, and there you have the final, adjusted function-point count for the system. The equation below summarizes the computation:

adjusted FPs = unadjusted FP count * [0.65 + (0.01 * sum of the 14 adjustment values)]

Whew! I agree that the function-point calculation is tedious. But in practice it works out well. After you've gone through the computation once, you'll get the hang of it. Function points do provide a measure of system size and complexity that is independent of language. In fact, you can even estimate the function points for a system at the design stage, which is helpful for estimating completion times for the project if you know how long it takes you to generate one function point worth of code. Which brings us to the next class of software metrics.
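
To finish the example, here's an equally small sketch (again in C, and again with made-up numbers) that applies the 14-factor adjustment to an unadjusted count. Each rating scores one item from Table 3 on the 0-to-5 scale.

/* afp.c -- apply the 14-factor adjustment to an unadjusted
   function-point count.  The ratings are hypothetical; the
   unadjusted total of 101 is carried over from the previous sketch. */

#include <stdio.h>

int main(void)
{
    int rating[14] = { 3, 0, 4, 2, 3, 5, 3, 4, 2, 0, 1, 3, 0, 2 };
    double unadjusted = 101.0;
    int i, sum = 0;
    double adjustment;

    for (i = 0; i < 14; i++)
        sum += rating[i];

    adjustment = 0.65 + 0.01 * sum;      /* always between 0.65 and 1.35 */

    printf("Sum of ratings:    %d\n", sum);
    printf("Adjustment factor: %.2f\n", adjustment);
    printf("Adjusted FPs:      %.1f\n", unadjusted * adjustment);
    return 0;
}

With these ratings the sum is 32, the adjustment factor is 0.97 and the system weighs in at about 98 function points.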

How fast?

Now we get to the question of how to measure productivity. We want to be able to measure it because we have a stated goal of improving our productivity. Unless we know where we are, we really can't tell if we're making any progress.

"Productivity" can be defined as the ratio of product created to the effort expended in creating it. In the preceding section we looked at ways to estimate the quantity of software created for a particular project. Despite their limitations, we'll have to use one or another of those metrics for the numerator in any productivity calculation.

One obvious measure of effort expended is the time spent on a software project. This is surprisingly difficult to measure accurately. First, of course, we have to define the start and finish points. The software engineering philosophy suggests that the official beginning of a project might be the time at which you sit down to write the statement of purpose for the system you are about to build.

The ending point is more obscure. An arbitrary definition is the time at which the product (documentation and all) is delivered to the customer. This doesn't mean that no more work will be done on the project after delivery. Rather, this definition simply identifies a boundary between the system-development process and the infinitely long maintenance phase. You're probably interested in measuring both development work effort and maintenance work effort, so it makes sense to draw a line between the two.

Another consideration is how to count hours during the development period spent on activities other than working directly on the project. We can think of two kinds of development time. "Gross" time includes all work hours (weeks, months, years) between the beginning and end of the project, including meetings, coffee breaks, vacations and chitchat. "Net" time counts just the hours devoted explicitly to project activities. The difference between gross and net time is the necessary overhead for having human beings perform a task. The employer has to pay for gross time, but he might be able to charge the customer only for net time. Because of such nightmares, I've avoided the business aspects of software development as much as possible.

Let's assume we can measure the time spent on a project to our satisfaction. Now we can talk about productivity. A simple measure is lines of code written per day. Before you get too excited about this metric, remember the shortcomings of the LOC measure itself. You can expect to see big differences in LOC/day for programs written in different languages.

A longtime software industry benchmark is that an average computer programmer can generate only 10–15 lines of debugged code not per hour, but per day. This seems abominably low, but by the time you factor in the time spent on system analysis and design, progress reviews, testing and documentation, it turns out to be sadly accurate.

Alternatively, we could calculate productivity by considering function points to be the measure of product created. Therefore, function points created per unit time (day, week, month) is a productivity metric. We would expect this metric to depend less on the language used than does the LOC/time metric.

The opposite of productivity is cost. If you imagine cost to be proportional to time (as in hourly salaries for programmers), we can get cost metrics by taking the reciprocals of our productivity metrics. These might be hours/line of code or days/function point, which could be translated into $/FP or $/LOC if you have a cost/hour figure available.
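
As a quick illustration of that reciprocal arithmetic, here's a tiny sketch in C; every rate in it is hypothetical, including the assumption of an eight-hour day and a five-day week.

/* cost.c -- turn productivity figures into cost figures by taking
   reciprocals.  All of the rates below are hypothetical. */

#include <stdio.h>

int main(void)
{
    double loc_per_day      = 12.0;   /* productivity in LOC/day */
    double fp_per_week      = 3.0;    /* productivity in FP/week */
    double hours_per_day    = 8.0;
    double days_per_week    = 5.0;
    double dollars_per_hour = 30.0;

    double hours_per_loc   = hours_per_day / loc_per_day;
    double dollars_per_loc = hours_per_loc * dollars_per_hour;
    double days_per_fp     = days_per_week / fp_per_week;
    double dollars_per_fp  = days_per_fp * hours_per_day * dollars_per_hour;

    printf("%.2f hours/LOC, $%.2f/LOC\n", hours_per_loc, dollars_per_loc);
    printf("%.2f days/FP,   $%.2f/FP\n", days_per_fp, dollars_per_fp);
    return 0;
}

At those made-up rates, a line of code runs about $20 and a function point about $400.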

You may think this notion of cost doesn't apply if you're a computer hobbyist working on your own rather than a cog in a corporate wheel. But it does! Your spare time isn't really "free" time; it's valuable to you. What if you're spending an hour a day at the Atari keyboard when you have the option of working overtime at your real job? Don't tell me there's no cost associated with that.

What do you do once you've selected a productivity metric? The first step is to apply it to your current or most recent projects. The idea is to build a baseline of your current productivity status. Then track the productivity metrics for your future projects and compare them to the baseline. If you see your productivity increasing or your cost decreasing, congratulations; you're becoming a more efficient software developer.

A decrease in productivity could mean lots of things, so analyze it before you panic. Are you confident that your baseline measure is accurate? Are you working in a new environment (unfamiliar computer or new language)? If you're in a team environment, are there new members on the team who aren't fully up to speed yet? Are the metrics you're using the most appropriate ones for the work you do and the factors you care about?

You can use the productivity measure to determine how changes you've made in your development effort are working out. For example, I recently worked on a project where we used a new CASE tool for system design and a new programming language on the IBM PC. We also experimented with another tool intended for increased productivity: a code generator. After six man-months of effort, I calculated our productivity at, you guessed it, ten lines of code per day! What a disappointment.

But the project wasn't a failure. The learning curve associated with the new environment is bound to cost some productivity. For the CASE tools, I viewed this as an investment that will pay off in spades on future projects, when the learning curve is gone but the benefit remains. More important, I felt that our software engineering approach had resulted in a great improvement in the quality and reliability of the system we created. So despite the apparent lack of productivity gain, I concluded that this project, and the application of the new SE technologies, was a grand success.

For personal programming of the sort most hobbyists do, productivity takes on a slightly different meaning. You probably don't care so much that every hour is spent as efficiently as possible, because you're doing it all for fun. But the notion of productivity can become important if you're trying to justify making further investments in your hobby. Remember, your personal time is worth money. How much time would it save you to have a hard disk or a second floppy drive? How about a more powerful C compiler? Of course, all these arguments fly out the window if you just want to buy a new toy (nothing wrong with that), but they can improve your chances of success when attempting to convince a spouse that you need to spend a couple of hundred bucks.

In the professional world, productivity metrics are important for estimating completion time and the cost of a new project. Suppose that, based on a system specification and preliminary design, you estimate that a new program would comprise 60 function points. You know from collecting metrics that you can generate an average of three function points per week. Thus, you estimate that this project would take about 20 man-weeks to complete. Now you have some meaningful numbers to present to your customer. We'll talk more about software project planning in a future article.
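
The estimate itself is a one-line calculation; here it is as a trivial C sketch, using the figures from the paragraph above.

/* estimate.c -- completion estimate from an estimated size in
   function points and a measured productivity rate. */

#include <stdio.h>

int main(void)
{
    double function_points = 60.0;   /* estimated from the preliminary design */
    double fp_per_week     = 3.0;    /* your measured productivity            */

    printf("Estimated effort: %.0f man-weeks\n", function_points / fp_per_week);
    return 0;
}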

How well?

Another class of metrics addresses the question of software quality: How many defects are contained in our products, when are they discovered, at what point in the development cycle are they introduced, how costly are they to correct and what is their impact on the system? Lots of questions here, but the answers are important to software quality-assurance. If you know where your software errors are arising, you can concentrate your efforts in the right places to minimize the defects.

Software defects

What do we mean by a "software defect"? Basically, I'm referring to some unanticipated and undesired behavior in the system. (Occasionally you'll encounter some unanticipated but desirable behavior, called an "undocumented feature," but those are scarcer than hen's teeth.) You know these better as "bugs."

It turns out that most software defects are errors of omission, not commission. By an error of commission, I mean something like a syntax error or an erroneous algorithm. Modern compilers catch most errors of commission, such as incorrect function or subroutine argument lists, before they slip through. They do have some trouble reading your mind, however, so mistakes in algorithms probably won't be detected. An error of omission means that you've left something out, like a trap for bad input data, a check for a full disk before attempting to write to it, or an ELSE statement in an IF/THEN construct. This sort of mistake is best avoided by following the software-engineering stratagems of systematic design, review and testing.

The software-engineering approach preaches a twofold assault on bugs. Bug prevention is best accomplished by following structured analysis, design and implementation techniques. Early bug detection is facilitated by software quality-assurance efforts, including structured walkthroughs, project reviews and testing. Software quality-assurance will be the subject of a future article.

It's useful to keep track of when in the development cycle defects are identified and at what phase they were introduced. It's much easier and cheaper to correct errors when they are detected early in the development life cycle. As time goes on, undetected errors become better concealed in the thicket of code that grows over the skeleton of design. Also, the range of influence of a particular bug widens as its tendrils penetrate into more parts of the system. The longer this infiltration continues, the more difficult it is to surgically excise the critter without killing the patient.

Defect metrics

Let's assume that you've devised a method for counting defects that show up in your systems either in testing or in the hands of users after the system has been released (horrors!). One way to quantify the defect rate is to count defects revealed per line of code. I hope that you can do better than one error per line, so a more useful measure is defects per thousand lines of code (defects/KLOC). Track this metric over several projects; if you see a decline in the defects/KLOC figure, pat yourself on the back for having attained a quality improvement.

But wait! We're only counting the defects we've spotted. How many more are lurking about that we have not yet encountered? This is impossible to know with certainty, but it's a safe bet that a module already found to contain a large number of errors harbors still more that you haven't uncovered. You should also monitor defects/KLOC as a function of time, since bugs are gradually revealed with continued use of the program. As the maintenance phase continues, the rate of appearance of new bugs normally dwindles.
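
The bookkeeping for this metric amounts to little more than a table of projects. Here's a minimal sketch in C with invented project data, just to show the calculation and the kind of trend you hope to see.

/* dkloc.c -- defects per thousand lines of code for a series of
   projects.  The project names and figures are hypothetical. */

#include <stdio.h>

int main(void)
{
    struct project { const char *name; long loc; int defects; };
    struct project p[3] = {
        { "Project A", 12500, 64 },
        { "Project B",  9800, 37 },
        { "Project C", 15200, 41 }
    };
    int i;

    for (i = 0; i < 3; i++)
        printf("%-10s %6ld LOC  %3d defects  %5.2f defects/KLOC\n",
               p[i].name, p[i].loc, p[i].defects,
               1000.0 * p[i].defects / p[i].loc);
    return 0;
}

In this made-up history the rate falls from about 5.1 to 2.7 defects/KLOC, which is the direction you want it to move.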

Measures commonly used to assess the impact of failures in software (and hardware, for that matter) are the mean time to failure (MTTF), mean time between failures (MTBF) and mean time to repair (MTTR). MTTF is the average time that the system runs properly before it crashes due to a defect; it gives some indication of the density of bugs in the system. MTTR reflects the difficulty of fixing a bug once it is detected, which makes it a rough measure of how hard the system is to maintain. The mean time between failures is the sum of the MTTF and the MTTR. These metrics can be used to estimate software-system reliability.
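
Here's a sketch of how these three numbers hang together, computed from a purely hypothetical failure log with all times in hours.

/* mtbf.c -- MTTF, MTTR and MTBF from a (hypothetical) failure log. */

#include <stdio.h>

int main(void)
{
    /* hours of correct operation before each failure */
    double uptime[4] = { 160.0, 72.0, 210.0, 98.0 };
    /* hours needed to diagnose and repair each failure */
    double repair[4] = {   4.0,  1.5,   8.0,  2.5 };
    double mttf = 0.0, mttr = 0.0;
    int i, n = 4;

    for (i = 0; i < n; i++) {
        mttf += uptime[i];
        mttr += repair[i];
    }
    mttf /= n;
    mttr /= n;

    printf("MTTF = %.1f hours\n", mttf);
    printf("MTTR = %.1f hours\n", mttr);
    printf("MTBF = MTTF + MTTR = %.1f hours\n", mttf + mttr);
    return 0;
}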

How complex?

The final category for software metrics today is module complexity. We've already seen that lines of code and function points tell us something about system size, but not much about the complexity of a particular module. One method was devised by Thomas McCabe. McCabe's complexity measure is based on the notion of representing the control structure of a module (branching and iteration constructs) in the form of a graph and then counting particular features of the resulting graph for each module. We won't worry about the details now.

Table 1.

System component types for function-point analysis.

External Inputs: Count each input by which the user supplies application-oriented data to the system. Each data input screen counts as a single input, even if it has multiple data elements.

External Outputs: Count each output from the system that provides application-oriented information to the user. This could be screen displays, printed reports or error messages.

External Inquiries: Count each kind of request the user can make for the system to do something, such as retrieving and displaying data, responding to a mouse-click or showing a help display. These are not the same as external inputs.

Logical Internal Files: Count each logical grouping of data used within the system; similar to data stores.

External Interface Files: Count each file that is shared with another application, rather than being strictly internal to the present system.

Another computer scientist, M. Halstead, devised equations for quantitatively calculating the complexity of a program module by counting operators and operands. Operators are things like equals signs, comparison symbols (such as <), IF statements and mathematical symbols (+, -, *, /). Operands are the variable names and constants those operators act upon. Both the numbers of unique operators and operands and the total numbers of operators and operands appearing in the module are used in the complexity calculation.

I've never actually used either McCabe's or Halstead's complexity metrics, so I won't go into more detail. The important point is to know that methods for calculating module complexity do exist, should you ever need to do so.
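
If you'd like a taste anyway, here's a crude stand-in for McCabe's measure, sketched in C: for a structured module, his cyclomatic number works out to the number of binary decisions plus one, so simply counting decision keywords gives a ballpark figure. It's only an approximation, not the real graph-based calculation: it ignores compound conditions built with && and ||, ignores the ?: operator, and will happily count an "if" that appears inside a comment or string.

/* cyclo.c -- a crude approximation to McCabe's cyclomatic complexity
   for a C source file: count the decision keywords and add one.
   Keywords inside comments or string literals are (wrongly) counted
   too, so treat the result as a rough indicator only. */

#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(int argc, char *argv[])
{
    FILE *fp;
    char word[64];
    int c, len = 0, decisions = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: cyclo file.c\n");
        return 1;
    }

    while ((c = getc(fp)) != EOF) {
        if (isalnum(c) || c == '_') {
            if (len < 63)
                word[len++] = (char)c;
        } else {
            word[len] = '\0';
            if (strcmp(word, "if") == 0 || strcmp(word, "while") == 0 ||
                strcmp(word, "for") == 0 || strcmp(word, "case") == 0)
                decisions++;
            len = 0;
        }
    }
    fclose(fp);
    printf("Approximate cyclomatic complexity: %d\n", decisions + 1);
    return 0;
}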

The bottom line

We've talked about several metrics that attempt to quantify different aspects of software creation: quantity, productivity, quality and complexity. If you do try to use some of these metrics, it's important to get your definitions straight at the outset. At least this way you can be internally consistent, either for your personal projects or for systems built by all the people in the same organization. It's somewhat dangerous to use these metrics to compare different software developers, since the numbers are so fuzzy and since they ignore other important aspects of software engineering, such as documentation and coding style.

Table 2.

Complexity weighting factors for function-point analysis.

Component Type               Low   Average   High
External Inputs                3         4      6
External Outputs               4         5      7
External Inquiries             3         4      6
Logical Internal Files         7        10     15
External Interface Files       5         7     10

Nonetheless, I feel that anything we can do to quantify the software-creation process will give us a possible handle for improving that process. Software metrics are an important aspect of any serious software-engineering effort.

Table 3.

Adjustment factors for function-point analysis.

  1. Does the system involve data transmitted over communication facilities?
  2. Does the system perform processing on more than one computer?
  3. Is system performance (speed) a critical feature?
  4. Does the system run in an existing, heavily used computing environment?
  5. Is the system designed to handle a high transaction rate?
  6. Does the user enter data online (as opposed to in batch)?
  7. Does the design of an online system emphasize end-user efficiency?
  8. Are the logical internal files updated by online activities?
  9. Does the system involve particularly complex processing (mathematical computations, heavy error-checking)?
  10. Is the code in this application designed to be reusable?
  11. Is the system designed for easy conversion to production and easy installation?
  12. Does the system require reliable backup and recovery operations?
  13. Was the system designed to be used in multiple installations by multiple organizations?
  14. Was the system designed specifically to facilitate change in data or files by the user?

After receiving a Ph.D. in organic chemistry, Karl E. Wiegers decided it was more fun to practice programming without a license. He is now a software engineer in the Eastman Kodak Photography Research Laboratories. He lives in Rochester, New York, with his wife, Chris, and two cats.