Classic Computer Magazine Archive CREATIVE COMPUTING VOL. 11, NO. 1 / JANUARY 1985 / PAGE 121

Choosing and using business forecasting software.

Every forecast you make is a "conditional" statement of what will happen in the future. The forecast depends on what also happens to the surrounding situation--the effect of coffee prices on tea sales depends upon consumers' expectations about future coffee prices, whether there is an excess supply of tea at current prices, and the exchange rate between the United States and tea/coffee producing countries. Every forecast is, then, limited by the "conditioning events" which surround the event to be forecasted.

Assumptions of Every Forecast

While instinct and estimates will always have their place in business (some of us are better than others at following hunches), managers are lately turning toward systematic and objective forms of forecasting. An objective forecast is simply one which results from the forecaster using a model to make the forecast. A model, we will see, is just a compact statement of the way you think things work. Any model used for forecasting today is based on three simple assumptions:

* Future occurrences are based, at least in part, on presently observable events.

* Things will behave in the future much as they have in the past.

* The relationships that have occurred in the past can be discovered by study and observation.

Systematic Forecasting

Systematic forecasting assumes that we can observe the underlying relationships that have occurred in the past by blocking out much of reality and building abstractions (models) which take into account only those things we feel are of prime importance in predicting something. It is no wonder that forecasters are accused of being simplistic and unrealistic--they are! In fact, to be simplistic is the only way to make any sense out of the complex relationships we face in the real world.

Forecasts may consist of predicting amounts, probabilities, or the timing of an event. We may all feel certain, for instance, that man will ultimately fly to Mars. If we were to predict when man would fly to Mars, that would be a "timing forecast." If we were to predict the probability that the Dallas Cowboys would make it to the Super Bowl, that would be a probability forecast. In this article we will deal with neither timing nor probability forecasts.

Here we will concentrate on predicting quantities. This is, by far, the most common form of business forecasting. If your company wants a sales forecast for next quarter, that is a quantity forecast. A cash flow forecast or an inventory forecast would also be a quantity forecast.

The systematic approach to forecasting may take the form of a carefully constructed model which the forecaster builds to mimic a real-world situation in which the assumptions are set up in strict mathematical form; this method is called econometrics. Or the forecaster's approach may be much less rigorous and much more dependent upon intuition and whatever data are readily available. Either approach to forecasting requires essentially the same statistical tools.

Widely Used Forecasting Techniques

While many techniques for forecasting require a thorough study of economics and statistics, we shall present a set of elementary forecasting techniques, most of which are available in the software packages listed in the accompanying comparison chart. We purposely overlook some forecasting techniques (such as using leading indicators or surveys of economic intentions) which, while quite useful, do not directly incorporate computer models.

The techniques we will cover include:

* Linear Regression--a method for using one variable to predict a second variable.

* Multiple Linear Regression--a method for using more than a single variable to predict another variable.

* Time Series Analysis--a way of studying the movement of a variable over time in order to predict its future values.

Simple Linear Regression Model

A problem encountered by almost every manager is how to predict the value of some variable when the forecast variable is assumed to be dependent upon (or caused by) another variable.

For example, assume a carpet manufacturer finds that the number of residential building permits issued in a given quarter is strongly related to the company's carpet sales in the next quarter (this example is taken from the Graph 'n Calc manual).

The variable the manufacturer would like to predict is carpet sales, and the belief is that carpet sales are dependent upon the number of residential building permits issued in the previous quarter.

In statistical terminology, carpet sales is called the dependent variable and permits last quarter is called the independent variable. The objective of the linear regression model is to discover and measure the association between two variables. The usual reason for measuring the association is to aid in predicting one variable (the dependent variable) based upon the value of the other variable (the independent variable).

A standard convention in regression analysis is to use X to represent the independent variable and Y to represent the dependent variable. It is often useful to plot a scatter diagram of the variables to observe any possible relationship. The scatter diagram in Figure 1 is the plot of the information in Table 1 for carpet sales and building permits issued in the previous quarter.

Note that in Figure 1 the known variable--in this case permits issued last quarter--is plotted along the horizontal axis. The unknown variable by convention is plotted along the vertical axis. For the period to which our carpet sales and permit data pertain, both variables are known, but when the regression is used to forecast how large carpet sales will be, only permits, and not carpet sales, will be known.

While carpet sales may vary for many reasons, the manufacturer has indicated that building permits last quarter (that is, lagged one quarter) are believed to be very predictive. Each point in Figure 1 represents the number of building permits issued in a given quarter and the corresponding carpet sales one quarter hence. The line was drawn to fit the points as closely as possible.

The carpet manufacturer could use Figure 1 to forecast carpet sales for the coming quarter if he knew the value of building permits issued this quarter. If, say, 350 permits were issued this quarter the manufacturer could forecast that carpet sales next quarter would be about 550.

The line we drew on the graph "to fit the points as closely as possible" might not, however, be the same one you would draw through this same set of points. Since the points lie near the line but not precisely on it, we can see that permits issued last quarter is not a perfect forecaster of carpet sales (if it were, all points would lie exactly on the line).

Simple linear regression will perform the task of choosing the line that best represents the points according to a decision rule set by statisticians some time ago. We are generally interested in statistical relationships when forecasting. If a statistical relationship exists between Y (carpet sales) and X (permits issued), the average value of Y tends to be related to the value of X, but it is impossible to predict with certainty the value of Y on the basis of a given value of X.

In our example, the amount of carpet sold tends to increase as the number of building permits issued increases, and this relationship can be used to forecast carpet sales if we know how many building permits were issued last quarter. But this relationship is far from exact. Since we, as forecasters, have not taken into account all the variables affecting carpet sales, it is impossible to forecast with certainty the exact amount of carpet sales.

Simple linear regression describes how the dependent variable is related to the independent variable. Regression derives an equation for a line like the one in Figure 1 which can be used to estimate the dependent variable on the basis of known values of the other variable.

The term "regression" is used because Francis Galton, an English statistician of the last century, compared heights of parents with heights of offspring and found that very tall parents tended to have offspring shorter than their parents but that very short parents tended to have offspring taller than their parents. Thus the heights of offspring tended to "regress" toward some average height of the population. Because Galton used this technique to carry out his study, we now call it "regression" after its first important application.

To carry out regression analysis we need a method for finding the equation that minimizes the average squared deviation of the points from the line. Minimizing the average squared deviation is the method statisticians have agreed "fits" the best line to a set of points. Squaring the deviations places extra weight on large errors and counts positive and negative errors as equally important (since all squared numbers are positive).

The regression program in Listing 1 will allow you to enter your own data, find the equation of the regression line, and forecast individual values of the dependent variable. The program as written is in Applesoft Basic but it is plain enough to run on almost any type of microcomputer. Using this simple program will give us a bit of the flavor of the commercial packages which are much easier to use and more powerful.

To use the program in Listing 1, type the program in, save it to disk for later use, and execute it with a RUN command. The program first prompts you for the number of observations. This would be 18 if we wish to use the data in Table 1.

You are next asked for the number of independent variables; this would be 1 if we wish to use only permits issued last quarter as a predictor.

The program then prompts for the 18 pairs of observations by requesting first the independent variable (permits) and then the dependent variable (carpet sales). After the last observation is entered, the program calculates and displays the equation for the regression line and some summary statistics which we will discuss.
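The calculation Listing 1 performs can be sketched in a modern language. The Python version below is not the magazine's Applesoft program; it is a minimal least-squares sketch, and the permit and sales figures in it are made-up values standing in for Table 1.

```python
def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    # Intercept: the line must pass through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Illustrative data: permits (independent) and carpet sales (dependent).
permits = [200, 250, 300, 350, 400]
sales = [430, 480, 540, 580, 640]

a, b = simple_regression(permits, sales)
print(f"Carpet Sales = {a:.2f} + {b:.4f} (Permits)")
```

With your own 18 observations from Table 1 in place of the illustrative lists, this computes the same equation coefficients the program displays.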

The output in Table 2 shows the results we obtained from the program by entering the carpet sales and permit data. The "equation coefficients" define the regression line as:

Carpet Sales = 232.306478 + 1.00318177 (Permits)

If we were to plot this line on the scatter diagram in Figure 1, it would look much like the "best fit" line which was drawn freehand. The equation may be used to forecast carpet sales by substituting a known value for permits issued last quarter. If you know that permits issued last quarter were 350:

Carpet Sales = 232.306478 + 1.00318177 (350)

or

Carpet Sales = 583.420096

Your estimate of carpet sales this quarter would then be about 583.

Summary Statistics

The summary statistics are measures of the goodness of fit of the regression line. Our simple program provides two measures of goodness of fit: the standard error of the estimate and the R-squared, or coefficient of determination.

We have seen that the regression equation provides forecasts of the dependent variable for given values of the independent variable. The standard error of the estimate is a measure of the amount of scatter about the regression line. A rule of thumb often used by forecasters in interpreting the standard error of the estimate is to say that you may be 95% confident of any estimate of the dependent variable if you "bracket" the estimate by two standard errors.

For instance, using the equation in Table 2 we could estimate carpet sales of about 583 if permits were known to be 350. The standard error of 47.5 would tell us that we could be 95% certain that the real value of carpet sales would be between 583 - 2(47.5) and 583 + 2(47.5), that is, between 488 and 678. If asked how certain you are that real carpet sales would be around the 583 you predicted, you could safely answer that you are 95% certain that actual sales will be between 488 and 678.
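The bracketing rule of thumb amounts to a couple of lines of arithmetic. This sketch uses the point forecast and standard error from the carpet example:

```python
# Point forecast and standard error from the carpet example.
estimate = 583
std_error = 47.5

# Rule of thumb: roughly 95% confident within two standard errors.
low = estimate - 2 * std_error
high = estimate + 2 * std_error
print(f"95% bracket: {low:.0f} to {high:.0f}")  # 488 to 678
```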

Obviously, if the standard error is very small your "bracket" will be small; a large standard error may, likewise, give you such a large "bracket" that the forecast is useless. The standard error will be smaller if you are successful in finding an independent variable that is closely related to the dependent variable. So choose your variables with care!

The other summary statistic produced by the program is variously called the coefficient of determination or the R-squared. This statistic is also a measure of how well the regression equation fits the data. The R-squared for our example is 80.79 percent; that is, 80.79 percent of the variation (i.e. the up and down movements) in carpet sales is explained by variation in permits issued. The remaining 19 percent or so of the variation is unexplained by the regression and probably occurs because other things affect carpet sales.
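The R-squared can be computed directly from the residuals: it is one minus the ratio of unexplained variation to total variation. The data and fitted line below are illustrative values, not the article's Table 1:

```python
def r_squared(x, y, intercept, slope):
    """Fraction of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    # Total variation: squared deviations of y from its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    # Unexplained variation: squared residuals about the fitted line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2
                      for xi, yi in zip(x, y))
    return 1 - ss_residual / ss_total

# Illustrative data and an illustrative fitted line.
x = [200, 250, 300, 350, 400]
y = [430, 480, 540, 580, 640]
print(f"R-squared: {r_squared(x, y, 222.0, 1.04):.4f}")
```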

All of the software packages mentioned in the comparison chart that follows calculate at least these two summary statistics. Many of the packages calculate others as well; the serious forecaster would do well to learn to interpret these statistics in order to use regression analysis correctly.

Multiple Regression Model

Forecasters often wish to use more than one independent or predictor variable; this is where multiple regression becomes a useful technique. Whereas simple linear regression includes only one independent variable, multiple regression includes two or more independent variables. The reason for using more than one independent variable is to be able to predict more accurately the dependent variable. In the case of the carpet manufacturer, the firm may feel that factors other than permits issued have an important effect on carpet sales. For example, it may seem likely that carpet sales will tend to increase if advertising expenditures by the firm also increase: Sales (the dependent variable) would then depend upon Last Quarter Permits (the first independent variable) and Advertising (the second independent variable).

Another reason for using multiple regression instead of simple regression is that if the dependent variable depends upon more than one independent variable, a simple regression may result in a biased estimate of the coefficient of the independent variable. When a dependent variable is a function of more than one independent variable, running a simple regression on any one independent variable may be misleading. To estimate the true effects of any independent variable on the dependent variable, we must include all the independent variables in the regression (i.e. use multiple regression).

The program in Listing 1 also performs multiple regression, and its use is essentially the same as our previous example. Using the advertising data in Table 1 along with the sales and permits data gave us the output in Table 3, where variable (1) is advertising and variable (2) is permits issued last quarter. The regression equation (which now defines a plane in three-space rather than a line in two-space) would be:

Sales = -139.771666 + 11.6434849 (Adv.) + .779639035 (Permits)
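Evaluating this equation for known inputs takes only a few lines. The sketch below uses the Table 3 coefficients; the inputs of 37 for advertising and 300 for permits are illustrative:

```python
# Coefficients from Table 3: intercept, advertising, lagged permits.
INTERCEPT = -139.771666
B_ADV = 11.6434849
B_PERMITS = 0.779639035

def forecast_sales(advertising, permits):
    """Substitute known values into the two-variable regression equation."""
    return INTERCEPT + B_ADV * advertising + B_PERMITS * permits

print(round(forecast_sales(37, 300)))  # 525
```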

Again, we could use this equation to forecast carpet sales this quarter if we knew permits issued last quarter and advertising expenditures. If permits were 300 and advertising were 37:

Sales = -139.771666 + 11.6434849 (37) + .779639035 (300)

or

Sales = 524.928986

Carpet sales would be estimated to be about 525 this quarter.

Summary Statistics Again

The summary statistics are also interpreted in a fashion similar to that used in simple regression. The standard error of the estimate may also be used in multiple regression to construct a confidence interval ("bracket") for any forecast. We may again use our rule of thumb in constructing the interval. We see from Table 3 that the standard error is 37.2085677, which is smaller than our previous standard error with simple regression. That is good for forecasting, because our confidence interval will be narrower; if our estimate of sales is about 524, the confidence interval will be between 524 - 2(37.2085677) and 524 + 2(37.2085677), or about 450 and 598. Note that when the extra independent variable is added to the forecasting equation, the size of the 95% confidence interval decreases.

The coefficient of determination can also be used in multiple regressions to measure how well the regression equation fits the data. Our R-squared of 88.98% from Table 3 indicates a higher explanatory power for this multiple regression. Overall, our multiple regression appears to be a better tool for accurate forecasting than the simple regression.

Time Series Analysis

All business data are either cross section data or time series data. Cross section data pertain to measurement at a given point in time. For instance, data on automobile horsepower, auto weight, and miles per gallon, as listed in Table 4, are cross section data. As such, they could be used to estimate the regression equation also listed in Table 4 relating miles per gallon to the weight/horsepower ratio during a particular period (in this case, 1982).

Time series data pertain to measurements at a number of points in time. Our data on carpet sales in Table 1 are time series data. Sales figures are only one of a number of time series that are used by most companies.

Moving Averages

For some time series there is no simple mathematical function that neatly portrays long-run movement. For example, consider demand for an item as listed in Table 5.

Clearly, this time series does not exhibit a simple linear trend. Instead, demand bounced up and down until period 15 after which it took a large jump up and then continued jumping up and down but always at the new, higher level. In situations like this, forecasting is sometimes carried out using moving averages to "smooth" the time series. "Smooth" is simply a way of saying we are generating a smooth curve which shows the long-term movements in the series.

Consider Table 5 which includes the observed values and smoothed values made by using a moving average (Table 5 is produced with DSM). The smoothed value is a five-period moving average. For example, the smoothed number in period 8 is 50.2. It is calculated as follows: 50.2 = (51 + 60 + 43 + 57 + 40) / 5, where the average is taken over five periods centered on the eighth period. Of course, a five-period moving average is not the only kind. We could, for example, compute a seven-period moving average. The important concept to note is that if the time series data contain fluctuations that tend to recur, that effect can be eliminated by using averages in which the number of periods included equals the period of the fluctuation.
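The centered moving average can be sketched in a few lines of Python. The function itself is our own illustration, not part of DSM; the five demand values are the ones averaged in the period-8 example above:

```python
def centered_moving_average(series, k, t):
    """k-period moving average of series, centered on index t (0-based).
    k must be odd so the window centers exactly on period t."""
    half = k // 2
    window = series[t - half : t + half + 1]
    return sum(window) / k

# Demand for the five periods centered on period 8 (values from Table 5).
demand = [51, 60, 43, 57, 40]
print(centered_moving_average(demand, 5, 2))  # 50.2
```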

The forecast is made by using the latest moving average as the forecast for the next period. Thus, after ten data points had been collected, our forecast of demand in future periods would be 50.2.

Econometric Time Series Models

In recent years, forecasters have been moving from simple forms of extrapolation in time series (such as moving averages) to the construction of systems of equations which show the effects of numerous independent variables on the value they wish to forecast.

The Wharton Model of the University of Pennsylvania is the archetypical example of an econometric time series model. The model in Table 6 is a miniature version of such a model (this table is produced with TK! Solver). The model describes an entire economy (such as the U.S.) with just five equations, each describing a different aspect of the economy.

The individual equations in Table 6 may be estimated using regression analysis or may be "identities" like the G = 45 equation. The entire system of equations may be used to forecast GNP (labeled as Y in the model) given known values of Investment (I) and Government Spending (G).

The solution to the model (which could also be called the forecast) is shown in the TK! Solver output in Table 6. Those values in the "output" column are the values given by the program once the user inputs values for G and I. Given the G and I values, those are the only values that solve the model.

Only TK! Solver of the packages we have seen is capable of working systems of equations in this "automatic" manner, but many spreadsheet programs such as Symphony can be used with models. Consider the following multiple equation model:

Sales = 6.0 + 6.4 (Advertising) + 1.01 (Sales, lagged one period)
Costs = Fixed Costs + Variable Costs
Variable Costs = .75 (Sales)
Fixed Costs = 3000
Earnings Before Taxes = Sales - Costs
Taxes = .48 (EBT)
Profit = EBT - Taxes
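A model of this kind evaluates naturally top to bottom in any language, not just in a spreadsheet. Here is a sketch in Python; the input values for advertising and last period's sales are invented for illustration:

```python
def profit_model(advertising, sales_prev):
    """Evaluate the seven-equation profit model from top to bottom."""
    sales = 6.0 + 6.4 * advertising + 1.01 * sales_prev  # sales equation
    fixed_costs = 3000
    variable_costs = 0.75 * sales
    costs = fixed_costs + variable_costs
    ebt = sales - costs          # earnings before taxes
    taxes = 0.48 * ebt
    profit = ebt - taxes
    return profit

# Hypothetical inputs: advertising of 500, last period's sales of 12000.
print(f"Profit: {profit_model(500, 12000):.2f}")
```

Changing the advertising input and re-running the function mimics the "what if" recalculation a spreadsheet performs.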

A model of this type is quite easy to enter into a spreadsheet program for the purpose of developing a range of forecasts for profit given various estimates of sales. The impact of factors other than sales on profits could also be examined by changing the appropriate information and calculating profits again. The real value of the spreadsheet approach, of course, comes from the ability to work with rather large, complex models.