Overview of Elementary Concepts in Statistics. In this introduction, we will briefly discuss those elementary statistical concepts that provide the necessary foundations for more specialized expertise in any area of statistical data analysis. The selected topics illustrate the basic assumptions of most statistical methods and/or have been demonstrated in research to be necessary components of one's general understanding of the "quantitative nature" of reality (Nisbett, et al., 1987). Because of space limitations, we will focus mostly on the functional aspects of the concepts discussed and the presentation will be very short. Further information on each of those concepts can be found in statistical textbooks. Recommended introductory textbooks are: Kachigan (1986), and Runyon and Haber (1976); for a more advanced discussion of elementary theory and assumptions of statistics, see the classic books by Hays (1988), and Kendall and Stuart (1979).

• What are variables?

• Correlational vs. experimental research

• Dependent vs. independent variables

• Measurement scales

• Relations between variables

• Why relations between variables are important

• Two basic features of every relation between variables

• What is "statistical significance" (p-value)

• How to determine that a result is "really" significant

• Statistical significance and the number of analyses performed

• Strength vs. reliability of a relation between variables

• Why stronger relations between variables are more significant

• Why significance of a relation between variables depends on the size of the sample

• Example: "Baby boys to baby girls ratio"

• Why small relations can be proven significant only in large samples

• Can "no relation" be a significant result?

• How to measure the magnitude (strength) of relations between variables

• Common "general format" of most statistical tests

• How the "level of statistical significance" is calculated

• Why the "Normal distribution" is important

• Illustration of how the normal distribution is used in statistical reasoning (induction)

• Are all test statistics normally distributed?

• How do we know the consequences of violating the normality assumption?

What are variables? Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.


Correlational vs. experimental research. Most empirical research belongs clearly to one of those two general categories. In correlational research we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables; for example, a researcher might artificially increase blood pressure and then record cholesterol level. Data analysis in experimental research also comes down to calculating "correlations" between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: Only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that "A influences B." Data from correlational research can only be "interpreted" in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.


Dependent vs. independent variables. Independent variables are those that are manipulated whereas dependent variables are only measured or registered. This distinction appears terminologically confusing to many because, as some students say, "all variables depend on something." However, once you get used to this distinction, it becomes indispensable. The terms dependent and independent variable apply mostly to experimental research where some variables are manipulated, and in this sense they are "independent" from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on the manipulation or experimental conditions. That is to say, they depend on "what the subject will do" in response. Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to "experimental groups" based on some pre-existing properties of the subjects. For example, if in an experiment, males are compared with females regarding their white cell count (WCC), Gender could be called the independent variable and WCC the dependent variable.


Measurement scales. Variables differ in "how well" they can be measured, i.e., in how much measurable information their measurement scale can provide. Every measurement involves some measurement error, which limits the "amount of information" that we can obtain. Another factor that determines the amount of information that can be provided by a variable is its "type of measurement scale." Specifically, variables are classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.

a. Nominal variables allow for only qualitative classification. That is, they can be measured only in terms of whether the individual items belong to some distinctively different categories, but we cannot quantify or even rank order those categories. For example, all we can say is that 2 individuals are different in terms of variable A (e.g., they are of different race), but we cannot say which one "has more" of the quality represented by the variable. Typical examples of nominal variables are gender, race, color, city, etc.

b. Ordinal variables allow us to rank order the items we measure in terms of which has less and which has more of the quality represented by the variable, but still they do not allow us to say "how much more." A typical example of an ordinal variable is the socioeconomic status of families. For example, we know that upper-middle is higher than middle but we cannot say that it is, for example, 18% higher. Also this very distinction between nominal, ordinal, and interval scales itself represents a good example of an ordinal variable. For example, we can say that nominal measurement provides less information than ordinal measurement, but we cannot say "how much less" or how this difference compares to the difference between ordinal and interval scales.

c. Interval variables allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of differences between them. For example, temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.

d. Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, thus they allow for statements such as x is two times more than y. Typical examples of ratio scales are measures of time or space. For example, as the Kelvin temperature scale is a ratio scale, not only can we say that a temperature of 200 degrees is higher than one of 100 degrees, we can correctly state that it is twice as high. Interval scales do not have the ratio property. Most statistical data analysis procedures do not distinguish between the interval and ratio properties of the measurement scales.
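
The difference between the interval and ratio properties can be checked numerically. The following is a minimal sketch (the helper name celsius_to_kelvin and the three temperatures are ours, chosen only for illustration): ratios of differences are the same on the Celsius and Kelvin scales, whereas a statement such as "twice as high" holds only on the scale with an absolute zero.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a shift of the zero point by 273.15)."""
    return c + 273.15

# Illustrative temperatures in degrees Celsius (hypothetical values).
t_low, t_mid, t_high = 20.0, 30.0, 40.0

# Interval property: the ratio of two differences does not depend on the zero point.
ratio_of_differences_c = (t_high - t_low) / (t_high - t_mid)
ratio_of_differences_k = (
    (celsius_to_kelvin(t_high) - celsius_to_kelvin(t_low))
    / (celsius_to_kelvin(t_high) - celsius_to_kelvin(t_mid))
)

# Ratio property: "x is two times more than y" requires an absolute zero.
naive_ratio_c = t_high / t_low  # 2.0, but this is an artifact of the Celsius zero
ratio_k = celsius_to_kelvin(t_high) / celsius_to_kelvin(t_low)  # about 1.07

print(ratio_of_differences_c, ratio_of_differences_k)
print(naive_ratio_c, ratio_k)
```

In this sketch, 40 degrees Celsius is not "twice as warm" as 20 degrees Celsius: on the Kelvin scale the ratio is only about 1.07.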


Relations between variables. Regardless of their type, two or more variables are related if in a sample of observations, the values of those variables are distributed in a consistent manner. In other words, variables are related if their values systematically correspond to each other for these observations. For example, Gender and WCC would be considered to be related if most males had high WCC and most females low WCC, or vice versa; Height is related to Weight because typically tall individuals are heavier than short ones; IQ is related to the Number of Errors in a test, if people with higher IQ's make fewer errors.


Why relations between variables are important. Generally speaking, the ultimate goal of every research or scientific analysis is finding relations between variables. The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities; either way involves relations between variables. Thus, the advancement of science must always involve finding new relations between variables. Correlational research involves measuring such relations in the most straightforward manner. However, experimental research is not any different in this respect. For example, the above mentioned experiment comparing WCC in males and females can be described as looking for a correlation between two variables: Gender and WCC. Statistics does nothing else but help us evaluate relations between variables. Actually, all of the hundreds of procedures that are described in this manual can be interpreted in terms of evaluating various kinds of inter-variable relations.


Two basic features of every relation between variables. The two most elementary formal properties of every relation between variables are the relation's (a) magnitude (or "size") and (b) its reliability (or "truthfulness").

a. Magnitude (or "size"). The magnitude is much easier to understand and measure than reliability. For example, if every male in our sample was found to have a higher WCC than any female in the sample, we could say that the magnitude of the relation between the two variables (Gender and WCC) is very high in our sample. In other words, we could predict one based on the other (at least among the members of our sample).

b. Reliability (or "truthfulness"). The reliability of a relation is a much less intuitive concept, but still extremely important. It pertains to the "representativeness" of the result found in our specific sample for the entire population. In other words, it says how probable it is that a similar relation would be found if the experiment was replicated with other samples drawn from the same population. Remember that we are almost never "ultimately" interested only in what is going on in our sample; we are interested in the sample only to the extent it can provide information about the population. If our study meets some specific criteria (to be mentioned later), then the reliability of a relation between variables observed in our sample can be quantitatively estimated and represented using a standard measure (technically called p-value or statistical significance level, see the next paragraph).


What is "statistical significance" (p-value). The statistical significance of a result is the probability that a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one observed in the sample would occur by pure chance ("luck of the draw") if, in the population from which the sample was drawn, no such relationship or difference existed. Using less technical terms, one could say that the statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population"). More technically, the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. The p-value is often loosely described as the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, a p-value of .05 (i.e., 1/20) indicates that, if there were no relation between the variables in the population, a relation as strong as the one found in our sample would arise as a "fluke" with a probability of 5%. In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments like ours one after another, we could expect that approximately in every 20 replications of the experiment there would be one in which the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design. See also Power Analysis.) In many areas of research, the p-value of .05 is customarily treated as a "border-line acceptable" error level.
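
The "one in every 20 replications" reasoning can be illustrated with a small simulation. The sketch below rests on assumptions that are ours, not the text's: both variables are independent standard normal scores, each replication uses 20 subjects, and the observed correlation is r = .444, which standard tables list as roughly the two-tailed .05 critical value for that sample size. The function names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fraction_as_strong(observed_r, n, replications=4000, seed=1):
    """Share of replications, drawn from a population with NO relation,
    in which the sample correlation is at least as strong as observed_r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of x
        if abs(pearson_r(x, y)) >= abs(observed_r):
            hits += 1
    return hits / replications

simulated_p = fraction_as_strong(0.444, n=20)
print(simulated_p)  # expected to land near .05, i.e., about 1 replication in 20
```

The simulated fraction plays the role of the p-value: it is computed entirely under the assumption that no relation exists in the population.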


How to determine that a result is "really" significant. There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really "significant." That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori or only found post hoc in the course of many analyses and comparisons performed on the data set, on the total amount of consistent supportive evidence in the entire data set, and on "traditions" existing in the particular area of research. Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically significant, but remember that this level of significance still involves a pretty high probability of error (5%). Results that are significant at the p ≤ .01 level are commonly considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called "highly" significant. But remember that those classifications represent nothing else but arbitrary conventions that are only informally based on general research experience.


Statistical significance and the number of analyses performed. Needless to say, the more analyses you perform on a data set, the more results will meet "by chance" the conventional significance level. For example, if you calculate correlations between ten variables (i.e., 45 different correlation coefficients), then you should expect to find by chance that about two (i.e., one in every 20) correlation coefficients are significant at the p ≤ .05 level, even if the values of the variables were totally random and those variables do not correlate in the population. Some statistical methods that involve many comparisons, and thus a good chance for such errors, include some "correction" or adjustment for the total number of comparisons. However, many statistical methods (especially simple exploratory data analyses) do not offer any straightforward remedies to this problem. Therefore, it is up to the researcher to carefully evaluate the reliability of unexpected findings. Many examples in this manual offer specific advice on how to do this; relevant information can also be found in most research methods textbooks.
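
The arithmetic behind the ten-variable example can be written out as follows. This is only a sketch: the probability of at least one chance finding assumes that the 45 tests are independent (which correlations among the same ten variables are not, strictly speaking), and the Bonferroni adjustment is shown as one common correction of our choosing, not one prescribed by the text.

```python
from math import comb

n_variables = 10
alpha = 0.05  # conventional significance level

# Number of distinct pairs of variables, i.e., correlation coefficients.
n_correlations = comb(n_variables, 2)  # 45

# Expected number of coefficients that reach significance by chance alone.
expected_chance_findings = n_correlations * alpha  # 2.25, "about two"

# Assumption: independent tests. Probability that at least one is significant.
p_at_least_one = 1 - (1 - alpha) ** n_correlations  # roughly 0.90

# One common (conservative) adjustment: test each coefficient at alpha / m.
bonferroni_alpha = alpha / n_correlations

print(n_correlations, expected_chance_findings)
print(p_at_least_one, bonferroni_alpha)
```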


Strength vs. reliability of a relation between variables. We said before that strength and reliability are two different features of relationships between variables. However, they are not totally independent. In general, in a sample of a particular size, the larger the magnitude of the relation between variables, the more reliable the relation (see the next paragraph).


Why stronger relations between variables are more significant. Assuming that there is no relation between the respective variables in the population, the most likely outcome would be also finding no relation between those variables in the research sample. Thus, the stronger the relation found in the sample, the less likely it is that there is no corresponding relation in the population. As you see, the magnitude and significance of a relation appear to be closely related, and we could calculate the significance from the magnitude and vice-versa; however, this is true only if the sample size is kept constant, because the relation of a given strength could be either highly significant or not significant at all, depending on the sample size (see the next paragraph).



Why significance of a relation between variables depends on the size of the sample. If there are very few observations, then there are also respectively few possible combinations of the values of the variables, and thus the probability of obtaining by chance a combination of those values indicative of a strong relation is relatively high. Consider the following illustration. If we are interested in two variables (Gender: male/female and WCC: high/low) and there are only four subjects in our sample (two males and two females), then the probability that we will find, purely by chance, a 100% relation between the two variables can be as high as one-eighth. Specifically, there is a one-in-eight chance that both males will have a high WCC and both females a low WCC, or vice versa. Now consider the probability of obtaining such a perfect match by chance if our sample consisted of 100 subjects; the probability of obtaining such an outcome by chance would be practically zero. Let's look at a more general example. Imagine a theoretical population in which the average value of WCC in males and females is exactly the same. Needless to say, if we start replicating a simple experiment by drawing pairs of samples (of males and females) of a particular size from this population and calculating the difference between the average WCC in each pair of samples, most of the experiments will yield results close to 0. However, from time to time, a pair of samples will be drawn where the difference between males and females will be quite different from 0. How often will it happen? The smaller the sample size in each experiment, the more likely it is that we will obtain such erroneous results, which in this case would be results indicative of the existence of a relation between gender and WCC obtained from a population in which such a relation does not exist.
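
The one-in-eight figure can be verified by listing every possible outcome. The sketch below assumes that each subject is, independently, equally likely to have a high or a low WCC; the function names are ours.

```python
from itertools import product

def prob_perfect_relation(n_males, n_females):
    """Enumerate all equally likely high/low patterns and return the share
    in which gender and WCC are perfectly related (in either direction)."""
    total = 0
    perfect = 0
    for outcome in product(("high", "low"), repeat=n_males + n_females):
        males = outcome[:n_males]
        females = outcome[n_males:]
        total += 1
        males_high = all(m == "high" for m in males) and all(f == "low" for f in females)
        males_low = all(m == "low" for m in males) and all(f == "high" for f in females)
        if males_high or males_low:
            perfect += 1
    return perfect / total

def prob_perfect_relation_closed_form(n_subjects):
    """Same probability without enumeration: 2 favorable patterns out of 2**n."""
    return 2 * 0.5 ** n_subjects

print(prob_perfect_relation(2, 2))             # 0.125, i.e., one in eight
print(prob_perfect_relation_closed_form(100))  # practically zero
```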


Example. "Baby boys to baby girls ratio." Consider the following example from research on statistical reasoning (Nisbett, et al., 1987). There are two hospitals: in the first one, 120 babies are born every day, in the other, only 12. On average, the ratio of baby boys to baby girls born every day in each hospital is 50/50. However, one day, in one of those hospitals twice as many baby girls were born as baby boys. In which hospital was it more likely to happen? The answer is obvious for a statistician, but as research shows, not so obvious for a lay person: It is much more likely to happen in the small hospital. The reason for this is that technically speaking, the probability of a random deviation of a particular size (from the population mean), decreases with the increase in the sample size.
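
The hospital example can be made concrete with the binomial distribution. In this sketch we assume that each birth is independently a boy or a girl with probability 1/2, and we read "twice as many baby girls as baby boys" as "at least twice as many," that is, boys make up at most one third of the day's births. Both choices are our assumptions.

```python
from math import comb

def prob_boys_at_most(k, n, p=0.5):
    """Binomial probability of observing k or fewer boys among n births."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Small hospital: 12 births, at most 4 boys (8 or more girls).
p_small = prob_boys_at_most(4, 12)

# Large hospital: 120 births, at most 40 boys (80 or more girls).
p_large = prob_boys_at_most(40, 120)

print(p_small)  # roughly 0.19
print(p_large)  # far below 0.001
```

Under these assumptions, such a day is a fairly ordinary event in the small hospital and a very rare one in the large hospital.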


Why small relations can be proven significant only in large samples. The examples in the previous paragraphs indicate that if a relationship between the variables in question is "objectively" (i.e., in the population) small, then there is no way to identify such a relation in a study unless the research sample is correspondingly large. Even if our sample is in fact "perfectly representative," the effect will not be statistically significant if the sample is small. Analogously, if a relation in question is "objectively" very large (i.e., in the population), then it can be found to be highly significant even in a study based on a very small sample. Consider the following additional illustration. If a coin is slightly asymmetrical, and when tossed is somewhat more likely to produce heads than tails (e.g., 60% vs. 40%), then ten tosses would not be sufficient to convince anyone that the coin is asymmetrical, even if the outcome obtained (six heads and four tails) was perfectly representative of the bias of the coin. Does this mean, however, that ten tosses can never be enough to prove anything? No; if the effect in question were large enough, then ten tosses could be quite enough. For instance, imagine now that the coin is so asymmetrical that no matter how you toss it, the outcome will be heads. If you tossed such a coin ten times and each toss produced heads, most people would consider it sufficient evidence that something is "wrong" with the coin. In other words, it would be considered convincing evidence that in the theoretical population of an infinite number of tosses of this coin there would be more heads than tails. Thus, if a relation is large, then it can be found to be significant even in a small sample.
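
The two coin outcomes can be compared by computing how likely each would be for a perfectly fair coin. The sketch below uses a one-sided exact binomial probability; the choice of a one-sided calculation and the function name are ours.

```python
from math import comb

def prob_at_least_heads(k, n, p=0.5):
    """Probability of k or more heads in n tosses of a coin with P(heads) = p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Six heads in ten tosses: quite likely even for a fair coin.
p_six_of_ten = prob_at_least_heads(6, 10)   # 386/1024, about 0.38

# Ten heads in ten tosses: very unlikely for a fair coin.
p_ten_of_ten = prob_at_least_heads(10, 10)  # 1/1024, about 0.001

print(p_six_of_ten, p_ten_of_ten)
```

Six heads out of ten would occur by chance more than a third of the time, so it convinces no one; ten heads out of ten would occur about once in a thousand such experiments.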


Can "no relation" be a significant result? The smaller the relation between variables, the larger the sample size that is necessary to prove it significant. For example, imagine how many tosses would be necessary to prove that a coin is asymmetrical if its bias were only .000001%! Thus, the necessary minimum sample size increases as the magnitude of the effect to be demonstrated decreases. When the magnitude of the effect approaches 0, the necessary sample size to conclusively prove it approaches infinity. That is to say, if there is almost no relation between two variables, then the sample size must be almost equal to the population size, which is assumed to be infinitely large. Statistical significance represents the probability that a similar outcome would be obtained if we tested the entire population. Thus, everything that would be found after testing the entire population would be, by definition, significant at the highest possible level, and this also includes all "no relation" results.
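
How quickly the necessary sample size grows can be sketched for the coin example. The calculation below makes several simplifying assumptions of our own: it uses the normal approximation to the binomial, a two-tailed .05 criterion (critical value of about 1.96), and a "perfectly representative" sample whose proportion of heads equals the true one. It ignores statistical power altogether.

```python
import math

Z_CRIT = 1.959964  # two-tailed .05 critical value of the standard normal

def tosses_needed(bias):
    """Smallest number of tosses at which a perfectly representative sample
    from a coin with P(heads) = 0.5 + bias reaches significance against the
    fair-coin hypothesis (normal approximation).

    z = bias / sqrt(0.25 / n) >= Z_CRIT  implies  n >= (Z_CRIT * 0.5 / bias) ** 2
    """
    return math.ceil((Z_CRIT * 0.5 / bias) ** 2)

for bias in (0.4, 0.1, 0.01, 0.001, 1e-8):
    print(bias, tosses_needed(bias))
```

Under these assumptions the 60% vs. 40% coin (a bias of 0.1) needs on the order of a hundred tosses, and each tenfold reduction of the bias multiplies the required number of tosses by about one hundred.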


How to measure the magnitude (strength) of relations between variables. There are very many measures of the magnitude of relationships between variables which have been developed by statisticians; the choice of a specific measure in given circumstances depends on the number of variables involved, measurement scales used, nature of the relations, etc. Almost all of them, however, follow one general principle: they attempt to somehow evaluate the observed relation by comparing it to the "maximum imaginable relation" between those specific variables. Technically speaking, a common way to perform such evaluations is to look at how differentiated are the values of the variables, and then calculate what part of this "overall available differentiation" is accounted for by instances when that differentiation is "common" in the two (or more) variables in question. Speaking less technically, we compare "what is common in those variables" to "what potentially could have been common if the variables were perfectly related." Let us consider a simple illustration. Let us say that in our sample, the average index of WCC is 100 in males and 102 in females. Thus, we could say that on average, the deviation of each individual score from the grand mean (101) contains a component due to the gender of the subject; the size of this component is 1. That value, in a sense, represents some measure of relation between Gender and WCC. However, this value is a very poor measure, because it does not tell us how relatively large this component is, given the "overall differentiation" of WCC scores. Consider two extreme possibilities:

a. If all WCC scores of males were equal exactly to 100, and those of females equal to 102, then all deviations from the grand mean in our sample would be entirely accounted for by gender. We would say that in our sample, gender is perfectly correlated with WCC, that is, 100% of the observed differences between subjects regarding their WCC is accounted for by their gender.

b. If WCC scores were in the range of 0-1000, the same difference (of 2) between the average WCC of males and females found in the study would account for such a small part of the overall differentiation of scores that most likely it would be considered negligible. For example, one more subject taken into account could change, or even reverse the direction of the difference. Therefore, every good measure of relations between variables must take into account the overall differentiation of individual scores in the sample and evaluate the relation in terms of (relatively) how much of this differentiation is accounted for by the relation in question.
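
The two extreme possibilities can be expressed as a ratio of the differentiation accounted for by gender to the overall differentiation of scores. The sketch below uses sums of squared deviations for this purpose; the WCC scores are hypothetical and the function name is ours.

```python
def explained_variation_ratio(groups):
    """Share of the total sum of squared deviations from the grand mean
    that is accounted for by differences between the group means."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total

# Case (a): every male scores 100 and every female 102.
males_a, females_a = [100, 100, 100, 100], [102, 102, 102, 102]
ratio_a = explained_variation_ratio([males_a, females_a])

# Case (b): the same group means (100 and 102), but widely scattered scores.
males_b, females_b = [0, 50, 150, 200], [2, 52, 152, 202]
ratio_b = explained_variation_ratio([males_b, females_b])

print(ratio_a)  # 1.0: gender accounts for all of the differentiation
print(ratio_b)  # about 0.00016: the same difference of 2 is negligible
```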


Common "general format" of most statistical tests. Because the ultimate goal of most statistical tests is to evaluate relations between variables, most statistical tests follow the general format that was explained in the previous paragraph. Technically speaking, they represent a ratio of some measure of the differentiation common in the variables in question to the overall differentiation of those variables. For example, they represent a ratio of the part of the overall differentiation of the WCC scores that can be accounted for by gender to the overall differentiation of the WCC scores. This ratio is usually called a ratio of explained variation to total variation. In statistics, the term explained variation does not necessarily imply that we "conceptually understand" it. It is used only to denote the common variation in the variables in question, that is, the part of variation in one variable that is "explained" by the specific values of the other variable, and vice versa.


How the "level of statistical significance" is calculated. Let us assume that we have already calculated a measure of a relation between two variables (as explained above). The next question is "how significant is this relation?" For example, is 40% of the explained variance between the two variables enough to consider the relation significant? The answer is "it depends." Specifically, the significance depends mostly on the sample size. As explained before, in very large samples, even very small relations between variables will be significant, whereas in very small samples even very large relations cannot be considered reliable (significant). Thus, in order to determine the level of statistical significance, we need a function that represents the relationship between "magnitude" and "significance" of relations between two variables, depending on the sample size. The function we need would tell us exactly "how likely it is to obtain a relation of a given magnitude (or larger) from a sample of a given size, assuming that there is no such relation between those variables in the population." In other words, that function would give us the significance (p) level, and it would tell us the probability of error involved in rejecting the idea that the relation in question does not exist in the population. This hypothesis (that there is no relation in the population) is usually called the null hypothesis. It would be ideal if the probability function were linear and, for example, only had different slopes for different sample sizes. Unfortunately, the function is more complex, and is not always exactly the same; however, in most cases we know its shape and can use it to determine the significance levels for our findings in samples of a particular size. Most of those functions are related to a general type of function which is called normal.
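
One such function can be sketched for a correlation coefficient. The example below takes a relation with 40% of explained variance (r equal to the square root of .40) and computes an approximate two-tailed p-value for several sample sizes. It uses the Fisher z-transformation with a normal approximation, which is our choice for illustration because it needs only the standard library; it is not necessarily the function a statistical package would use.

```python
import math

def approx_p_value(r, n):
    """Approximate two-tailed p-value for a sample correlation r based on
    n observations (Fisher z-transformation, normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

r = math.sqrt(0.40)  # 40% of explained variance

for n in (5, 10, 20, 50):
    print(n, approx_p_value(r, n))
```

With the magnitude held constant, the approximate p-value is far from significant for five observations and becomes extremely small for fifty.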


Why the "Normal distribution" is important. The "Normal distribution" is important because in most cases, it well approximates the function that was introduced in the previous paragraph (for a detailed illustration, see Are all test statistics normally distributed?). The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the Normal distribution represents one of the empirically verified elementary "truths about the general nature of reality," and its status can be compared to the one of fundamental laws of natural sciences. The exact shape of the normal distribution (the characteristic "bell curve") is defined by a function which has only two parameters: mean and standard deviation.

A characteristic property of the Normal distribution is that 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and a range of ±2 standard deviations includes 95% of the scores. In other words, in a Normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) If you have access to STATISTICA, you can explore the exact values of probability associated with different values in the normal distribution using the interactive Probability Calculator tool; for example, if you enter the Z value (i.e., standardized value) of 4, the associated probability computed by STATISTICA will be less than .0001, because in the normal distribution almost all observations (i.e., more than 99.99%) fall within the range of ±4 standard deviations. The animation below shows the tail area associated with other Z values.
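
The percentages quoted above can be reproduced without any statistical package. The sketch below relies only on the error function from the Python standard library; the function names and the illustrative score of 120 are ours.

```python
import math

def standardize(value, mean, sd):
    """Express a value as its difference from the mean in standard deviations."""
    return (value - mean) / sd

def prob_within(z):
    """Probability that a standard normal observation falls within +/- z."""
    return math.erf(z / math.sqrt(2))

print(prob_within(1))      # about 0.6827
print(prob_within(2))      # about 0.9545
print(1 - prob_within(4))  # about 0.00006, i.e., less than .0001

# A hypothetical score of 120 on a scale with mean 100 and SD 10 lies at Z = 2.
print(standardize(120, mean=100, sd=10))
```

Note that a range of exactly ±2 standard deviations covers slightly more than 95% of observations; the 95% figure corresponds more precisely to about ±1.96.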


Illustration of how the normal distribution is used in statistical reasoning (induction). Recall the example discussed above, where pairs of samples of males and females were drawn from a population in which the average value of WCC in males and females was exactly the same. Although the most likely outcome of such experiments (one pair of samples per experiment) was that the difference between the average WCC in males and females in each pair is close to zero, from time to time, a pair of samples will be drawn where the difference between males and females is quite different from 0. How often does it happen? If the sample size is large enough, the results of such replications are "normally distributed" (this important principle is explained and illustrated in the next paragraph), and thus knowing the shape of the normal curve, we can precisely calculate the probability of obtaining "by chance" outcomes representing various levels of deviation from the hypothetical population mean of 0. If such a calculated probability is so low that it meets the previously accepted criterion of statistical significance, then we have only one choice: conclude that our result gives a better approximation of what is going on in the population than the "null hypothesis" (remember that the null hypothesis was considered only for "technical reasons" as a benchmark against which our empirical result was evaluated). Note that this entire reasoning is based on the assumption that the shape of the distribution of those "replications" (technically, the "sampling distribution") is normal. This assumption is discussed in the next paragraph.
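
The reasoning can be sketched for the difference between the average WCC of males and females. The example assumes, for simplicity, that the standard deviation of WCC in the population is known and equal in both groups, that the groups are of equal size, and that the sampling distribution is normal; the numbers are hypothetical.

```python
import math

def two_sided_p_for_mean_difference(diff, sd, n_per_group):
    """Probability of obtaining by chance a difference between two sample
    means at least as large as diff, if the population difference is 0.
    Assumes a known common population SD and a normal sampling distribution."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

# The same observed difference of 2 (with SD = 10) in samples of two sizes.
p_small_samples = two_sided_p_for_mean_difference(2.0, sd=10.0, n_per_group=20)
p_large_samples = two_sided_p_for_mean_difference(2.0, sd=10.0, n_per_group=200)

print(p_small_samples)  # roughly 0.53
print(p_large_samples)  # roughly 0.046
```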


Are all test statistics normally distributed? Not all, but most of them are either based on the normal distribution directly or on distributions that are related to, and can be derived from, the normal distribution, such as t, F, or Chi-square. Typically, those tests require that the variables analyzed are themselves normally distributed in the population, that is, they meet the so-called "normality assumption." Many observed variables actually are normally distributed, which is another reason why the normal distribution represents a "general feature" of empirical reality. The problem may occur when one tries to use a normal distribution-based test to analyze data from variables that are themselves not normally distributed (see tests of normality in Nonparametrics or ANOVA/MANOVA). In such cases we have two general choices. First, we can use some alternative "nonparametric" test (a so-called "distribution-free" test; see Nonparametrics); but this is often inconvenient because such tests are typically less powerful and less flexible in terms of the types of conclusions that they can provide. Alternatively, in many cases we can still use the normal distribution-based test if we only make sure that the size of our samples is large enough. The latter option is based on an extremely important principle which is largely responsible for the popularity of tests that are based on the normal function. Namely, as the sample size increases, the shape of the sampling distribution (i.e., distribution of a statistic from the sample; this term was first used by Fisher, 1928a) approaches the normal shape, even if the distribution of the variable in question is not normal. This principle is illustrated in the following animation showing a series of sampling distributions (created with gradually increasing sample sizes of: 2, 5, 10, 15, and 30) using a variable that is clearly non-normal in the population, that is, the distribution of its values is clearly skewed.

However, as the sample size (of samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal. Note that for n=30, the shape of that distribution is "almost" perfectly normal (see the close match of the fit). This principle is called the central limit theorem (this term was first used by Pólya, 1920; German, "Zentraler Grenzwertsatz").
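The principle can also be checked with a small simulation. The sketch below is an illustration only: it assumes an exponential population (one convenient example of a clearly skewed variable), draws repeated samples of the sizes mentioned above, and reports the skewness of the resulting sampling distribution of the mean. The seed, the number of replications, and the function names are our own choices.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetrical distribution."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sampling_distribution_of_mean(n, replications, rng):
    """Means of `replications` samples of size n from a skewed population."""
    return [mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(replications)]

rng = random.Random(12345)  # fixed seed so the run is repeatable
skew_by_n = {}
for n in (2, 5, 10, 15, 30):
    means = sampling_distribution_of_mean(n, 4000, rng)
    skew_by_n[n] = skewness(means)
    print("n =", n, "skewness of sample means =", round(skew_by_n[n], 2))
```

The skewness of the sample means should shrink toward 0 as n grows, which is the behavior the central limit theorem describes.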

How do we know the consequences of violating the normality assumption? Although many of the statements made in the preceding paragraphs can be proven mathematically, some of them do not have theoretical proofs and can be demonstrated only empirically, via so-called Monte-Carlo experiments. In these experiments, large numbers of samples are generated by a computer following predesigned specifications and the results from such samples are analyzed using a variety of tests. This way we can empirically evaluate the type and magnitude of errors or biases to which we are exposed when certain theoretical assumptions of the tests we are using are not met by our data. Specifically, Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research.
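A very small Monte Carlo experiment of this kind might look as follows. It is a sketch under several assumptions of ours: both groups are drawn from the same skewed (exponential) population, so the null hypothesis is true; each group has n=30; and the two-tailed .05 critical value of t for 58 degrees of freedom is taken to be approximately 2.0017. If the test is robust, it should reject in roughly 5% of the replications.

```python
import random
from statistics import mean, variance

def pooled_t(a, b):
    """t statistic for two independent samples with a pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

T_CRIT = 2.0017  # assumed two-tailed .05 critical value of t, df = 58
rng = random.Random(2024)
replications = 2000
rejections = 0
for _ in range(replications):
    # Both samples come from the same non-normal population.
    a = [rng.expovariate(1.0) for _ in range(30)]
    b = [rng.expovariate(1.0) for _ in range(30)]
    if abs(pooled_t(a, b)) > T_CRIT:
        rejections += 1

false_positive_rate = rejections / replications
print("proportion of false rejections:", false_positive_rate)
```

Changing the population, the sample sizes, or the ratio of group variances in such a script is how one would explore which violations matter most.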

© Copyright StatSoft, Inc., 1984-2003

Basic Statistics

• Descriptive statistics

o "True" Mean and Confidence Interval

o Shape of the Distribution, Normality

• Correlations

o Purpose (What is Correlation?)

o Simple Linear Correlation (Pearson r)

o How to Interpret the Values of Correlations

o Significance of Correlations

o Outliers

o Quantitative Approach to Outliers

o Correlations in Non-homogeneous Groups

o Nonlinear Relations between Variables

o Measuring Nonlinear Relations

o Exploratory Examination of Correlation Matrices

o Casewise vs. Pairwise Deletion of Missing Data

o How to Identify Biases Caused by Pairwise Deletion of Missing Data

o Pairwise Deletion of Missing Data vs. Mean Substitution

o Spurious Correlations

o Are correlation coefficients "additive?"

o How to Determine Whether Two Correlation Coefficients are Significant

• t-test for independent samples

o Purpose, Assumptions

o Arrangement of Data

o t-test graphs

o More Complex Group Comparisons

• t-test for dependent samples

o Within-group Variation

o Purpose

o Assumptions

o Arrangement of Data

o Matrices of t-tests

o More Complex Group Comparisons

• Breakdown: Descriptive statistics by groups

o Purpose

o Arrangement of Data

o Statistical Tests in Breakdowns

o Other Related Data Analysis Techniques

o Post-Hoc Comparisons of Means

o Breakdowns vs. Discriminant Function Analysis

o Breakdowns vs. Frequency Tables

o Graphical breakdowns

• Frequency tables

o Purpose

o Applications

• Crosstabulation and stub-and-banner tables

o Purpose and Arrangement of Table

o 2x2 Table

o Marginal Frequencies

o Column, Row, and Total Percentages

o Graphical Representations of Crosstabulations

o Stub-and-Banner Tables

o Interpreting the Banner Table

o Multi-way Tables with Control Variables

o Graphical Representations of Multi-way Tables

o Statistics in crosstabulation tables

o Multiple responses/dichotomies

Descriptive Statistics

"True" Mean and Confidence Interval. Probably the most often used descriptive statistic is the mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. As mentioned earlier, usually we are interested in statistics (such as the mean) from our sample only to the extent to which they allow us to infer information about the population. The confidence intervals for the mean give us a range of values around the mean where we expect the "true" (population) mean to be located (with a given level of certainty, see also Elementary Concepts). For example, if the mean in your sample is 23, and the lower and upper limits of the p=.05 confidence interval are 19 and 27 respectively, then you can conclude with 95% confidence that the population mean is greater than 19 and lower than 27. If you set the p-level to a smaller value, then the interval would become wider thereby increasing the "certainty" of the estimate, and vice versa; as we all know from the weather forecast, the more "vague" the prediction (i.e., wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. The larger the sample size, the more reliable its mean. The larger the variation, the less reliable the mean (see also Elementary Concepts). The calculation of confidence intervals is based on the assumption that the variable is normally distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large, say n=100 or more.
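As a rough sketch of how such an interval is obtained, the code below computes confidence limits for the mean of a small hypothetical sample. Note the assumption: it uses the normal distribution for the critical value, which is a large-sample approximation; for small samples, the t distribution would give somewhat wider limits. The data and the function name are ours.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, p_level=0.05):
    """Approximate confidence limits for the mean (normal approximation)."""
    m = mean(values)
    standard_error = stdev(values) / len(values) ** 0.5
    z = NormalDist().inv_cdf(1.0 - p_level / 2.0)
    return m - z * standard_error, m + z * standard_error

# Hypothetical scores, used only for illustration.
scores = [18, 25, 21, 30, 23, 19, 27, 22, 24, 21]
low95, high95 = mean_confidence_interval(scores, p_level=0.05)
low99, high99 = mean_confidence_interval(scores, p_level=0.01)
print("p=.05 limits:", round(low95, 2), round(high95, 2))
print("p=.01 limits:", round(low99, 2), round(high99, 2))
```

Setting the p-level to the smaller value produces the wider interval, as described above.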

Shape of the Distribution, Normality. An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution (see the animation below for an example of this distribution) (see also Elementary Concepts). Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0.
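For illustration, skewness and kurtosis can be computed from the standardized third and fourth moments. The sketch below uses the simple moment-based definitions; statistical packages often apply small-sample corrections, so their values may differ slightly. The function names and data are ours.

```python
from statistics import mean, pstdev

def skewness(values):
    """Average cubed standardized deviation; 0 for a symmetrical distribution."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Average fourth-power standardized deviation minus 3, so that the
    normal distribution has a kurtosis of 0."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```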

More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilk W test). However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable).

The graph allows you to evaluate the normality of the empirical distribution because it also shows the normal curve superimposed over the histogram. It also allows you to examine various aspects of the distribution qualitatively. For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such cases, in order to understand the nature of the variable in question, you should look for a way to quantitatively identify the two sub-samples.

Correlations

Purpose (What is Correlation?) Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

The most widely used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

Simple Linear Correlation (Pearson r). Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales (see Elementary Concepts), and it determines the extent to which values of the two variables are "proportional" to each other. The value of correlation (i.e., correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).

This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances will have important functional consequences on how the value of the correlation coefficient reacts to various specific arrangements of data (as we will later see).
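The two ideas above (the correlation coefficient and the least squares line) can be sketched directly from their definitions. The height and weight values are hypothetical, and the function names are ours; the point of the example is that converting inches and pounds to centimeters and kilograms leaves r unchanged.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def least_squares_line(x, y):
    """Slope and intercept minimizing the sum of squared vertical distances."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical heights (inches) and weights (pounds).
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 180]
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)
print(round(r_imperial, 4), round(r_metric, 4))
print(least_squares_line(height_in, weight_lb))
```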

How to Interpret the Values of Correlations. As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r², the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary Concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and when the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions. There are, however, much more common and serious threats to the validity of information that a correlation coefficient can provide; they are briefly discussed in the following paragraphs.
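To illustrate how the significance of a correlation depends on sample size, the sketch below uses the common textbook conversion of r to a t statistic with n-2 degrees of freedom. We assume this is the kind of test meant here; the text itself does not give the formula, and the function name is ours.

```python
def correlation_t(r, n):
    """t statistic (df = n - 2) for testing whether a correlation is 0."""
    return r * (n - 2) ** 0.5 / (1.0 - r * r) ** 0.5

# The same r = .30 evaluated in samples of increasing size.
for n in (10, 20, 50, 100, 500):
    print("n =", n, "t =", round(correlation_t(0.30, n), 2))
```

The t value grows with n, so a correlation of a given magnitude that is not significant in a small sample can be highly significant in a large one.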

Outliers. Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example. Note that, as shown in that illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).
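The effect of a single outlier is easy to reproduce numerically. In the sketch below, the five data points are hypothetical and were chosen by us so that their correlation is zero; one extreme point is then added.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data with no linear relation between x and y.
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 1, 2]
r_without = pearson_r(x, y)

# The same data plus one extreme observation at (20, 20).
r_with = pearson_r(x + [20], y + [20])
print("without outlier:", round(abs(r_without), 3))
print("with outlier:   ", round(r_with, 3))
```

The correlation moves from essentially zero to a value close to 1 because of one case.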

Note that if the sample size is relatively small, then including or excluding specific data points that are not as clearly "outliers" as the one shown in the previous example may have a profound influence on the regression line (and the correlation coefficient). This is illustrated in the following example where we call the points being excluded "outliers;" one may argue, however, that they are not outliers but rather extreme values.

Typically, we believe that outliers represent a random error that we would like to be able to control. Unfortunately, there is no widely accepted method to remove outliers automatically (however, see the next paragraph), thus what we are left with is to identify any outliers by examining a scatterplot of each important correlation. Needless to say, outliers may not only artificially increase the value of a correlation coefficient, but they can also decrease the value of a "legitimate" correlation.

See also Confidence Ellipse.

Quantitative Approach to Outliers. Some researchers use quantitative methods to exclude outliers. For example, they exclude observations that are outside the range of ±2 standard deviations (or even ±1.5 sd's) around the group or design cell mean. In some areas of research, such "cleaning" of the data is absolutely necessary. For example, in cognitive psychology research on reaction times, even if almost all scores in an experiment are in the range of 300-700 milliseconds, just a few "distracted reactions" of 10-15 seconds will completely change the overall picture. Unfortunately, defining an outlier is subjective (as it should be), and the decisions concerning how to identify them must be made on an individual basis (taking into account specific experimental paradigms and/or "accepted practice" and general research experience in the respective area). It should also be noted that in some rare cases, the relative frequency of outliers across a number of groups or cells of a design can be subjected to analysis and provide interpretable results. For example, outliers could be indicative of the occurrence of a phenomenon that is qualitatively different than the typical pattern observed or expected in the sample, thus the relative frequency of outliers could provide evidence of a relative frequency of departure from the process or phenomenon that is typical for the majority of cases in a group. See also Confidence Ellipse.
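A minimal sketch of the ±2 standard deviation rule follows. The reaction times are hypothetical, and the cutoff of 2 is only one of the conventions mentioned above. Note that an extreme value also inflates the standard deviation used to define the cutoff, which is one reason such rules should not be applied mechanically.

```python
from statistics import mean, stdev

def within_k_sd(values, k=2.0):
    """Keep only values within k standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Hypothetical reaction times in milliseconds, with one "distracted reaction".
reaction_times_ms = [350, 420, 480, 510, 530, 560, 600, 640, 690, 12000]
kept = within_k_sd(reaction_times_ms, k=2.0)
print("mean of all scores: ", round(mean(reaction_times_ms), 1))
print("mean after cleaning:", round(mean(kept), 1))
```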

Correlations in Non-homogeneous Groups. A lack of homogeneity in the sample from which a correlation was calculated can be another factor that biases the value of the correlation. Imagine a case where a correlation coefficient is calculated from data points which came from two different experimental groups but this fact is ignored when the correlation is calculated. Let us assume that the experimental manipulation in one of the groups increased the values of both correlated variables and thus the data from each group form a distinctive "cloud" in the scatterplot (as shown in the graph below).

In such cases, a high correlation may result that is entirely due to the arrangement of the two groups, but which does not represent the "true" relation between the two variables, which may practically be equal to 0 (as could be seen if we looked at each group separately, see the following graph).

If you suspect the influence of such a phenomenon on your correlations and know how to identify such "subsets" of data, try to run the correlations separately in each subset of observations. If you do not know how to identify the hypothetical subsets, try to examine the data with some exploratory multivariate techniques (e.g., Cluster Analysis).
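The following sketch reproduces this situation with hypothetical numbers of our own: within each group the two variables are uncorrelated, but the second group has both variables shifted upward by 10 points.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Group A: no relation between x and y.
xa, ya = [1, 2, 3, 4, 5], [2, 1, 3, 1, 2]
# Group B: the same pattern, shifted by 10 on both variables.
xb, yb = [v + 10 for v in xa], [v + 10 for v in ya]

r_a = pearson_r(xa, ya)
r_b = pearson_r(xb, yb)
r_pooled = pearson_r(xa + xb, ya + yb)
print("pooled r:", round(r_pooled, 3))
print("within-group |r|:", round(abs(r_a), 3), round(abs(r_b), 3))
```

Running the correlation separately in each subset reveals that the high pooled value is produced entirely by the difference between the groups.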

Nonlinear Relations between Variables. Another potential source of problems with the linear (Pearson r) correlation is the shape of the relation. As mentioned before, Pearson r measures a relation between two variables only to the extent to which it is linear; deviations from linearity will increase the total sum of squared distances from the regression line even if they represent a "true" and very close relationship between two variables. The possibility of such non-linear relationships is another reason why examining scatterplots is a necessary step in evaluating every correlation. For example, the following graph demonstrates an extremely strong correlation between the two variables which is not well described by the linear function.

Measuring Nonlinear Relations. What do you do if a correlation is strong but clearly nonlinear (as concluded from examining scatterplots)? Unfortunately, there is no simple answer to this question, because there is no easy-to-use equivalent of Pearson r that is capable of handling nonlinear relations. If the curve is monotonic (continuously decreasing or increasing) you could try to transform one or both of the variables to remove the curvilinearity and then recalculate the correlation. For example, a typical transformation used in such cases is the logarithmic function which will "squeeze" together the values at one end of the range. Another option available if the relation is monotonic is to try a nonparametric correlation (e.g., Spearman R, see Nonparametrics and Distribution Fitting) which is sensitive only to the ordinal arrangement of values, thus, by definition, it ignores monotonic curvilinearity. However, nonparametric correlations are generally less sensitive and sometimes this method will not produce any gains. Unfortunately, the two most precise methods are not easy to use and require a good deal of "experimentation" with the data. Therefore you could:

A. Try to identify the specific function that best describes the curve. After a function has been found, you can test its "goodness-of-fit" to your data.

B. Alternatively, you could experiment with dividing one of the variables into a number of segments (e.g., 4 or 5) of an equal width, treat this new variable as a grouping variable and run an analysis of variance on the data.
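The two simpler options mentioned earlier (a logarithmic transformation and a rank-based correlation) can be sketched as follows. The data are a hypothetical, perfectly monotonic exponential curve; the rank function assumes there are no tied values, and Spearman R is computed here as the Pearson correlation of the ranks.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

# Hypothetical, strongly curved but monotonic relation.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r_raw = pearson_r(x, y)
r_log = pearson_r(x, [math.log(v) for v in y])
r_rank = pearson_r(ranks(x), ranks(y))
print(round(r_raw, 3), round(r_log, 3), round(r_rank, 3))
```

Pearson r on the raw values understates this relation, while both the transformed and the rank-based correlations recover it.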

Exploratory Examination of Correlation Matrices. A common first step of many data analyses that involve more than a very few variables is to run a correlation matrix of all variables and then examine it for expected (and unexpected) significant relations. When this is done, you need to be aware of the general nature of statistical significance (see Elementary Concepts); specifically, if you run many tests (in this case, many correlations), then significant results will be found "surprisingly often" due to pure chance. For example, by definition, a coefficient significant at the .05 level will occur by chance, on average, once in every 20 coefficients. There is no "automatic" way to weed out the "true" correlations. Thus, you should treat all results that were not predicted or planned with particular caution and look for their consistency with other results; ultimately, though, the most conclusive (although costly) control for such a randomness factor is to replicate the study. This issue is general and it pertains to all analyses that involve "multiple comparisons and statistical significance." This problem is also briefly discussed in the context of post-hoc comparisons of means and the Breakdowns option.
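The arithmetic behind "surprisingly often" is simple. The sketch below counts the coefficients in a correlation matrix and the number expected to reach the .05 level by chance alone. The probability of at least one such result is computed under the simplifying assumption that the tests are independent, which is not strictly true for coefficients in the same matrix, so treat it as a rough guide. Function names are ours.

```python
def n_correlations(n_variables):
    """Number of distinct pairs of variables in a correlation matrix."""
    return n_variables * (n_variables - 1) // 2

def expected_false_positives(n_tests, alpha=0.05):
    """Average number of 'significant' results when no true relation exists."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (5, 10, 20):
    tests = n_correlations(k)
    print(k, "variables:", tests, "correlations,",
          round(expected_false_positives(tests), 2), "expected by chance,",
          round(prob_at_least_one(tests), 3), "chance of at least one")
```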

Casewise vs. Pairwise Deletion of Missing Data. The default way of deleting missing data while calculating a correlation matrix is to exclude all cases that have missing data in at least one of the selected variables; that is, by casewise deletion of missing data. Only this way will you get a "true" correlation matrix, where all correlations are obtained from the same set of observations. However, if missing data are randomly distributed across cases, you could easily end up with no "valid" cases in the data set, because each of them will have at least one missing value in some variable. The most common solution used in such instances is to use so-called pairwise deletion of missing data in correlation matrices, where a correlation between each pair of variables is calculated from all cases that have valid data on those two variables. In many instances there is nothing wrong with that method, especially when the total percentage of missing data is low, say 10%, and they are relatively randomly distributed between cases and variables. However, it may sometimes lead to serious problems.

For example, a systematic bias may result from a "hidden" systematic distribution of missing data, causing different correlation coefficients in the same correlation matrix to be based on different subsets of subjects. In addition to the possibly biased conclusions that you could derive from such "pairwise calculated" correlation matrices, real problems may occur when you subject such matrices to another analysis (e.g., multiple regression, factor analysis, or cluster analysis) that expects a "true correlation matrix," with a certain level of consistency and "transitivity" between different coefficients. Thus, if you are using the pairwise method of deleting the missing data, be sure to examine the distribution of missing data across the cells of the matrix for possible systematic "patterns."

How to Identify Biases Caused by Pairwise Deletion of Missing Data. If the pairwise deletion of missing data does not introduce any systematic bias to the correlation matrix, then the descriptive statistics computed for one variable from each pairwise subset of cases should be very similar. However, if they differ, then there are good reasons to suspect a bias. For example, if the mean (or standard deviation) of the values of variable A that were taken into account in calculating its correlation with variable B is much lower than the mean (or standard deviation) of those values of variable A that were used in calculating its correlation with variable C, then we would have good reason to suspect that those two correlations (A-B and A-C) are based on different subsets of data, and thus, that there is a bias in the correlation matrix caused by a non-random distribution of missing data.
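This check can be sketched as follows. The data set is hypothetical and deliberately constructed by us so that B is missing for cases with high values of A; missing values are represented by None, and the function name is ours.

```python
def pairwise_mean(rows, target, partner):
    """Mean of `target` over cases with valid data on both variables."""
    values = [row[target] for row in rows
              if row[target] is not None and row[partner] is not None]
    return sum(values) / len(values)

# Hypothetical cases; None marks a missing value.
rows = [
    {"A": 1, "B": 2, "C": None},
    {"A": 2, "B": 1, "C": None},
    {"A": 3, "B": 4, "C": 3},
    {"A": 8, "B": None, "C": 9},
    {"A": 9, "B": None, "C": 7},
    {"A": 10, "B": None, "C": 10},
]

# Casewise deletion keeps only cases that are complete on all variables.
casewise = [row for row in rows if None not in row.values()]

mean_a_with_b = pairwise_mean(rows, "A", "B")
mean_a_with_c = pairwise_mean(rows, "A", "C")
print("complete cases:", len(casewise))
print("mean of A in the A-B pairs:", mean_a_with_b)
print("mean of A in the A-C pairs:", mean_a_with_c)
```

The large gap between the two means of A signals that the A-B and A-C correlations rest on different subsets of cases.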

Pairwise Deletion of Missing Data vs. Mean Substitution. Another common method to avoid losing data due to casewise deletion is the so-called mean substitution of missing data (replacing all missing data in a variable by the mean of that variable). Mean substitution offers some advantages and some disadvantages as compared to pairwise deletion. Its main advantage is that it produces "internally consistent" sets of results ("true" correlation matrices). The main disadvantages are:

A. Mean substitution artificially decreases the variation of scores, and this decrease in individual variables is proportional to the number of missing data (i.e., the more missing data, the more "perfectly average scores" will be artificially added to the data set).

B. Because it substitutes missing data with artificially created "average" data points, mean substitution may considerably change the values of correlations.
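The first disadvantage is easy to see numerically. In this sketch the scores are hypothetical, None marks a missing value, and the function name is ours.

```python
from statistics import mean, variance

def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

# Hypothetical scores with two missing values.
scores = [10, 12, None, 14, None, 16]
filled = mean_substitute(scores)

var_observed = variance([v for v in scores if v is not None])
var_filled = variance(filled)
print("variance of observed scores:      ", round(var_observed, 3))
print("variance after mean substitution: ", round(var_filled, 3))
```

Each substituted value sits exactly at the mean, so it adds a case without adding any variation.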

Spurious Correlations. Although you cannot prove causal relations based on correlation coefficients (see Elementary Concepts), you can still identify so-called spurious correlations; that is, correlations that are due mostly to the influences of "other" variables. For example, there is a correlation between the total amount of losses in a fire and the number of firemen that were putting out the fire; however, what this correlation does not indicate is that if you call fewer firemen then you would lower the losses. There is a third variable (the initial size of the fire) that influences both the amount of losses and the number of firemen. If you "control" for this variable (e.g., consider only fires of a fixed size), then the correlation will either disappear or perhaps even change its sign. The main problem with spurious correlations is that we typically do not know what the "hidden" agent is. However, in cases when we know where to look, we can use partial correlations that control for (partial out) the influence of specified variables.
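For three variables, a first-order partial correlation can be computed from the three simple correlations using the standard formula shown in the sketch below. The correlation values for the fire example are hypothetical numbers chosen by us for illustration.

```python
def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after partialling out z."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

# Hypothetical simple correlations:
# x = losses, y = number of firemen, z = initial size of the fire.
r_losses_firemen = 0.60
r_losses_size = 0.80
r_firemen_size = 0.75

controlled = partial_r(r_losses_firemen, r_losses_size, r_firemen_size)
print("partial correlation controlling for fire size:", round(abs(controlled), 3))
```

With these numbers, the relation between losses and firemen disappears once the size of the fire is controlled for; with a slightly lower simple correlation (e.g., .50) the partial correlation would become negative.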

Are correlation coefficients "additive?" No, they are not. For example, an average of correlation coefficients in a number of samples does not represent an "average correlation" in all those samples. Because the value of the correlation coefficient is not a linear function of the magnitude of the relation between the variables, correlation coefficients cannot simply be averaged. In cases when you need to average correlations, they first have to be converted into additive measures. For example, before averaging, you can square them to obtain coefficients of determination which are additive (as explained before in this section), or convert them into so-called Fisher z values, which are also additive.
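Averaging through Fisher z values can be sketched as follows. This is an unweighted average; when the samples differ in size, the z values are commonly weighted (for example, by n-3), which this minimal version does not do. The function name is ours.

```python
import math

def average_r_fisher(correlations):
    """Average correlations by converting to Fisher z and back."""
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))

rs = [0.20, 0.90]
naive = sum(rs) / len(rs)
via_fisher = average_r_fisher(rs)
print("simple average of r:", round(naive, 3))
print("average via Fisher z:", round(via_fisher, 3))
```

The two results differ because the r scale is compressed near ±1, which is exactly why raw coefficients should not simply be averaged.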

How to Determine Whether Two Correlation Coefficients are Significant. A test is available that will evaluate the significance of differences between two correlation coefficients in two samples. The outcome of this test depends not only on the size of the raw difference between the two coefficients but also on the size of the samples and on the size of the coefficients themselves. Consistent with the previously discussed principle, the larger the sample size, the smaller the effect that can be proven significant in that sample. In general, due to the fact that the reliability of the correlation coefficient increases with its absolute value, relatively small differences between large correlation coefficients can be significant. For example, a difference of .10 between two correlations may not be significant if the two coefficients are .15 and .25, although in the same sample, the same difference of .10 can be highly significant if the two coefficients are .80 and .90.
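One standard way to carry out such a comparison for two independent samples is based on Fisher z values; we assume, without confirmation from the text, that the test referred to above is of this kind. The sample sizes of 100 are hypothetical, and the function name is ours.

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """z statistic and two-tailed p for the difference between two
    correlations from independent samples (Fisher z method)."""
    z_difference = math.atanh(r1) - math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = z_difference / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# The same raw difference of .10, at two different levels of r.
z_low, p_low = compare_correlations(0.25, 100, 0.15, 100)
z_high, p_high = compare_correlations(0.90, 100, 0.80, 100)
print(".15 vs .25:", round(z_low, 2), round(p_low, 3))
print(".80 vs .90:", round(z_high, 2), round(p_high, 3))
```

With these sample sizes, only the difference between the two large coefficients reaches the .05 level, in line with the example in the text.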

t-test for independent samples

Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different (see also Elementary Concepts). As mentioned before, the normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test (see Nonparametrics and Distribution Fitting).

The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Technically speaking, this is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, the hypothesis is true. Some researchers suggest that if the difference is in the predicted direction, you can consider only one half (one "tail") of the probability distribution and thus divide the standard p-level reported with a t-test (a "two-tailed" probability) by two. Others, however, suggest that you should always report the standard, two-tailed t-test probability.

See also, Student's t Distribution.

Arrangement of Data. In order to perform the t-test for independent samples, one independent (grouping) variable (e.g., Gender: male/female) and at least one dependent variable (e.g., a test score) are required. The means of the dependent variable will be compared between selected groups based on the specified values (e.g., male and female) of the independent variable. The following data set can be analyzed with a t-test comparing the average WCC score in males and females.

          GENDER   WCC
case 1    male     111
case 2    male     110
case 3    male     109
case 4    female   102
case 5    female   104

mean WCC in males = 110
mean WCC in females = 103
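As an illustration, the t statistic for this small data set can be computed by hand. The sketch uses the pooled-variance form of the t-test for independent samples, a standard textbook formula that the text itself does not spell out; the function name is ours.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """t statistic and degrees of freedom for two independent samples,
    using a pooled estimate of the within-group variance."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / standard_error, df

# WCC scores from the data set above.
wcc_males = [111, 110, 109]
wcc_females = [102, 104]

t_value, df = pooled_t(wcc_males, wcc_females)
print("difference in means:", mean(wcc_males) - mean(wcc_females))
print("t =", round(t_value, 2), "with df =", df)
```

The difference of 7 points is large relative to the variation within each group, which is why the t value is large even with only five cases.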

t-test graphs. In the t-test analysis, comparisons of means and measures of variation in the two groups can be visualized in box and whisker plots (for an example, see the graph below).

These graphs help you to quickly evaluate and "intuitively visualize" the strength of the relation between the grouping and the dependent variable.

More Complex Group Comparisons. It often happens in research practice that you need to compare more than two groups (e.g., drug 1, drug 2, and placebo), or compare groups created by more than one independent variable while controlling for the separate influence of each of them (e.g., Gender, type of Drug, and size of Dose). In these cases, you need to analyze the data using Analysis of Variance, which can be considered to be a generalization of the t-test. In fact, for two group comparisons, ANOVA will give results identical to a t-test (t**2 [df] = F[1,df]). However, when the design is more complex, ANOVA offers numerous advantages that t-tests cannot provide (even if you run a series of t-tests comparing various cells of the design).
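The identity between t squared and F for two groups can be verified numerically. The sketch below reuses the five WCC scores from the independent-samples example and computes a one-way ANOVA F ratio from its definition (between-group mean square divided by within-group mean square); function names are ours.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Pooled-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(groups):
    """F ratio of a one-way analysis of variance."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

wcc_males = [111, 110, 109]
wcc_females = [102, 104]
t_value = pooled_t(wcc_males, wcc_females)
f_value = one_way_f([wcc_males, wcc_females])
print("t squared:", round(t_value ** 2, 4))
print("F:        ", round(f_value, 4))
```

The same `one_way_f` function accepts three or more groups, which is the case where a single t-test no longer applies.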

t-test for dependent samples

Within-group Variation. As explained in Elementary Concepts, the size of a relation between two variables, such as the one measured by a difference in means between two groups, depends to a large extent on the differentiation of values within the group. Depending on how differentiated the values are in each group, a given "raw difference" in group means will indicate either a stronger or weaker relationship between the independent (grouping) and dependent variable. For example, if the mean WCC (White Cell Count) was 102 in males and 104 in females, then this difference of "only" 2 points would be extremely important if all values for males fell within a range of 101 to 103, and all scores for females fell within a range of 103 to 105; for example, we would be able to predict WCC pretty well based on gender. However, if the same difference of 2 was obtained from very differentiated scores (e.g., if their range was 0-200), then we would consider the difference entirely negligible. That is to say, reduction of the within-group variation increases the sensitivity of our test.

Purpose. The t-test for dependent samples helps us to take advantage of one specific type of design in which an important source of within-group variation (or so-called, error) can be easily identified and excluded from the analysis. Specifically, if two groups of observations (that are to be compared) are based on the same sample of subjects who were tested twice (e.g., before and after a treatment), then a considerable part of the within-group variation in both groups of scores can be attributed to the initial individual differences between subjects. Note that, in a sense, this fact is not much different than in cases when the two groups are entirely independent (see t-test for independent samples), where individual differences also contribute to the error variance; but in the case of independent samples, we cannot do anything about it because we cannot identify (or "subtract") the variation due to individual differences in subjects. However, if the same sample was tested twice, then we can easily identify (or "subtract") this variation. Specifically, instead of treating each group separately, and analyzing raw scores, we can look only at the differences between the two measures (e.g., "pre-test" and "post-test") in each subject. By subtracting the first score from the second for each subject and then analyzing only those "pure (paired) differences," we will exclude the entire part of the variation in our data set that results from unequal base levels of individual subjects. This is precisely what is being done in the t-test for dependent samples, and, as compared to the t-test for independent samples, it typically produces "better" results (i.e., it is usually more sensitive) when the two measurements of each subject are positively correlated.

Assumptions. The theoretical assumptions of the t-test for independent samples also apply to the dependent samples test; that is, the paired differences should be normally distributed. If these assumptions are clearly not met, then one of the nonparametric alternative tests should be used.

See also, Student's t Distribution.

Arrangement of Data. Technically, we can apply the t-test for dependent samples to any two variables in our data set. However, applying this test will make very little sense if the values of the two variables in the data set are not logically and methodologically comparable. For example, if you compare the average WCC in a sample of patients before and after a treatment, but using a different counting method or different units in the second measurement, then a highly significant t-test value could be obtained due to an artifact; that is, to the change of units of measurement. The following is an example of a data set that can be analyzed using the t-test for dependent samples.

            WCC before    WCC after
case 1        111.9         113
case 2        109           110
case 3        143           144
case 4        101           102
case 5         80            80.9
...           ...           ...

average change between WCC "before" and "after" = 1

The average difference between the two conditions is relatively small (d=1) as compared to the differentiation (range) of the raw scores (from 80 to 143, in the first sample). However, the t-test for dependent samples analysis is performed only on the paired differences, "ignoring" the raw scores and their potential differentiation. Thus, the size of this particular difference of 1 will be compared not to the differentiation of raw scores but to the differentiation of the individual difference scores, which is relatively small: 0.2 (from 0.9 to 1.1). Compared to that variability, the difference of 1 is extremely large and can yield a highly significant t value.
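The contrast between the two analyses can be checked numerically. The following is a minimal sketch, using only the five listed cases and the textbook formulas for the two t statistics; the function names are illustrative and are not taken from any particular package.

```python
import math
import statistics

# WCC values for the five listed cases (the rest of the data set is elided in the text).
before = [111.9, 109, 143, 101, 80]
after = [113, 110, 144, 102, 80.9]

def paired_t(x, y):
    """t for dependent samples: mean paired difference divided by its standard error."""
    diffs = [b - a for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def independent_t(x, y):
    """Pooled-variance t that treats the two columns as unrelated groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(y) - statistics.mean(x)) / se

t_paired = paired_t(before, after)
t_indep = independent_t(before, after)
print(f"paired t = {t_paired:.2f} (df = {len(before) - 1})")
print(f"independent t = {t_indep:.2f} (df = {2 * len(before) - 2})")
```

The same mean difference of 1 gives a very large t when it is compared to the spread of the paired differences, and a negligible t when it is compared to the spread of the raw scores.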

Matrices of t-tests. t-tests for dependent samples can be calculated for long lists of variables, and reviewed in the form of matrices produced with casewise or pairwise deletion of missing data, much like the correlation matrices. Thus, the precautions discussed in the context of correlations also apply to t-test matrices; see:

a. the issue of artifacts caused by the pairwise deletion of missing data in t-tests and

b. the issue of "randomly" significant test values.

More Complex Group Comparisons. If there are more than two "correlated samples" (e.g., before treatment, after treatment 1, and after treatment 2), then analysis of variance with repeated measures should be used. The repeated measures ANOVA can be considered a generalization of the t-test for dependent samples and it offers various features that increase the overall sensitivity of the analysis. For example, it can control not only for the base level of the dependent variable but also, simultaneously, for other factors, and it can include in the design more than one interrelated dependent variable (MANOVA; for additional details refer to ANOVA/MANOVA).


Breakdown: Descriptive Statistics by Groups

Purpose. The breakdowns analysis calculates descriptive statistics and correlations for dependent variables in each of a number of groups defined by one or more grouping (independent) variables.

Arrangement of Data. In the following example data set (spreadsheet), the dependent variable WCC (White Cell Count) can be broken down by 2 independent variables: Gender (values: males and females), and Height (values: tall and short).

          GENDER    HEIGHT    WCC
case 1    male      short     101
case 2    male      tall      110
case 3    male      tall       92
case 4    female    tall      112
case 5    female    short      95
...       ...       ...       ...

The resulting breakdowns might look as follows (we are assuming that Gender was specified as the first independent variable, and Height as the second).

Entire sample: Mean=100, SD=13, N=120
    Males: Mean=99, SD=13, N=60
        Tall/males: Mean=98, SD=13, N=30
        Short/males: Mean=100, SD=13, N=30
    Females: Mean=101, SD=13, N=60
        Tall/females: Mean=101, SD=13, N=30
        Short/females: Mean=101, SD=13, N=30

The composition of the "intermediate" level cells of the "breakdown tree" depends on the order in which independent variables are arranged. For example, in the above example, you see the means for "all males" and "all females" but you do not see the means for "all tall subjects" and "all short subjects" which would have been produced had you specified independent variable Height as the first grouping variable rather than the second.
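The effect of the order of grouping variables can be illustrated with a short sketch. It uses only the five cases listed in the example spreadsheet (so the numbers differ from the N=120 tree), and the `breakdown` helper and `FIELDS` mapping are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# The five listed cases: (gender, height, WCC).
cases = [
    ("male", "short", 101),
    ("male", "tall", 110),
    ("male", "tall", 92),
    ("female", "tall", 112),
    ("female", "short", 95),
]
FIELDS = {"gender": 0, "height": 1}

def breakdown(rows, order):
    """Return {group key: (mean, n)} for every level of the breakdown tree."""
    groups = defaultdict(list)
    for row in rows:
        wcc = row[2]
        # Each case contributes to the entire sample (empty key), to its
        # first-level group, and to each deeper combination, in the given order.
        for depth in range(len(order) + 1):
            key = tuple(row[FIELDS[name]] for name in order[:depth])
            groups[key].append(wcc)
    return {key: (mean(values), len(values)) for key, values in groups.items()}

by_gender_first = breakdown(cases, ["gender", "height"])
by_height_first = breakdown(cases, ["height", "gender"])

for key, (m, n) in sorted(by_gender_first.items()):
    print("/".join(key) or "entire sample", f"mean={m:.1f}", f"n={n}")
```

With Gender first, the intermediate level contains "male" and "female" cells; with Height first, it contains "tall" and "short" cells instead, while the lowest-level cells are the same in both trees.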

Statistical Tests in Breakdowns. Breakdowns are typically used as an exploratory data analysis technique; the typical question that this technique can help answer is very simple: Are the groups created by the independent variables different regarding the dependent variable? If you are interested in differences concerning the means, then the appropriate test is the breakdowns one-way ANOVA (F test). If you are interested in variation differences, then you should test for homogeneity of variances.

Other Related Data Analysis Techniques. Although for exploratory data analysis, breakdowns can use more than one independent variable, the statistical procedures in breakdowns assume the existence of a single grouping factor (even if, in fact, the breakdown results from a combination of a number of grouping variables). Thus, those statistics do not reveal or even take into account any possible interactions between grouping variables in the design. For example, there could be differences between the influence of one independent variable on the dependent variable at different levels of another independent variable (e.g., tall people could have lower WCC than short ones, but only if they are males; see the "tree" data above). You can explore such effects by examining breakdowns "visually," using different orders of independent variables, but the magnitude or significance of such effects cannot be estimated by the breakdown statistics.

Post-Hoc Comparisons of Means. Usually, after obtaining a statistically significant F test from the ANOVA, one wants to know which of the means contributed to the effect (i.e., which groups are particularly different from each other). One could of course perform a series of simple t-tests to compare all possible pairs of means. However, such a procedure would capitalize on chance. This means that the reported probability levels would actually overestimate the statistical significance of mean differences. Without going into too much detail, suppose you took 20 samples of 10 random numbers each, and computed 20 means. Then, take the group (sample) with the highest mean and compare it with that of the lowest mean. The t-test for independent samples will test whether or not those two means are significantly different from each other, provided they were the only two samples taken. Post-hoc comparison techniques on the other hand specifically take into account the fact that more than two samples were taken.
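The thought experiment with 20 samples of 10 random numbers can be simulated directly. The sketch below assumes standard normal random numbers, a pooled-variance t-test, and the tabled two-tailed 5% critical value of t for 18 degrees of freedom (2.101); the number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics

T_CRIT = 2.101  # two-tailed 5% critical value of t for df = 10 + 10 - 2 = 18

def t_independent(x, y):
    """Pooled-variance t statistic for two independent samples."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

def extreme_pair_rejection_rate(trials=500, groups=20, size=10, seed=1):
    """Share of trials in which the highest-mean vs. lowest-mean sample looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # All samples come from the same population, so no true difference exists.
        samples = [[rng.gauss(0, 1) for _ in range(size)] for _ in range(groups)]
        samples.sort(key=statistics.mean)
        if abs(t_independent(samples[-1], samples[0])) > T_CRIT:
            hits += 1
    return hits / trials

rate = extreme_pair_rejection_rate()
print(f"share of 'significant' highest-vs-lowest comparisons: {rate:.2f}")
```

If the t-test were valid for this hand-picked pair, the share would be close to 0.05; the simulated share is far higher, which is the sense in which such comparisons capitalize on chance.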

Breakdowns vs. Discriminant Function Analysis. Breakdowns can be considered as a first step toward another type of analysis that explores differences between groups: Discriminant function analysis. Similar to breakdowns, discriminant function analysis explores the differences between groups created by values (group codes) of an independent (grouping) variable. However, unlike breakdowns, discriminant function analysis simultaneously analyzes more than one dependent variable and it identifies "patterns" of values of those dependent variables. Technically, it determines a linear combination of the dependent variables that best predicts the group membership. For example, discriminant function analysis can be used to analyze differences between three groups of persons who have chosen different professions (e.g., lawyers, physicians, and engineers) in terms of various aspects of their scholastic performance in high school. One could claim that such analysis could "explain" the choice of a profession in terms of specific talents shown in high school; thus discriminant function analysis can be considered to be an "exploratory extension" of simple breakdowns.

Breakdowns vs. Frequency Tables. Another related type of analysis that cannot be directly performed with breakdowns is comparisons of frequencies of cases (n's) between groups. Specifically, often the n's in individual cells are not equal because the assignment of subjects to those groups typically results not from an experimenter's manipulation, but from subjects' pre-existing dispositions. If, in spite of the random selection of the entire sample, the n's are unequal, then it may suggest that the independent variables are related. For example, crosstabulating levels of independent variables Age and Education most likely would not create groups of equal n, because education is distributed differently in different age groups. If you are interested in such comparisons, you can explore specific frequencies in the breakdowns tables, trying different orders of independent variables. However, in order to subject such differences to statistical tests, you should use crosstabulations and frequency tables, Log-Linear Analysis, or Correspondence Analysis (for more advanced analyses on multi-way frequency tables).

Graphical breakdowns. Graphs can often identify effects (both expected and unexpected) in the data more quickly and sometimes "better" than any other data analysis method. Categorized graphs allow you to plot the means, distributions, correlations, etc. across the groups of a given table (e.g., categorized histograms, categorized probability plots, categorized box and whisker plots). The graph below shows a categorized histogram which enables you to quickly evaluate and visualize the shape of the data for each group (group1-female, group2-female, etc.).

The categorized scatterplot (in the graph below) shows the differences between patterns of correlations between dependent variables across the groups.

Additionally, if the software has a brushing facility which supports animated brushing, you can select (i.e., highlight) in a matrix scatterplot all data points that belong to a certain category in order to examine how those specific observations contribute to relations between other variables in the same data set.


Frequency tables

Purpose. Frequency or one-way tables represent the simplest method for analyzing categorical (nominal) data (refer to Elementary Concepts). They are often used as one of the exploratory procedures to review how different categories of values are distributed in the sample. For example, in a survey of spectator interest in different sports, we could summarize the respondents' interest in watching football in a frequency table as follows:

STATISTICA BASIC STATS
FOOTBALL: "Watching football"

Category                          Count   Cumulative   Percent     Cumulative
                                          Count                    Percent
ALWAYS  : Always interested         39        39       39.00000     39.0000
USUALLY : Usually interested        16        55       16.00000     55.0000
SOMETIMS: Sometimes interested      26        81       26.00000     81.0000
NEVER   : Never interested          19       100       19.00000    100.0000
Missing                              0       100        0.00000    100.0000

The table above shows the number, proportion, and cumulative proportion of respondents who characterized their interest in watching football as either (1) Always interested, (2) Usually interested, (3) Sometimes interested, or (4) Never interested.
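The columns of such a table are simple running sums and ratios. A minimal sketch, using the counts from the football example (the `frequency_table` helper is a hypothetical name):

```python
from itertools import accumulate

# Counts from the "Watching football" example, in category order.
counts = {"ALWAYS": 39, "USUALLY": 16, "SOMETIMS": 26, "NEVER": 19}

def frequency_table(counts):
    """Return rows of (category, count, cumulative count, percent, cumulative percent)."""
    total = sum(counts.values())
    running = accumulate(counts.values())
    return [
        (category, count, cumulative, 100 * count / total, 100 * cumulative / total)
        for (category, count), cumulative in zip(counts.items(), running)
    ]

table = frequency_table(counts)
for category, count, cum_count, pct, cum_pct in table:
    print(f"{category:<10}{count:>6}{cum_count:>6}{pct:>9.2f}{cum_pct:>9.2f}")
```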

Applications. In practically every research project, a first "look" at the data usually includes frequency tables. For example, in survey research, frequency tables can show the number of males and females who participated in the survey, the number of respondents from particular ethnic and racial backgrounds, and so on. Responses on some labeled attitude measurement scales (e.g., interest in watching football) can also be nicely summarized via the frequency table. In medical research, one may tabulate the number of patients displaying specific symptoms; in industrial research one may tabulate the frequency of different causes leading to catastrophic failure of products during stress tests (e.g., which parts are actually responsible for the complete malfunction of television sets under extreme temperatures?). Customarily, if a data set includes any categorical data, then one of the first steps in the data analysis is to compute a frequency table for those categorical variables.


Crosstabulation and stub-and-banner tables

Purpose and Arrangement of Table. Crosstabulation is a combination of two (or more) frequency tables arranged such that each cell in the resulting table represents a unique combination of specific values of crosstabulated variables. Thus, crosstabulation allows us to examine frequencies of observations that belong to specific categories on more than one variable. By examining these frequencies, we can identify relations between crosstabulated variables. Only categorical (nominal) variables or variables with a relatively small number of different meaningful values should be crosstabulated. Note that in the cases where we do want to include a continuous variable in a crosstabulation (e.g., income), we can first recode it into a particular number of distinct ranges (e.g., low, medium, high).

2x2 Table. The simplest form of crosstabulation is the 2 by 2 table where two variables are "crossed," and each variable has only two distinct values. For example, suppose we conduct a simple study in which males and females are asked to choose one of two different brands of soda pop (brand A and brand B); the data file can be arranged like this:

          GENDER    SODA
case 1    MALE      A
case 2    FEMALE    B
case 3    FEMALE    B
case 4    FEMALE    A
case 5    MALE      B
...       ...       ...

The resulting crosstabulation could look as follows.

                  SODA: A     SODA: B     Total
GENDER: MALE      20 (40%)    30 (60%)     50 (50%)
GENDER: FEMALE    30 (60%)    20 (40%)     50 (50%)
Total             50 (50%)    50 (50%)    100 (100%)

Each cell represents a unique combination of values of the two crosstabulated variables (row variable Gender and column variable Soda), and the numbers in each cell tell us how many observations fall into each combination of values. In general, this table shows us that more females than males chose the soda pop brand A, and that more males than females chose soda B. Thus, gender and preference for a particular brand of soda may be related (later we will see how this relationship can be measured).

Marginal Frequencies. The values in the margins of the table are simply one-way (frequency) tables for all values in the table. They are important in that they help us to evaluate the arrangement of frequencies in individual columns or rows. For example, the frequencies of 40% and 60% of males and females (respectively) who chose soda A (see the first column of the above table) would not indicate any relationship between Gender and Soda if the marginal frequencies for Gender were also 40% and 60%; in that case they would simply reflect the different proportions of males and females in the study. Thus, the differences between the distributions of frequencies in individual rows (or columns) and in the respective margins inform us about the relationship between the crosstabulated variables.

Column, Row, and Total Percentages. The example in the previous paragraph demonstrates that in order to evaluate relationships between crosstabulated variables, we need to compare the proportions of marginal and individual column or row frequencies. Such comparisons are easiest to perform when the frequencies are presented as percentages.
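The three kinds of percentages can be computed from the cell counts and the margins. The following sketch uses the soda example; the dictionary layout is just one convenient way to hold a two-way table.

```python
# Cell counts from the Gender by Soda example.
counts = {
    ("MALE", "A"): 20, ("MALE", "B"): 30,
    ("FEMALE", "A"): 30, ("FEMALE", "B"): 20,
}
rows = ["MALE", "FEMALE"]
cols = ["A", "B"]

# Marginal frequencies: one-way tables for the row and column variables.
row_totals = {r: sum(counts[r, c] for c in cols) for r in rows}
col_totals = {c: sum(counts[r, c] for r in rows) for c in cols}
n = sum(counts.values())

# Each cell expressed relative to its row, its column, and the whole table.
row_pct = {(r, c): 100 * counts[r, c] / row_totals[r] for r in rows for c in cols}
col_pct = {(r, c): 100 * counts[r, c] / col_totals[c] for r in rows for c in cols}
total_pct = {(r, c): 100 * counts[r, c] / n for r in rows for c in cols}

for r in rows:
    print(r, [f"{row_pct[r, c]:.0f}% of row" for c in cols],
          f"row margin {100 * row_totals[r] / n:.0f}%")
```

Here 40% of males chose brand A while the column margin for brand A is 50%; it is this kind of discrepancy between a row and the corresponding margin that suggests a relationship.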

Graphical Representations of Crosstabulations. For analytic purposes, the individual rows or columns of a table can be represented as column graphs. However, often it is useful to visualize the entire table in a single graph. A two-way table can be visualized in a 3-dimensional histogram; alternatively, a categorized histogram can be produced, where one variable is represented by individual histograms which are drawn at each level (category) of the other variable in the crosstabulation. The advantage of the 3D histogram is that it produces an integrated picture of the entire table; the advantage of the categorized graph is that it allows us to precisely evaluate specific frequencies in each cell of the table.

Stub-and-Banner Tables. Stub-and-Banner tables, or Banners for short, are a way to display several two-way tables in a compressed form. This type of table is most easily explained with an example. Let us return to the survey of sports spectators example. (Note that, in order to simplify matters, only the response categories Always and Usually were tabulated in the table below.)

STATISTICA BASIC STATS
Stub-and-Banner Table: Row Percent

Factor               FOOTBALL    FOOTBALL    Row
                     ALWAYS      USUALLY     Total
BASEBALL: ALWAYS       92.31        7.69      66.67
BASEBALL: USUALLY      61.54       38.46      33.33
BASEBALL: Total        82.05       17.95     100.00
TENNIS: ALWAYS         87.50       12.50      66.67
TENNIS: USUALLY        87.50       12.50      33.33
TENNIS: Total          87.50       12.50     100.00
BOXING: ALWAYS         77.78       22.22      52.94
BOXING: USUALLY       100.00        0.00      47.06
BOXING: Total          88.24       11.76     100.00

Interpreting the Banner Table. In the table above, we see the two-way tables of expressed interest in Football by expressed interest in Baseball, Tennis, and Boxing. The table entries represent percentages of rows, so that the percentages across columns will add up to 100 percent. For example, the number in the upper left-hand corner of the table (92.31) shows that 92.31 percent of all respondents who said they were always interested in watching baseball also said that they were always interested in watching football. Further down we can see that the percent of those always interested in watching tennis who were also always interested in watching football was 87.50 percent; for boxing this number is 77.78 percent. The percentages in the last column (Row Total) are always relative to the total number of cases.

Multi-way Tables with Control Variables. When only two variables are crosstabulated, we call the resulting table a two-way table. However, the general idea of crosstabulating values of variables can be generalized to more than just two variables. For example, to return to the "soda" example presented earlier (see above), a third variable could be added to the data set. This variable might contain information about the state in which the study was conducted (either Nebraska or New York).

          GENDER    SODA    STATE
case 1    MALE      A       NEBRASKA
case 2    FEMALE    B       NEW YORK
case 3    FEMALE    B       NEBRASKA
case 4    FEMALE    A       NEBRASKA
case 5    MALE      B       NEW YORK
...       ...       ...     ...

The crosstabulation of these variables would result in a 3-way table:

                 STATE: NEW YORK                STATE: NEBRASKA
             SODA: A   SODA: B   Total      SODA: A   SODA: B   Total
G:MALE          20        30       50           5        45       50
G:FEMALE        30        20       50          45         5       50
Total           50        50      100          50        50      100

Theoretically, an unlimited number of variables can be crosstabulated in a single multi-way table. However, research practice shows that it is usually difficult to examine and "understand" tables that involve more than 4 variables. It is recommended to analyze relationships between the factors in such tables using modeling techniques such as Log-Linear Analysis or Correspondence Analysis.

Graphical Representations of Multi-way Tables. You can produce "double categorized" histograms, 3D histograms, or line-plots that will summarize the frequencies for up to 3 factors in a single graph.

Batches (cascades) of graphs can be used to summarize higher-way tables (as shown in the graph below).

Statistics in Crosstabulation Tables

• General Introduction

• Pearson Chi-square

• Maximum-Likelihood (M-L) Chi-square

• Yates' correction

• Fisher exact test

• McNemar Chi-square

• Coefficient Phi

• Tetrachoric correlation

• Coefficient of contingency (C)

• Interpretation of contingency measures

• Statistics Based on Ranks

• Spearman R

• Kendall tau

• Somers' d: d(X|Y), d(Y|X)

• Gamma

• Uncertainty Coefficients: S(X,Y), S(X|Y), S(Y|X)

General Introduction. Crosstabulations generally allow us to identify relationships between the crosstabulated variables. The following table illustrates an example of a very strong relationship between two variables: variable Age (Adult vs. Child) and variable Cookie preference (A vs. B).

               COOKIE: A    COOKIE: B    Total
AGE: ADULT        50            0          50
AGE: CHILD         0           50          50
Total             50           50         100

All adults chose cookie A, while all children chose cookie B. In this case there is little doubt about the reliability of the finding, because it is hardly conceivable that one would obtain such a pattern of frequencies by chance alone; that is, without the existence of a "true" difference between the cookie preferences of adults and children. However, in real life, relations between variables are typically much weaker, and thus the question arises as to how to measure those relationships, and how to evaluate their reliability (statistical significance). The following review includes the most common measures of relationships between two categorical variables; that is, measures for two-way tables. The techniques used to analyze simultaneous relations between more than two variables in higher order crosstabulations are discussed in the context of Log-Linear Analysis and Correspondence Analysis.

Pearson Chi-square. The Pearson Chi-square is the most common test for significance of the relationship between categorical variables. This measure is based on the fact that we can compute the expected frequencies in a two-way table (i.e., frequencies that we would expect if there was no relationship between the variables). For example, suppose we ask 20 males and 20 females to choose between two brands of soda pop (brands A and B). If there is no relationship between preference and gender, then we would expect males and females to choose each brand in about the same proportions (e.g., if half of all respondents chose brand A, we would expect about 10 males and 10 females to do so). The Chi-square test becomes increasingly significant as the numbers deviate further from this expected pattern; that is, the more the pattern of choices differs between males and females.

The value of the Chi-square and its significance level depend on the overall number of observations and the number of cells in the table. Consistent with the principles discussed in Elementary Concepts, relatively small deviations of the relative frequencies across cells from the expected pattern will prove significant if the number of observations is large.

The only assumption underlying the use of the Chi-square (other than random selection of the sample) is that the expected frequencies are not very small. The reason for this is that, actually, the Chi-square inherently tests the underlying probabilities in each cell; and when the expected cell frequencies fall, for example, below 5, those probabilities cannot be estimated with sufficient precision. For further discussion of this issue refer to Everitt (1977), Hays (1988), or Kendall and Stuart (1979).
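The computation can be sketched for the Gender by Soda table shown earlier (20/30 and 30/20). Expected frequencies are obtained as row total times column total divided by N, and the statistic sums (observed - expected)^2 / expected over all cells. The p-value helper below is valid only for 1 degree of freedom (a 2 x 2 table) and relies on the relation between the Chi-square distribution with 1 degree of freedom and the normal distribution; the function names are illustrative.

```python
import math

def pearson_chi_square(table):
    """Return (chi-square, expected frequencies) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # Frequency expected if the row and column variables were unrelated.
            e = row_totals[i] * col_totals[j] / n
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected

def p_value_df1(chi2):
    """Upper-tail probability of a Chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

soda = [[20, 30], [30, 20]]
chi2, expected = pearson_chi_square(soda)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, smallest expected = {min(map(min, expected))}")
```

All expected frequencies are 25 here, well above the small values for which the text advises caution.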

Maximum-Likelihood Chi-square. The Maximum-Likelihood Chi-square tests the same hypothesis as the Pearson Chi-square statistic; however, its computation is based on Maximum-Likelihood theory. In practice, the M-L Chi-square is usually very close in magnitude to the Pearson Chi-square statistic. For more details about this statistic refer to Bishop, Fienberg, and Holland (1975), or Fienberg (1977); the Log-Linear Analysis chapter of the manual also discusses this statistic in greater detail.
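As an illustration, the sketch below uses the usual likelihood-ratio form of this statistic, 2 * sum(observed * ln(observed / expected)), which is an assumption about the exact formula intended here. It is applied to the Gender by Soda counts (20/30 and 30/20), for which the Pearson Chi-square equals 4.0.

```python
import math

def ml_chi_square(table):
    """Likelihood-ratio (M-L) Chi-square for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2 * total

g2 = ml_chi_square([[20, 30], [30, 20]])
print(f"M-L chi-square = {g2:.3f}")
```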

Yates Correction. The approximation of the Chi-square statistic in small 2 x 2 tables can be improved by reducing the absolute value of differences between expected and observed frequencies by 0.5 before squaring (Yates' correction). This correction, which makes the estimation more conservative, is usually applied when the table contains only small observed frequencies, so that some expected frequencies become less than 10 (for further discussion of this correction, see Conover, 1974; Everitt, 1977; Hays, 1988; Kendall & Stuart, 1979; and Mantel, 1974).
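A minimal sketch of the correction as described, reducing each absolute difference by 0.5 before squaring. It is applied to the Gender by Soda counts purely for comparison with the uncorrected value of 4.0; that table is not small, so the correction would not normally be needed there. Some implementations additionally prevent the corrected difference from going below zero, which this sketch does not do.

```python
def yates_chi_square(table):
    """Chi-square for a 2 x 2 table with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Shrink the absolute deviation by 0.5 before squaring.
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2

corrected = yates_chi_square([[20, 30], [30, 20]])
print(f"Yates-corrected chi-square = {corrected:.2f}")
```

The corrected value (3.24) is smaller than the uncorrected one, which is the sense in which the correction is more conservative.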

Fisher Exact Test. This test is only available for 2x2 tables; it is based on the following rationale: Given the marginal frequencies in the table, and assuming that in the population the two factors in the table are not related, how likely is it to obtain cell frequencies as uneven or worse than the ones that were observed? For small n, this probability can be computed exactly by counting all possible tables that can be constructed based on the marginal frequencies. Thus, the Fisher exact test computes the exact probability under the null hypothesis of obtaining the current distribution of frequencies across cells, or one that is more uneven.
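The counting argument can be sketched with the hypergeometric distribution: with the margins fixed, the upper-left cell determines the whole table. The 2 x 2 counts used below are hypothetical, and the two-sided rule (summing all tables no more probable than the observed one) is one common convention that is assumed here rather than taken from the text.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one-sided, two-sided) exact probabilities for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Probability of k in the upper-left cell, given the fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # One-sided: the observed table or any table with a larger upper-left cell.
    one_sided = sum(prob(k) for k in range(a, hi + 1))
    # Two-sided: every table that is no more probable than the observed one.
    two_sided = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return one_sided, two_sided

# Hypothetical small table: rows are groups, columns are outcomes.
p_one, p_two = fisher_exact_2x2(3, 1, 1, 3)
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```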

McNemar Chi-square. This test is applicable in situations where the frequencies in the 2 x 2 table represent dependent samples. For example, in a before-after design study, we may count the number of students who fail a test of minimal math skills at the beginning of the semester and at the end of the semester. Two Chi-square values are reported: A/D and B/C. The Chi-square A/D tests the hypothesis that the frequencies in cells A and D (upper left, lower right) are identical. The Chi-square B/C tests the hypothesis that the frequencies in cells B and C (upper right, lower left) are identical.
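A minimal sketch of the two tests, assuming the basic form (x - y)^2 / (x + y) with 1 degree of freedom for the hypothesis that two cell frequencies are identical. Some programs apply a continuity correction, which is omitted here, and the cell counts are hypothetical.

```python
def mcnemar_chi_square(x, y):
    """Chi-square (df = 1) for the hypothesis that two cell frequencies are equal."""
    return (x - y) ** 2 / (x + y)

# Hypothetical before/after table of counts:
#            after: col 1   after: col 2
# before 1        A = 15         B = 5
# before 2        C = 20         D = 10
A, B, C, D = 15, 5, 20, 10

chi_ad = mcnemar_chi_square(A, D)  # cells on the main diagonal
chi_bc = mcnemar_chi_square(B, C)  # cells off the diagonal (cases that changed)
print(f"Chi-square A/D = {chi_ad:.2f}, Chi-square B/C = {chi_bc:.2f}")
```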

Coefficient Phi. The Phi-square is a measure of correlation between two categorical variables in a 2 x 2 table. Its value can range from 0 (no relation between factors; Chi-square=0.0) to 1 (perfect relation between the two factors in the table). For more details concerning this statistic see Castellan and Siegel (1988, p. 232).

Tetrachoric Correlation. This statistic is also only computed for (applicable to) 2 x 2 tables. If the 2 x 2 table can be thought of as the result of two continuous variables that were (artificially) forced into two categories each, then the tetrachoric correlation coefficient will estimate the correlation between the two.

Coefficient of Contingency. The coefficient of contingency is a Chi-square based measure of the relation between two categorical variables (proposed by Pearson, the originator of the Chi-square test). Its advantage over the ordinary Chi-square is that it is more easily interpreted, since its range is always limited to 0 through 1 (where 0 means complete independence). The disadvantage of this statistic is that its specific upper limit is "limited" by the size of the table; C can reach the limit of 1 only if the number of categories is unlimited (see Siegel, 1956, p. 201).
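The Phi-square and the coefficient of contingency can both be sketched from the Pearson Chi-square, assuming the standard definitions Phi-square = Chi-square / N and C = sqrt(Chi-square / (Chi-square + N)). The perfect relation in the Age by Cookie table shows how the upper limit of C falls short of 1 in a 2 x 2 table.

```python
import math

def chi_square_and_n(table):
    """Pearson Chi-square and total N for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (observed - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, observed in enumerate(row)
    )
    return chi2, n

def phi_square(table):
    chi2, n = chi_square_and_n(table)
    return chi2 / n

def contingency_c(table):
    chi2, n = chi_square_and_n(table)
    return math.sqrt(chi2 / (chi2 + n))

cookie = [[50, 0], [0, 50]]   # perfect relation (Age by Cookie example)
soda = [[20, 30], [30, 20]]   # weaker relation (Gender by Soda example)
for name, table in (("cookie", cookie), ("soda", soda)):
    print(f"{name}: Phi-square = {phi_square(table):.3f}, C = {contingency_c(table):.3f}")
```

For the perfect relation, Phi-square reaches 1 while C is only about 0.707.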

Interpretation of Contingency Measures. An important disadvantage of measures of contingency (reviewed above) is that they do not lend themselves to clear interpretations in terms of probability or "proportion of variance," as is the case, for example, of the Pearson r (see Correlations). There is no commonly accepted measure of relation between categories that has such a clear interpretation.

Statistics Based on Ranks. In many cases the categories used in the crosstabulation contain meaningful rank-ordering information; that is, they measure some characteristic on an ordinal scale (see Elementary Concepts). Suppose we asked a sample of respondents to indicate their interest in watching different sports on a 4-point scale with the explicit labels (1) always, (2) usually, (3) sometimes, and (4) never interested. Obviously, we can assume that the response sometimes interested is indicative of less interest than always interested, and so on. Thus, we could rank the respondents with regard to their expressed interest in, for example, watching football. When categorical variables can be interpreted in this manner, there are several additional indices that can be computed to express the relationship between variables.

Spearman R. Spearman R can be thought of as the regular Pearson product-moment correlation coefficient (Pearson r); that is, in terms of the proportion of variability accounted for, except that Spearman R is computed from ranks. As mentioned above, Spearman R assumes that the variables under consideration were measured on at least an ordinal (rank order) scale; that is, the individual observations (cases) can be ranked into two ordered series. Detailed discussions of the Spearman R statistic, its power and efficiency can be found in Gibbons (1985), Hays (1981), McNemar (1969), Siegel (1956), Siegel and Castellan (1988), Kendall (1948), Olds (1949), or Hotelling and Pabst (1936).

Kendall tau. Kendall tau is equivalent to the Spearman R statistic with regard to the underlying assumptions. It is also comparable in terms of its statistical power. However, Spearman R and Kendall tau are usually not identical in magnitude because their underlying logic, as well as their computational formulas are very different. Siegel and Castellan (1988) express the relationship of the two measures in terms of the inequality:

-1 <= 3 * Kendall tau - 2 * Spearman R <= 1

More importantly, Kendall tau and Spearman R imply different interpretations: While Spearman R can be thought of as the regular Pearson product-moment correlation coefficient as computed from ranks, Kendall tau rather represents a probability. Specifically, it is the difference between the probability that the observed data are in the same order for the two variables versus the probability that the observed data are in different orders for the two variables. Kendall (1948, 1975), Everitt (1977), and Siegel and Castellan (1988) discuss Kendall tau in greater detail. Two different variants of tau are computed, usually called tau-b and tau-c. These measures differ only with regard to how tied ranks are handled. In most cases these values will be fairly similar, and when discrepancies occur, it is probably always safest to interpret the lowest value.
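The difference between the two coefficients can be seen on a small set of ranks. The sketch below uses hypothetical ranks without ties, the rank-difference formula for Spearman R, and the simplest form of Kendall tau (concordant minus discordant pairs, divided by the number of pairs); the tie-adjusted variants tau-b and tau-c are not implemented.

```python
from itertools import combinations

def spearman_r(x_ranks, y_ranks):
    """Spearman R from rank differences (valid when there are no ties)."""
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """(concordant - discordant) / number of pairs, assuming no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical ranks of five cases on two variables.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

rs = spearman_r(x, y)
tau = kendall_tau(x, y)
print(f"Spearman R = {rs:.2f}, Kendall tau = {tau:.2f}, 3*tau - 2*R = {3 * tau - 2 * rs:.2f}")
```

For these ranks, 8 of the 10 pairs are in the same order and 2 are in opposite order, so tau = 0.8 - 0.2 = 0.6, while Spearman R is 0.8; the two values differ in magnitude but satisfy the inequality given above.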

Somers' d: d(X|Y), d(Y|X). Somers' d is an asymmetric measure of association related to tau-b (see Siegel & Castellan, 1988, p. 303-310).

Gamma. The Gamma statistic is preferable to Spearman R or Kendall tau when the data contain many tied observations. In terms of the underlying assumptions, Gamma is equivalent to Spearman R or Kendall tau; in terms of its interpretation and computation, it is more similar to Kendall tau than Spearman R. In short, Gamma is also a probability; specifically, it is computed as the difference between the probability that the rank orderings of the two variables agree and the probability that they disagree, divided by 1 minus the probability of ties. Thus, Gamma is basically equivalent to Kendall tau, except that ties are explicitly taken into account. Detailed discussions of the Gamma statistic can be found in Goodman and Kruskal (1954, 1959, 1963, 1972), Siegel (1956), and Siegel and Castellan (1988).
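The definition can be sketched by counting pairs of cases: pairs ordered the same way on both variables (agreements), pairs ordered in opposite ways (disagreements), and pairs tied on at least one variable. The ordinal scores below are hypothetical.

```python
from itertools import combinations

def gamma_statistic(x, y):
    """Return (Gamma, agreeing pairs, disagreeing pairs, tied pairs)."""
    agree = disagree = tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            agree += 1
        elif product < 0:
            disagree += 1
        else:
            tied += 1  # tied on x, on y, or on both
    pairs = agree + disagree + tied
    # (P(agree) - P(disagree)) / (1 - P(tie)), which simplifies to
    # (agree - disagree) / (agree + disagree).
    gamma = (agree / pairs - disagree / pairs) / (1 - tied / pairs)
    return gamma, agree, disagree, tied

# Hypothetical ordinal scores for six cases, with many ties.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 1, 2, 2, 3]

gamma, agree, disagree, tied = gamma_statistic(x, y)
print(f"agree = {agree}, disagree = {disagree}, tied = {tied}, Gamma = {gamma:.2f}")
```

Of the 15 pairs, 7 are tied; leaving them out of the denominator gives Gamma = 0.75, whereas dividing by all pairs would give 0.40.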

Uncertainty Coefficients. These are indices of stochastic dependence; the concept of stochastic dependence is derived from the information theory approach to the analysis of frequency tables and the user should refer to the appropriate references (see Kullback, 1959; Ku & Kullback, 1968; Ku, Varner, & Kullback, 1971; see also Bishop, Fienberg, & Holland, 1975, p. 344-348). S(X,Y) refers to symmetrical dependence, S(X|Y) and S(Y|X) refer to asymmetrical dependence.

Multiple Responses/Dichotomies. Multiple response variables or multiple dichotomies often arise when summarizing survey data. The nature of such variables or factors in a table is best illustrated with examples.

• Multiple Response Variables

• Multiple Dichotomies

• Crosstabulation of Multiple Responses/Dichotomies

• Paired Crosstabulation of Multiple Response Variables

• A Final Comment

Multiple Response Variables. As part of a larger market survey, suppose you asked a sample of consumers to name their three favorite soft drinks. The specific item on the questionnaire may look like this:

Write down your three favorite soft drinks:

1:__________ 2:__________ 3:__________

Thus, the questionnaires returned to you will contain somewhere between 0 and 3 answers to this item. Also, a wide variety of soft drinks will most likely be named. Your goal is to summarize the responses to this item; that is, to produce a table that summarizes the percent of respondents who mentioned a respective soft drink.

The next question is how to enter the responses into a data file. Suppose 50 different soft drinks were mentioned among all of the questionnaires. You could of course set up 50 variables - one for each soft drink - and then enter a 1 for the respective respondent and variable (soft drink), if he or she mentioned the respective soft drink (and a 0 if not); for example:

          COKE    PEPSI   SPRITE   . . . .
case 1      0       1       0
case 2      1       1       0
case 3      0       0       1
...        ...     ...     ...

This method of coding the responses would be very tedious and "wasteful." Note that each respondent can only give a maximum of three responses; yet we use 50 variables to code those responses. (However, if we are only interested in these three soft drinks, then this method of coding just those three variables would be satisfactory; to tabulate soft drink preferences, we could then treat the three variables as a multiple dichotomy; see below.)

Coding multiple response variables. Alternatively, we could set up three variables, and a coding scheme for the 50 soft drinks. Then we could enter the respective codes (or alpha labels) into the three variables, in the same way that respondents wrote them down in the questionnaire.

          Resp. 1    Resp. 2     Resp. 3
case 1    COKE       PEPSI       JOLT
case 2    SPRITE     SNAPPLE     DR. PEPPER
case 3    PERRIER    GATORADE    MOUNTAIN DEW
. . .     . . .      . . .       . . .

To produce a table of the number of respondents by soft drink, we would now treat Resp. 1 to Resp. 3 as a multiple response variable. That table could look like this:

N=500
                                    Prcnt. of    Prcnt. of
Category                  Count     Responses    Cases
COKE: Coca Cola              44        5.23         8.80
PEPSI: Pepsi Cola            43        5.11         8.60
MOUNTAIN: Mountain Dew       81        9.62        16.20
PEPPER: Doctor Pepper        74        8.79        14.80
. . . : . . . .              ..         ...          ...
                            842      100.00       168.40

Interpreting the multiple response frequency table. The total number of respondents was n=500. Note that the counts in the first column of the table do not add up to 500, but rather to 842. That is the total number of responses; since each respondent could make up to 3 responses (write down three names of soft drinks), the total number of responses is naturally greater than the number of respondents. For example, referring back to the sample listing of the data file shown above, the first case (Coke, Pepsi, Jolt) "contributes" three times to the frequency table, once to the category Coke, once to the category Pepsi, and once to the category Jolt. The second and third columns in the table above report the percentages relative to the number of responses (second column) as well as respondents (third column). Thus, the entry 8.80 in the first row and last column in the table above means that 8.8% of all respondents mentioned Coke either as their first, second, or third soft drink preference.
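
As a rough sketch of how such a table is built, the following Python code counts every non-missing response and reports each count as a percentage of all responses and of all cases. The four respondents and the function name are hypothetical and serve only to show the counting rule.

```python
from collections import Counter

# Hypothetical data: up to three soft drinks per respondent (None = no answer)
responses = [
    ("COKE", "PEPSI", "JOLT"),
    ("SPRITE", "SNAPPLE", "DR. PEPPER"),
    ("PERRIER", "GATORADE", "MOUNTAIN DEW"),
    ("COKE", None, None),
]

def multiple_response_table(rows):
    """Return {category: (count, percent of responses, percent of cases)}."""
    n_cases = len(rows)
    # Each non-missing answer contributes once, whichever position it was written in
    counts = Counter(drink for row in rows for drink in row if drink is not None)
    n_responses = sum(counts.values())
    return {
        drink: (count, 100 * count / n_responses, 100 * count / n_cases)
        for drink, count in counts.items()
    }

table = multiple_response_table(responses)
for drink, (count, pct_resp, pct_cases) in sorted(table.items()):
    print(f"{drink:<14}{count:>4}{pct_resp:>8.2f}{pct_cases:>8.2f}")
```

Because a respondent can contribute several responses, the percentages of cases can add up to more than 100, just as they do in the table above.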

Multiple Dichotomies. Suppose in the above example we were only interested in Coke, Pepsi, and Sprite. As pointed out earlier, one way to code the data in that case would be as follows:

          COKE    PEPSI   SPRITE   . . . .
case 1              1
case 2      1       1
case 3                      1
. . .     . . .   . . .   . . .

In other words, one variable was created for each soft drink, then a value of 1 was entered into the respective variable whenever the respective drink was mentioned by the respective respondent. Note that each variable represents a dichotomy; that is, only "1"s and "not 1"s are allowed (we could have entered 1's and 0's, but to save typing we can also simply leave the 0's blank or missing). When tabulating these variables, we would like to obtain a summary table very similar to the one shown earlier for multiple response variables; that is, we would like to compute the number and percent of respondents (and responses) for each soft drink. In a sense, we "compact" the three variables Coke, Pepsi, and Sprite into a single variable (Soft Drink) consisting of multiple dichotomies.
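
The "compacting" of several dichotomies into one summary can be sketched as follows. The three cases, the helper name, and the convention that a missing key stands for a blank entry are assumptions made for this illustration.

```python
# Hypothetical cases coded as dichotomies; a missing key stands for a blank (not mentioned)
cases = [
    {"PEPSI": 1},
    {"COKE": 1, "PEPSI": 1},
    {"SPRITE": 1},
]
drinks = ["COKE", "PEPSI", "SPRITE"]

def dichotomy_table(rows, variables, counted_value=1):
    """Return {variable: (count, percent of responses, percent of cases)}."""
    n_cases = len(rows)
    # A variable counts for a case only if it holds the counted value (here 1)
    counts = {
        v: sum(1 for row in rows if row.get(v) == counted_value) for v in variables
    }
    n_responses = sum(counts.values())
    if n_responses == 0:
        raise ValueError("no responses to tabulate")
    return {
        v: (c, 100 * c / n_responses, 100 * c / n_cases) for v, c in counts.items()
    }

soft_drink = dichotomy_table(cases, drinks)
for drink, (count, pct_resp, pct_cases) in soft_drink.items():
    print(f"{drink:<8}{count:>4}{pct_resp:>8.2f}{pct_cases:>8.2f}")
```

The result has the same three columns as the multiple response frequency table, which is why the two coding schemes can be summarized in the same way.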

Crosstabulation of Multiple Responses/Dichotomies. All of these types of variables can then be used in crosstabulation tables. For example, we could crosstabulate a multiple dichotomy for Soft Drink (coded as described in the previous paragraph) with a multiple response variable Favorite Fast Foods (with many categories such as Hamburgers, Pizza, etc.), by the simple categorical variable Gender. As in the frequency table, the percentages and marginal totals in that table can be computed from the total number of respondents as well as the total number of responses. For example, consider the following hypothetical respondent:

Gender    Coke   Pepsi   Sprite   Food1   Food2
FEMALE      1      1              FISH    PIZZA

This female respondent mentioned Coke and Pepsi as her favorite drinks, and Fish and Pizza as her favorite fast foods. In the complete crosstabulation table she will be counted in the following cells of the table:

                              Food                    TOTAL No.
Gender    Drink     HAMBURG.   FISH   PIZZA   . . .   of RESP.
FEMALE    COKE                  X      X                  2
          PEPSI                 X      X                  2
          SPRITE
MALE      COKE
          PEPSI
          SPRITE

This female respondent will "contribute" to (i.e., be counted in) the crosstabulation table a total of 4 times. In addition, she will be counted twice in the Female--Coke marginal frequency column if that column is requested to represent the total number of responses; if the marginal totals are computed as the total number of respondents, then this respondent will only be counted once.
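
The counting rule for this respondent can be traced with a short Python sketch. The dictionary layout and names are illustrative assumptions; only the respondent's answers come from the example.

```python
from collections import Counter
from itertools import product

# The hypothetical respondent from the example above
respondent = {"gender": "FEMALE", "drinks": ["COKE", "PEPSI"], "foods": ["FISH", "PIZZA"]}

def crosstab_cells(person):
    """Every (gender, drink, food) cell in which this respondent is counted."""
    return list(product([person["gender"]], person["drinks"], person["foods"]))

cells = crosstab_cells(respondent)
print(len(cells))   # 4 cells: Coke-Fish, Coke-Pizza, Pepsi-Fish, Pepsi-Pizza

# Marginal totals for each (gender, drink) row, computed in the two possible ways
by_responses = Counter((gender, drink) for gender, drink, _ in cells)
by_respondents = {key: 1 for key in by_responses}

print(by_responses[("FEMALE", "COKE")])     # 2 when counting responses
print(by_respondents[("FEMALE", "COKE")])   # 1 when counting respondents
```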

Paired Crosstabulation of Multiple Response Variables. A unique option for tabulating multiple response variables is to treat the variables in two or more multiple response variables as matched pairs. Again, this method is best illustrated with a simple example. Suppose we conducted a survey of past and present home ownership. We asked the respondents to describe their last three (including the present) homes that they purchased. Naturally, for some respondents the present home is the first and only home; others have owned more than one home in the past. For each home we asked our respondents to write down the number of rooms in the respective house, and the number of occupants. Here is how the data for one respondent (say case number 112) may be entered into a data file:

            Rooms         No. Occ.
Case no.    1   2   3     1   2   3
112         3   3   4     2   3   5

This respondent owned three homes; the first had 3 rooms, the second also had 3 rooms, and the third had 4 rooms. The family apparently also grew; there were 2 occupants in the first home, 3 in the second, and 5 in the third.

Now suppose we wanted to crosstabulate the number of rooms by the number of occupants for all respondents. One way to do so is to prepare three different two-way tables; one for each home. We can also treat the two factors in this study (Number of Rooms, Number of Occupants) as multiple response variables. However, it would obviously not make any sense to count the example respondent 112 shown above in cell 3 Rooms - 5 Occupants of the crosstabulation table (which we would, if we simply treated the two factors as ordinary multiple response variables). In other words, we want to ignore the combination of occupants in the third home with the number of rooms in the first home. Rather, we would like to count these variables in pairs; we would like to consider the number of rooms in the first home together with the number of occupants in the first home, the number of rooms in the second home with the number of occupants in the second home, and so on. This is exactly what will be accomplished if we asked for a paired crosstabulation of these multiple response variables.
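
The difference between the two ways of counting can be shown for case 112 with a small Python sketch (the variable names are ours): an ordinary multiple response crosstabulation combines every room count with every occupant count, while the paired version matches them home by home.

```python
from collections import Counter
from itertools import product

# Case 112 from the example: rooms and occupants for homes 1, 2, and 3
rooms = [3, 3, 4]
occupants = [2, 3, 5]

# Ordinary multiple response crosstabulation: every room value meets every occupant value
unpaired = Counter(product(rooms, occupants))

# Paired crosstabulation: the rooms of home k are matched only with the occupants of home k
paired = Counter(zip(rooms, occupants))

print(unpaired[(3, 5)])   # 2: the spurious "3 Rooms - 5 Occupants" cell is filled
print(paired[(3, 5)])     # 0: no home had 3 rooms and 5 occupants
print(sorted(paired))     # [(3, 2), (3, 3), (4, 5)]
```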

A Final Comment. When preparing complex crosstabulation tables with multiple responses/dichotomies, it is sometimes difficult (in our experience) to "keep track" of exactly how the cases in the file are counted. The best way to verify that one understands the way in which the respective tables are constructed is to crosstabulate some simple example data, and then to trace how each case is counted. The example section of the Crosstabulation chapter in the manual employs this method to illustrate how data are counted for tables involving multiple response variables and multiple dichotomies.


© Copyright StatSoft, Inc., 1984-2003

ANOVA/MANOVA

• Basic Ideas

o The Partitioning of Sums of Squares

o Multi-Factor ANOVA

o Interaction Effects

• Complex Designs

o Between-Groups and Repeated Measures

o Incomplete (Nested) Designs

• Analysis of Covariance (ANCOVA)

o Fixed Covariates

o Changing Covariates

• Multivariate Designs: MANOVA/MANCOVA

o Between-Groups Designs

o Repeated Measures Designs

o Sum Scores versus MANOVA

• Contrast Analysis and Post hoc Tests

o Why Compare Individual Sets of Means?

o Contrast Analysis

o Post hoc Comparisons

• Assumptions and Effects of Violating Assumptions

o Deviation from Normal Distribution

o Homogeneity of Variances

o Homogeneity of Variances and Covariances

o Sphericity and Compound Symmetry

• Methods for Analysis of Variance

This chapter includes a general introduction to ANOVA and a discussion of the general topics in the analysis of variance techniques, including repeated measures designs, ANCOVA, MANOVA, unbalanced and incomplete designs, contrast effects, post-hoc comparisons, assumptions, etc. For related topics, see also Variance Components (topics related to estimation of variance components in mixed model designs), Experimental Design/DOE (topics related to specialized applications of ANOVA in industrial settings), and Repeatability and Reproducibility Analysis (topics related to specialized designs for evaluating the reliability and precision of measurement systems).

See also General Linear Models, General Regression Models; to analyze nonlinear models, see Generalized Linear Models.

Basic Ideas

The Purpose of Analysis of Variance

In general, the purpose of analysis of variance (ANOVA) is to test for significant differences between means. Elementary Concepts provides a brief introduction to the basics of statistical significance testing. If we are only comparing two means, then ANOVA will give the same results as the t test for independent samples (if we are comparing two different groups of cases or observations), or the t test for dependent samples (if we are comparing two variables in one set of cases or observations). If you are not familiar with those tests, you may at this point want to "brush up" on them by reading Basic Statistics and Tables.

Why the name analysis of variance? It may seem odd to you that a procedure that compares means is called analysis of variance. However, this name is derived from the fact that in order to test for statistical significance between means, we are actually comparing (i.e., analyzing) variances.

• The Partitioning of Sums of Squares

• Multi-Factor ANOVA

• Interaction Effects


See also Methods for Analysis of Variance, Variance Components and Mixed Model ANOVA/ANCOVA, and Experimental Design (DOE).

The Partitioning of Sums of Squares

At the heart of ANOVA is the fact that variances can be divided up, that is, partitioned. Remember that the variance is computed as the sum of squared deviations from the overall mean, divided by n-1 (sample size minus one). Thus, given a certain n, the variance is a function of the sums of (deviation) squares, or SS for short. Partitioning of variance works as follows. Consider the following data set:

                         Group 1   Group 2
Observation 1               2         6
Observation 2               3         7
Observation 3               1         5
Mean                        2         6
Sums of Squares (SS)        2         2
Overall Mean                     4
Total Sums of Squares           28

The means for the two groups are quite different (2 and 6, respectively). The sums of squares within each group are equal to 2. Adding them together, we get 4. If we now repeat these computations, ignoring group membership, that is, if we compute the total SS based on the overall mean, we get the number 28. In other words, computing the variance (sums of squares) based on the within-group variability yields a much smaller estimate of variance than computing it based on the total variability (the overall mean). The reason for this in the above example is of course that there is a large difference between means, and it is this difference that accounts for the difference in the SS. In fact, if we were to perform an ANOVA on the above data, we would get the following result:

MAIN EFFECT
           SS     df     MS      F       p
Effect    24.0     1    24.0    24.0    .008
Error      4.0     4     1.0

As you can see, in the above table the total SS (28) was partitioned into the SS due to within-group variability (2+2=4) and variability due to differences between means (28-(2+2)=24).
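
The partitioning can be retraced step by step with a short Python sketch that uses the six observations from the example. The helper names are ours; the p value is not computed here because that would require the F distribution.

```python
group1 = [2, 3, 1]
group2 = [6, 7, 5]

def mean(values):
    return sum(values) / len(values)

def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

ss_within = sum_of_squares(group1) + sum_of_squares(group2)   # 2 + 2
ss_total = sum_of_squares(group1 + group2)                    # around the overall mean
ss_effect = ss_total - ss_within

df_effect = 2 - 1                               # number of groups minus one
df_error = len(group1) + len(group2) - 2        # observations minus number of groups
ms_effect = ss_effect / df_effect
ms_error = ss_within / df_error
f_ratio = ms_effect / ms_error

print(ss_within, ss_total, ss_effect, f_ratio)   # 4.0 28.0 24.0 24.0
```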

SS Error and SS Effect. The within-group variability (SS) is usually referred to as Error variance. This term denotes the fact that we cannot readily explain or account for it in the current design. However, the SS Effect we can explain. Namely, it is due to the differences in means between the groups. Put another way, group membership explains this variability because we know that it is due to the differences in means.

Significance testing. The basic idea of statistical significance testing is discussed in Elementary Concepts. Elementary Concepts also explains why very many statistical tests represent ratios of explained to unexplained variability. ANOVA is a good example of this. Here, we base this test on a comparison of the variance due to the between-groups variability (called Mean Square Effect, or MSeffect) with the within-group variability (called Mean Square Error, or MSerror; this term was first used by Edgeworth, 1885). Under the null hypothesis (that there are no mean differences between groups in the population), we would still expect some minor random fluctuation in the means for the two groups when taking small samples (as in our example). Therefore, under the null hypothesis, the variance estimated based on within-group variability should be about the same as the variance due to between-groups variability. We can compare those two estimates of variance via the F test (see also F Distribution), which tests whether the ratio of the two variance estimates is significantly greater than 1. In our example above, that test is highly significant, and we would in fact conclude that the means for the two groups are significantly different from each other.

Summary of the basic logic of ANOVA. To summarize the discussion up to this point, the purpose of analysis of variance is to test differences in means (for groups or variables) for statistical significance. This is accomplished by analyzing the variance, that is, by partitioning the total variance into the component that is due to true random error (i.e., within-group SS) and the components that are due to differences between means. These latter variance components are then tested for statistical significance, and, if significant, we reject the null hypothesis of no differences between means, and accept the alternative hypothesis that the means (in the population) are different from each other.

Dependent and independent variables. The variables that are measured (e.g., a test score) are called dependent variables. The variables that are manipulated or controlled (e.g., a teaching method or some other criterion used to divide observations into groups that are compared) are called factors or independent variables. For more information on this important distinction, refer to Elementary Concepts.

Multi-Factor ANOVA

In the simple example above, it may have occurred to you that we could have simply computed a t test for independent samples to arrive at the same conclusion. And, indeed, we would get the identical result if we were to compare the two groups using this test. However, ANOVA is a much more flexible and powerful technique that can be applied to much more complex research issues.
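
The equivalence with the t test can be checked numerically: for two groups, the squared t value of the pooled-variance t test for independent samples equals the F value of the one-way ANOVA. The following Python sketch assumes the equal-variance form of the t test; the function names are ours.

```python
import math

group1 = [2, 3, 1]
group2 = [6, 7, 5]

def mean(values):
    return sum(values) / len(values)

def pooled_t(a, b):
    """Student's t for two independent samples, assuming equal variances."""
    ss_a = sum((v - mean(a)) ** 2 for v in a)
    ss_b = sum((v - mean(b)) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)   # same as MS error in the ANOVA
    std_err = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(b) - mean(a)) / std_err

t = pooled_t(group1, group2)
print(round(t, 3), round(t ** 2, 3))   # t squared reproduces the F value of the ANOVA table
```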

Multiple factors. The world is complex and multivariate in nature, and instances when a single variable completely explains a phenomenon are rare. For example, when trying to explore how to grow a bigger tomato, we would need to consider factors that have to do with the plants' genetic makeup, soil conditions, lighting, temperature, etc. Thus, in a typical experiment, many factors are taken into account. One important reason for using ANOVA methods rather than multiple two-group studies analyzed via t tests is that the former method is more efficient, and with fewer observations we can gain more information. Let us expand on this statement.

Controlling for factors. Suppose that in the above two-group example we introduce another grouping factor, for example, Gender. Imagine that in each group we have 3 males and 3 females. We could summarize this design in a 2 by 2 table:

            Experimental   Experimental
            Group 1        Group 2
Males            2              6
                 3              7
                 1              5
Mean             2              6
Females          4              8
                 5              9
                 3              7
Mean             4              8

Before performing any computations, it appears that we can partition the total variance into at least 3 sources: (1) error (within-group) variability, (2) variability due to experimental group membership, and (3) variability due to gender. (Note that there is an additional source -- interaction -- that we will discuss shortly.) What would have happened had we not included gender as a factor in the study but rather computed a simple t test? If you compute the SS ignoring the gender factor (use the within-group means ignoring or collapsing across gender; the result is SS=10+10=20), you will see that the resulting within-group SS is larger than it is when we include gender (use the within-group, within-gender means to compute those SS; they will be equal to 2 in each group, thus the combined SS-within is equal to 2+2+2+2=8). This difference is due to the fact that the means for males are systematically lower than those for females, and this difference in means adds variability if we ignore this factor. Controlling for error variance increases the sensitivity (power) of a test. This example demonstrates another principle of ANOVA that makes it preferable over simple two-group t test studies: In ANOVA we can test each factor while controlling for all others; this is actually the reason why ANOVA is more statistically powerful (i.e., we need fewer observations to find a significant effect) than the simple t test.
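
The two within-group SS values quoted above can be reproduced with a few lines of Python. The dictionary layout is an assumption made for this sketch; the observations are those of the 2 by 2 example.

```python
# Observations from the 2 by 2 example: (gender, experimental group) -> scores
cells = {
    ("male", 1): [2, 3, 1],
    ("male", 2): [6, 7, 5],
    ("female", 1): [4, 5, 3],
    ("female", 2): [8, 9, 7],
}

def sum_of_squares(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

# Ignoring gender: pool males and females within each experimental group
ss_ignoring_gender = sum(
    sum_of_squares(cells[("male", g)] + cells[("female", g)]) for g in (1, 2)
)

# Including gender: deviations are taken from each gender-by-group cell mean
ss_with_gender = sum(sum_of_squares(scores) for scores in cells.values())

print(ss_ignoring_gender, ss_with_gender)   # 20.0 8.0
```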

Interaction Effects

There is another advantage of ANOVA over simple t-tests: ANOVA allows us to detect interaction effects between variables, and, therefore, to test more complex hypotheses about reality. Let us consider another example to illustrate this point. (The term interaction was first used by Fisher, 1926.)

Main effects, two-way interaction. Imagine that we have a sample of highly achievement-oriented students and another of achievement "avoiders." We now create two random halves in each sample, and give one half of each sample a challenging test, the other an easy test. We measure how hard the students work on the test. The means of this (fictitious) study are as follows:

                    Achievement-   Achievement-
                    oriented       avoiders
Challenging Test        10              5
Easy Test                5             10

How can we summarize these results? Is it appropriate to conclude that (1) challenging tests make students work harder, (2) achievement-oriented students work harder than achievement-avoiders? Neither of these statements captures the essence of this clearly systematic pattern of means. The appropriate way to summarize the result would be to say that challenging tests make only achievement-oriented students work harder, while easy tests make only achievement-avoiders work harder. In other words, the type of achievement orientation and test difficulty interact in their effect on effort; specifically, this is an example of a two-way interaction between achievement orientation and test difficulty. Note that statements 1 and 2 above describe so-called main effects.
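
One simple way to see why neither main effect describes these data is to compute the marginal mean differences and the interaction contrast directly from the four cell means. The sketch below is only an illustration of that arithmetic; the key labels and function name are ours.

```python
# Cell means from the example: (test difficulty, achievement type) -> mean effort
means = {
    ("challenging", "oriented"): 10,
    ("challenging", "avoider"): 5,
    ("easy", "oriented"): 5,
    ("easy", "avoider"): 10,
}

def marginal_mean(level, position):
    """Mean of all cells whose key has `level` at the given position."""
    values = [m for key, m in means.items() if key[position] == level]
    return sum(values) / len(values)

# Main effects: differences between marginal means
test_effect = marginal_mean("challenging", 0) - marginal_mean("easy", 0)
type_effect = marginal_mean("oriented", 1) - marginal_mean("avoider", 1)

# Interaction: does the effect of test difficulty differ between the two student types?
effect_for_oriented = means[("challenging", "oriented")] - means[("easy", "oriented")]
effect_for_avoiders = means[("challenging", "avoider")] - means[("easy", "avoider")]
interaction = effect_for_oriented - effect_for_avoiders

print(test_effect, type_effect, interaction)   # 0.0 0.0 10
```

Both marginal differences are zero, while the effect of test difficulty is +5 for one type of student and -5 for the other; the whole pattern sits in the interaction.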

Higher order interactions. While the previous two-way interaction can be put into words relatively easily, higher order interactions are increasingly difficult to verbalize. Imagine that we had included factor Gender in the achievement study above, and we had obtained the following pattern of means:

Females
                    Achievement-   Achievement-
                    oriented       avoiders
Challenging Test        10              5
Easy Test                5             10

Males
                    Achievement-   Achievement-
                    oriented       avoiders
Challenging Test         1              6
Easy Test                6              1

How could we now summarize the results of our study? Graphs of means for all effects greatly facilitate the interpretation of complex effects. The pattern shown in the table above represents a three-way interaction between factors.

Thus we may summarize this pattern by saying that for females there is a two-way interaction between achievement-orientation type and test difficulty: Achievement-oriented females work harder on challenging tests than on easy tests, achievement-avoiding females work harder on easy tests than on difficult tests. For males, this interaction is reversed. As you can see, the description of the interaction has become much more involved.
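
The same arithmetic extends to the three-way case: we compute the two-way interaction contrast separately for each gender and then compare the two contrasts. This is a sketch with labels of our own choosing, using the cell means from the tables above.

```python
# Cell means: gender -> {(test difficulty, achievement type): mean effort}
means = {
    "female": {("challenging", "oriented"): 10, ("challenging", "avoider"): 5,
               ("easy", "oriented"): 5, ("easy", "avoider"): 10},
    "male": {("challenging", "oriented"): 1, ("challenging", "avoider"): 6,
             ("easy", "oriented"): 6, ("easy", "avoider"): 1},
}

def two_way_contrast(cell):
    """Effect of test difficulty for oriented students minus that for avoiders."""
    oriented = cell[("challenging", "oriented")] - cell[("easy", "oriented")]
    avoiders = cell[("challenging", "avoider")] - cell[("easy", "avoider")]
    return oriented - avoiders

female_contrast = two_way_contrast(means["female"])
male_contrast = two_way_contrast(means["male"])

# The two-way interaction changes sign between genders: a three-way interaction
three_way = female_contrast - male_contrast

print(female_contrast, male_contrast, three_way)   # 10 -10 20
```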

A general way to express interactions. A general way to express all interactions is to say that an effect is modified (qualified) by another effect. Let us try this with the two-way interaction above. The main effect for test difficulty is modified by achievement orientation. For the three-way interaction in the previous paragraph, we may summarize that the two-way interaction between test difficulty and achievement orientation is modified (qualified) by gender. If we have a four-way interaction, we may say that the three-way interaction is modified by the fourth variable, that is, that there are different types of interactions in the different levels of the fourth variable. As it turns out, in many areas of research five-way or higher-order interactions are not that uncommon.


Complex Designs

Let us review the basic "building blocks" of complex designs.

• Between-Groups and Repeated Measures

• Incomplete (Nested) Designs


Between-Groups and Repeated Measures

When we want to compare two groups, we would use the t test for independent samples; when we want to compare two variables given the same subjects (observations), we would use the t test for dependent samples. This distinction -- dependent and independent samples -- is important for ANOVA as well. Basically, if we have repeated measurements of the same variable (under different conditions or at different points in time) on the same subjects, then the factor is a repeated measures factor (also called a within-subjects factor, because to estimate its significance we compute the within-subjects SS). If we compare different groups of subjects (e.g., males and females; three strains of bacteria, etc.) then we refer to the factor as a between-groups factor. The computations of significance tests are different for these different types of factors; however, the logic of computations and interpretations is the same.

Between-within designs. In many instances, experiments call for the inclusion of between-groups and repeated measures factors. For example, we may measure math skills in male and female students (gender, a between-groups factor) at the beginning and the end of the semester. The two measurements on each student would constitute a within-subjects (repeated measures) factor. The interpretation of main effects and interactions is not affected by whether a factor is between-groups or repeated measures, and both factors may obviously interact with each other (e.g., females improve over the semester while males deteriorate).

Incomplete (Nested) Designs

There are instances where we may decide to ignore interaction effects. This happens when (1) we know that in the population the interaction effect is negligible, or (2) when a complete factorial design (this term was first introduced by Fisher, 1935a) cannot be used for economic reasons. Imagine a study where we want to evaluate the effect of four fuel additives on gas mileage. For our test, our company has provided us with four cars and four drivers. A complete factorial experiment, that is, one in which each combination of driver, additive, and car appears at least once, would require 4 x 4 x 4 = 64 individual test conditions (groups). However, we may not have the resources (time) to run all of these conditions; moreover, it seems unlikely that the type of driver would interact with the fuel additive to an extent that would be of practical relevance. Given these considerations, one could actually run a so-called Latin square design and "get away" with only 16 individual groups (the four additives are denoted by letters A, B, C, and D):

            Car
            1   2   3   4
Driver 1    A   B   C   D
Driver 2    B   C   D   A
Driver 3    C   D   A   B
Driver 4    D   A   B   C

Latin square designs (this term was first used by Euler, 1782) are described in most textbooks on experimental methods (e.g., Hays, 1988; Lindman, 1974; Milliken & Johnson, 1984; Winer, 1962), and we do not want to discuss here the details of how they are constructed. Suffice it to say that this design is incomplete insofar as not all combinations of factor levels occur in the design. For example, Driver 1 will only drive Car 1 with additive A, while Driver 3 will drive that car with additive C. In a sense, the levels of the additives factor (A, B, C, and D) are placed into the cells of the car by driver matrix like "eggs into a nest." This mnemonic device is sometimes useful for remembering the nature of nested designs.
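
Although we do not discuss construction methods here, the particular square shown above follows a simple cyclic pattern that can be generated and checked in a few lines of Python. This is only a sketch of one possible construction; the function names are ours, and in practice the rows, columns, and letters of such a square are usually randomized before use.

```python
def cyclic_latin_square(symbols):
    """Row i is the list of symbols rotated left by i positions."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]

def is_latin_square(square):
    """Each symbol occurs exactly once in every row and in every column."""
    symbols = set(square[0])
    rows_ok = all(len(row) == len(symbols) and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return len(symbols) == len(square) and rows_ok and cols_ok

# Rows are drivers, columns are cars, letters are fuel additives
square = cyclic_latin_square(["A", "B", "C", "D"])
for driver, row in enumerate(square, start=1):
    print(f"Driver {driver}: {' '.join(row)}")

print(is_latin_square(square))                # True
print(sum(len(row) for row in square))        # 16 test conditions instead of 64
```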

Note that there are several other statistical procedures which may be used to analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular the methods discussed in the Variance Components and Mixed Model ANOVA/ANCOVA chapter are very efficient for analyzing designs with unbalanced nesting (when the nested factors have different numbers of levels within the levels of the factors in which they are nested), very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).


Analysis of Covariance (ANCOVA)

General Idea

The Basic Ideas section discussed briefly the idea of "controlling" for factors and how the inclusion of additional factors can reduce the error SS and increase the statistical power (sensitivity) of our design. This idea can be extended to continuous variables, and when such continuous variables are included as factors in the design they are called covariates.

• Fixed Covariates

• Changing Covariates


Fixed Covariates

Suppose that we want to compare the math skills of students who were randomly assigned to one of two alternative textbooks. Imagine that we also have data about the general intelligence (IQ) for each student in the study. We would suspect that general intelligence is related to math skills, and we can use this information to make our test more sensitive. Specifically, imagine that in each one of the two groups we can compute the correlation coefficient (see Basic Statistics and Tables) between IQ and math skills. Remember that once we have computed the correlation coefficient we can estimate the amount of variance in math skills that is accounted for by IQ, and the amount of (residual) variance that we cannot explain with IQ (refer also to Elementary Concepts and Basic Statistics and Tables). We may use this residual variance in the ANOVA as an estimate of the true error SS after controlling for IQ. If the correlation between IQ and math skills is substantial, then a large reduction in the error SS may be achieved.

Effect of a covariate on the F test. In the F test (see also F Distribution), to evaluate the statistical significance of between-groups differences, we compute the ratio of the between-groups variance (MSeffect) over the error variance (MSerror). If MSerror becomes smaller, due to the explanatory power of IQ, then the overall F value will become larger.
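
The reduction in the error term can be illustrated with a small numerical sketch. The data below are hypothetical, and the code assumes a single covariate with a common within-group regression slope, so that the adjusted error SS is the within-group SS of the dependent variable minus the part predicted by the covariate.

```python
# Hypothetical data: (IQ, math score) for students using textbook A or B
groups = {
    "A": [(90, 4), (100, 5), (110, 6)],
    "B": [(90, 6), (100, 8), (110, 7)],
}

def within_group_sums(pairs):
    """Return (SS of x, SS of y, sum of cross-products) around the group means."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return ss_x, ss_y, sp_xy

# Pool the within-group sums across both textbooks
ss_x, ss_y, sp_xy = (sum(parts) for parts in zip(*map(within_group_sums, groups.values())))

n_total = sum(len(pairs) for pairs in groups.values())
ss_error = ss_y                               # error SS without the covariate
ss_error_adj = ss_y - sp_xy ** 2 / ss_x       # error SS after removing what IQ explains

ms_error = ss_error / (n_total - len(groups))              # df = N - number of groups
ms_error_adj = ss_error_adj / (n_total - len(groups) - 1)  # one more df spent on the covariate

print(ss_error, ss_error_adj)
print(ms_error, ms_error_adj)
```

Note that the covariate costs one degree of freedom for error, so the error mean square only shrinks when the covariate explains enough within-group variability to outweigh that loss.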

Multiple covariates. The logic described above for the case of a single covariate (IQ) can easily be extended to the case of multiple covariates. For example, in addition to IQ, we might include measures of motivation, spatial reasoning, etc., and instead of a simple correlation, compute the multiple correlation coefficient (see Multiple Regression).

When the F value gets smaller. In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This is usually an indication that the covariates are not only correlated with the dependent variable (e.g., math skills), but also with the between-groups factors (e.g., the two different textbooks). For example, imagine that we measured IQ at the end of the semester, after the students in the different experimental groups had used the respective textbook for almost one year. It is possible that, even though students were initially randomly assigned to one of the two textbooks, the different books were so different that both math skills and IQ improved differentially in the two groups. In that case, the covariate will not only partition variance away from the error variance, but also from the variance due to the between-groups factor. Put another way, after controlling for the differences in IQ that were produced by the two textbooks, the math skills are not that different. Put in yet a third way, by "eliminating" the effects of IQ, we have inadvertently eliminated the true effect of the textbooks on students' math skills.

Adjusted means. When the latter case happens, that is, when the covariate is affected by the between-groups factor, then it is appropriate to compute so-called adjusted means. These are the means that one would get after removing all differences that can be accounted for by the covariate.
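
For a single covariate, the adjustment is commonly written as shown below. The symbols are introduced here for illustration and are not part of the discussion above: the group mean of the dependent variable is corrected by the pooled within-group regression slope b_w times the distance of the group's covariate mean from the overall covariate mean, assuming that this slope is the same in all groups.

```latex
\bar{Y}_j^{\,\mathrm{adj}} = \bar{Y}_j - b_w \left( \bar{X}_j - \bar{X} \right)
```

Here \bar{Y}_j and \bar{X}_j denote the means of the dependent variable and the covariate in group j, and \bar{X} is the overall mean of the covariate. A group whose covariate mean lies above the overall mean has its mean adjusted downward when b_w is positive.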

Interactions between covariates and factors. Just as we can test for interactions between factors, we can also test for the interactions between covariates and between-groups factors. Specifically, imagine that one of the textbooks is particularly suited for intelligent students, while the other actually bores those students but challenges the less intelligent ones. As a result, we may find a positive correlation in the first group (the more intelligent, the better the performance), but a zero or slightly negative correlation in the second group (the more intelligent the student, the less likely he or she is to acquire math skills from the particular textbook). In some older statistics textbooks this condition is discussed as a case where the assumptions for analysis of covariance are violated (see Assumptions and Effects of Violating Assumptions). However, because ANOVA/MANOVA uses a very general approach to analysis of covariance, you can specifically estimate the statistical significance of interactions between factors and covariates.

Changing Covariates

While fixed covariates are commonly discussed in textbooks on ANOVA, changing covariates are discussed less frequently. In general, when we have repeated measures, we are interested in testing the differences in repeated measurements on the same subjects. Thus we are actually interested in evaluating the significance of changes. If we have a covariate that is also measured at each point when the dependent variable is measured, then we can compute the correlation between the changes in the covariate and the changes in the dependent variable. For example, we could study math anxiety and math skills at the beginning and at the end of the semester. It would be interesting to see whether any changes in math anxiety over the semester correlate with changes in math skills.
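
As a minimal sketch of this idea, the following Python code computes change scores for the covariate and the dependent variable and then correlates them. The five students and their scores are hypothetical, and the Pearson correlation is written out by hand to keep the example self-contained.

```python
import math

# Hypothetical scores for five students at the start and end of the semester
anxiety_start = [8, 7, 9, 6, 5]
anxiety_end = [6, 7, 5, 5, 5]
skills_start = [50, 55, 45, 60, 65]
skills_end = [58, 56, 60, 63, 66]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Change = end of semester minus beginning of semester
anxiety_change = [end - start for start, end in zip(anxiety_start, anxiety_end)]
skills_change = [end - start for start, end in zip(skills_start, skills_end)]

r_change = pearson_r(anxiety_change, skills_change)
print(anxiety_change, skills_change, round(r_change, 3))
```

In these made-up data the correlation is strongly negative: the students whose anxiety dropped the most are also the ones whose skills improved the most.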


Multivariate Designs: MANOVA/MANCOVA

• Between-Groups Designs

• Repeated Measures Designs

• Sum Scores versus MANOVA

For more introductory topics, see:

• Basic Ideas

• Complex Designs

• Analysis of Covariance (ANCOVA)

• Contrast Analysis and Post hoc Tests

• Assumptions and Effects of Violating Assumptions

See also Methods for Analysis of Variance, Variance Components and Mixed Model ANOVA/ANCOVA, and Experimental Design (DOE).

Between-Groups Designs

All examples discussed so far have involved only one dependent variable. Even though the computations become increasingly complex, the logic and nature of the computations do not change when there is more than one dependent variable at a time. For example, we may conduct a study where we try two different textbooks, and we are interested in the students' improvements in math and physics. In that case, we have two dependent variables, and our hypothesis is that both together are affected by the difference in textbooks. We could now perform a multivariate analysis of variance (MANOVA) to test this hypothesis. Instead of a univariate F value, we would obtain a multivariate F value (Wilks' lambda) based on a comparison of the error variance/covariance matrix and the effect variance/covariance matrix. The "covariance" here is included because the two measures are probably correlated and we must take this correlation into account when performing the significance test. Obviously, if we were to take the same measure twice, then we would really not learn anything new. If we take a correlated measure, we gain some new information, but the new variable will also contain redundant information that is expressed in the covariance between the variables.
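To make the comparison of the two matrices concrete, here is a small sketch that computes Wilks' lambda for two groups and two dependent variables as the determinant of the error (within-groups) matrix divided by the determinant of the total (error plus effect) matrix. The data and function names are hypothetical, the code works with sums of squares and cross products rather than with variances (the ratio is the same), and it handles only the two-variable case.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def sscp(rows, center):
    """Sums of squares and cross products of the rows around center."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - center[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(groups):
    """groups is a list of groups; each group is a list of [dv1, dv2] rows."""
    all_rows = [r for g in groups for r in g]
    total = sscp(all_rows, mean_vector(all_rows))  # error + effect
    error = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        within = sscp(g, mean_vector(g))
        for a in range(2):
            for b in range(2):
                error[a][b] += within[a][b]
    return det2(error) / det2(total)


# Hypothetical improvements in [math, physics] for two textbooks.
book_1 = [[4, 3], [5, 5], [6, 4], [7, 6]]
book_2 = [[7, 6], [8, 8], [9, 7], [10, 9]]
lam = wilks_lambda([book_1, book_2])
print(lam)
```

For these invented numbers the ratio is 36/108, or about 0.33. Values near 1 indicate that the groups barely differ on the set of dependent variables, and values near 0 indicate strong separation; the multivariate F value is then derived from lambda.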

Interpreting results. If the overall multivariate test is significant, we conclude that the respective effect (e.g., textbook) is significant. However, our next question would of course be whether only math skills improved, only physics skills improved, or both. In fact, after obtaining a significant multivariate test for a particular main effect or interaction, customarily one would examine the univariate F tests (see also F Distribution) for each variable to interpret the respective effect. In other words, one would identify the specific dependent variables that contributed to the significant overall effect.

Repeated Measures Designs

If we were to measure math and physics skills at the beginning of the semester and the end of the semester, we would have a multivariate repeated measure. Again, the logic of significance testing in such designs is simply an extension of the univariate case. Note that MANOVA methods are also commonly used to test the significance of univariate repeated measures factors with more than two levels; this application will be discussed later in this section.

Sum Scores versus MANOVA

Even experienced users of ANOVA and MANOVA techniques are often puzzled by the differences in results that sometimes occur when performing a MANOVA on, for example, three variables as compared to a univariate ANOVA on the sum of the three variables. The logic underlying the summing of variables is that each variable contains some "true" value of the variable in question, as well as some random measurement error. Therefore, by summing up variables, the measurement error will sum to approximately 0 across all measurements, and the sum score will become more and more reliable (increasingly equal to the sum of true scores). In fact, under these circumstances, ANOVA on sums is appropriate and represents a very sensitive (powerful) method. However, if the dependent variable is truly multidimensional in nature, then summing is inappropriate. For example, suppose that my dependent measure consists of four indicators of success in society, and each indicator represents a completely independent way in which a person could "make it" in life (e.g., successful professional, successful entrepreneur, successful homemaker, etc.). Now, summing up the scores on those variables would be like adding apples to oranges, and the resulting sum score will not be a reliable indicator of a single underlying dimension. Thus, one should treat such data as multivariate indicators of success in a MANOVA.


Contrast Analysis and Post hoc Tests

• Why Compare Individual Sets of Means?

• Contrast Analysis

• Post hoc Comparisons

For more introductory topics, see:

• Basic Ideas

• Complex Designs

• Analysis of Covariance (ANCOVA)

• Multivariate Designs: MANOVA/MANCOVA

• Assumptions and Effects of Violating Assumptions

See also Methods for Analysis of Variance, Variance Components and Mixed Model ANOVA/ANCOVA, and Experimental Design (DOE).

Why Compare Individual Sets of Means?

Usually, experimental hypotheses are stated in terms that are more specific than simply main effects or interactions. We may have the specific hypothesis that a particular textbook will improve math skills in males, but not in females, while another book would be about equally effective for both genders, but less effective overall for males. Now generally, we are predicting an interaction here: the effectiveness of the book is modified (qualified) by the student's gender. However, we have a particular prediction concerning the nature of the interaction: we expect a significant difference between genders for one book, but not the other. This type of specific prediction is usually tested via contrast analysis.

Contrast Analysis

Briefly, contrast analysis allows us to test the statistical significance of predicted specific differences in particular parts of our complex design. It is a major and indispensable component of the analysis of every complex ANOVA design.

Post hoc Comparisons

Sometimes we find effects in our experiment that were not expected. Even though in most cases a creative experimenter will be able to explain almost any pattern of means, it would not be appropriate to analyze and evaluate that pattern as if one had predicted it all along. The problem here is one of capitalizing on chance when performing multiple tests post hoc, that is, without a priori hypotheses. To illustrate this point, let us consider the following "experiment." Imagine we were to write down a number between 1 and 10 on 100 pieces of paper. We then put all of those pieces into a hat and draw 20 samples (of pieces of paper) of 5 observations each, and compute the means (from the numbers written on the pieces of paper) for each group. How likely do you think it is that we will find two sample means that are significantly different from each other? It is very likely! Selecting the extreme means obtained from 20 samples is very different from taking only 2 samples from the hat in the first place, which is what the test via the contrast analysis implies. Without going into further detail, there are several so-called post hoc tests that are explicitly based on the first scenario (taking the extremes from 20 samples), that is, they are based on the assumption that we have chosen for our comparison the most extreme (different) means out of k total means in the design. Those tests apply "corrections" that are designed to offset the advantage of post hoc selection of the most extreme comparisons.
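The hat "experiment" is easy to simulate. The sketch below is only an illustration under stated assumptions: the numbers on the papers are drawn uniformly from 1 to 10, the 100 papers are split into 20 samples of 5, and two samples are compared with an ordinary equal-variance t test at the two-tailed .05 level (critical value 2.306 for 8 degrees of freedom). The function names, the seed, and the number of repetitions are arbitrary choices.

```python
import random
import statistics

T_CRIT_DF8 = 2.306  # two-tailed 5% critical value of Student's t, 8 df


def significant(a, b):
    """Equal-variance two-sample t test for two samples of equal size."""
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    diff = abs(statistics.mean(a) - statistics.mean(b))
    if pooled_var == 0:
        return diff > 0
    t = diff / (pooled_var * 2 / len(a)) ** 0.5
    return t > T_CRIT_DF8


def one_hat(rng):
    papers = [rng.randint(1, 10) for _ in range(100)]
    rng.shuffle(papers)
    samples = [papers[i:i + 5] for i in range(0, 100, 5)]  # 20 samples of 5
    # Comparison chosen in advance: the first two samples drawn.
    planned = significant(samples[0], samples[1])
    # Comparison chosen after the fact: the two most extreme sample means.
    lowest = min(samples, key=statistics.mean)
    highest = max(samples, key=statistics.mean)
    post_hoc = significant(lowest, highest)
    return planned, post_hoc


rng = random.Random(0)
results = [one_hat(rng) for _ in range(1000)]
planned_rate = sum(p for p, _ in results) / len(results)
post_hoc_rate = sum(q for _, q in results) / len(results)
print("planned comparison significant:", planned_rate)
print("extreme comparison significant:", post_hoc_rate)
```

The first rate should stay close to the nominal 5 percent, while the second should be far higher, even though every sample comes from the same hat. This inflation is what the post hoc "corrections" are designed to offset.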


Assumptions and Effects of Violating Assumptions

• Deviation from Normal Distribution

• Homogeneity of Variances

• Homogeneity of Variances and Covariances

• Sphericity and Compound Symmetry

For more introductory topics, see:

• Basic Ideas

• Complex Designs

• Analysis of Covariance (ANCOVA)

• Multivariate Designs: MANOVA/MANCOVA

• Contrast Analysis and Post hoc Tests

See also Methods for Analysis of Variance, Variance Components and Mixed Model ANOVA/ANCOVA, and Experimental Design (DOE).

Deviation from Normal Distribution

Assumptions. It is assumed that the dependent variable is measured on at least an interval scale level (see Elementary Concepts). Moreover, the dependent variable should be normally distributed within groups.

Effects of violations. Overall, the F test (see also F Distribution) is remarkably robust to deviations from normality (see Lindman, 1974, for a summary). If the kurtosis (see Basic Statistics and Tables) is greater than 0, then the F tends to be too small and we cannot reject the null hypothesis even though it is incorrect. The opposite is the case when the kurtosis is less than 0. The skewness of the distribution usually does not have a sizable effect on the F statistic. If the n per cell is fairly large, then deviations from normality do not matter much at all because of the central limit theorem, according to which the sampling distribution of the mean approximates the normal distribution, regardless of the distribution of the variable in the population. A detailed discussion of the robustness of the F statistic can be found in Box and Anderson (1955), or Lindman (1974).

Homogeneity of Variances

Assumptions. It is assumed that the variances in the different groups of the design are identical; this assumption is called the homogeneity of variances assumption. Remember that at the beginning of this section we computed the error variance (SS error) by adding up the sums of squares within each group. If the variances in the two groups are different from each other, then adding the two together is not appropriate, and will not yield an estimate of the common within-group variance (since no common variance exists).

Effects of violations. Lindman (1974, p. 33) shows that the F statistic is quite robust against violations of this assumption (heterogeneity of variances; see also Box, 1954a, 1954b; Hsu, 1938).

Special case: correlated means and variances. However, one instance when the F statistic is very misleading is when the means are correlated with variances across cells of the design. A scatterplot of variances or standard deviations against the means will detect such correlations. The reason why this is a "dangerous" violation is the following: Imagine that you have 8 cells in the design, 7 with about equal means but one with a much higher mean. The F statistic may suggest to you a statistically significant effect. However, suppose that there also is a much larger variance in the cell with the highest mean, that is, the means and the variances are correlated across cells (the higher the mean the larger the variance). In that case, the high mean in the one cell is actually quite unreliable, as is indicated by the large variance. However, because the overall F statistic is based on a pooled within-cell variance estimate, the high mean is identified as significantly different from the others, when in fact it is not at all significantly different if one based the test on the within-cell variance in that cell alone.

This pattern -- a high mean and a large variance in one cell -- frequently occurs when there are outliers present in the data. One or two extreme cases in a cell with only 10 cases can greatly bias the mean, and will dramatically increase the variance.
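A numerical counterpart of the scatterplot check is to compute the mean and variance in each cell and correlate them across cells. The following sketch uses four small hypothetical cells, one of which contains a single outlier; the data and names are invented for illustration only.

```python
import math
import statistics


def pearson_r(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cells; the last one contains a single extreme case (44).
cells = {
    "cell_1": [10, 12, 11, 9, 13],
    "cell_2": [11, 10, 12, 13, 9],
    "cell_3": [12, 11, 10, 13, 14],
    "cell_4": [10, 12, 11, 13, 44],
}

cell_means = [statistics.mean(v) for v in cells.values()]
cell_variances = [statistics.variance(v) for v in cells.values()]
r_mean_var = pearson_r(cell_means, cell_variances)

for name, m, v in zip(cells, cell_means, cell_variances):
    print(name, m, v)
print("correlation of means with variances:", round(r_mean_var, 3))
```

Here the single extreme case raises the mean of the last cell and inflates its variance at the same time, which produces the high mean-variance correlation that should prompt a closer look at that cell.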

Homogeneity of Variances and Covariances

Assumptions. In multivariate designs, with multiple dependent measures, the homogeneity of variances assumption described earlier also applies. However, since there are multiple dependent variables, it is also required that their intercorrelations (covariances) are homogeneous across the cells of the design. There are various specific tests of this assumption.

Effects of violations. The multivariate equivalent of the F test is Wilks' lambda. Not much is known about the robustness of Wilks' lambda to violations of this assumption. However, because the interpretation of MANOVA results usually rests on the interpretation of significant univariate effects (after the overall test is significant), the above discussion concerning univariate ANOVA basically applies, and important significant univariate effects should be carefully scrutinized.

Special case: ANCOVA. A special serious violation of the homogeneity of variances/covariances assumption may occur when covariates are involved in the design. Specifically, if the correlations of the covariates with the dependent measure(s) are very different in different cells of the design, gross misinterpretations of results may occur. Remember that in ANCOVA, we in essence perform a regression analysis within each cell to partition out the variance component due to the covariates. The homogeneity of variances/covariances assumption implies that we perform this regression analysis subject to the constraint that all regression equations (slopes) across the cells of the design are the same. If this is not the case, serious biases may occur. There are specific tests of this assumption, and it is advisable to look at those tests to ensure that the regression equations in different cells are approximately the same.
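Before turning to a formal test, a quick descriptive check is to estimate the regression slope separately in each cell and compare the values. The sketch below does this for two hypothetical cells; the data and names are invented, and the comparison is informal rather than a significance test.

```python
def within_cell_slope(pairs):
    """Least squares slope of the outcome on the covariate in one cell."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


# Hypothetical (covariate, dependent measure) pairs in two cells.
cells = {
    "book_1": [(90, 50), (100, 60), (110, 70), (120, 80)],
    "book_2": [(90, 66), (100, 65), (110, 64), (120, 63)],
}

slopes = {name: within_cell_slope(pairs) for name, pairs in cells.items()}
print(slopes)
```

In this example the slope is 1.0 in the first cell and -0.1 in the second, so a single common slope would describe neither cell well; this is the situation in which the ANCOVA adjustment can be seriously misleading.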

Sphericity and Compound Symmetry

Reasons for Using the Multivariate Approach to Repeated Measures ANOVA. In repeated measures ANOVA containing repeated measures factors with more than two levels, additional special assumptions enter the picture: The compound symmetry assumption and the assumption of sphericity. Because these assumptions rarely hold (see below), the MANOVA approach to repeated measures ANOVA has gained popularity in recent years (both tests are automatically computed in ANOVA/MANOVA). The compound symmetry assumption requires that the variances (pooled within-group) and covariances (across subjects) of the different repeated measures are homogeneous (identical). This is a sufficient condition for the univariate F test for repeated measures to be valid (i.e., for the reported F values to actually follow the F distribution). However, it is not a necessary condition. The sphericity assumption is a necessary and sufficient condition for the F test to be valid; it states that the within-subject "model" consists of independent (orthogonal) components. The nature of these assumptions, and the effects of violations are usually not well-described in ANOVA textbooks; in the following paragraphs we will try to clarify this matter and explain what it means when the results of the univariate approach differ from the multivariate approach to repeated measures ANOVA.

The necessity of independent hypotheses. One general way of looking at ANOVA is to consider it a model fitting procedure. In a sense we bring to our data a set of a priori hypotheses; we then partition the variance (test main effects, interactions) to test those hypotheses. Computationally, this approach translates into generating a set of contrasts (comparisons between means in the design) that specify the main effect and interaction hypotheses. However, if these contrasts are not independent of each other, then the partitioning of variances goes awry. For example, if two contrasts A and B are identical to each other and we partition out their components from the total variance, then we take the same thing out twice. Intuitively, specifying the two (not independent) hypotheses "the mean in Cell 1 is higher than the mean in Cell 2" and "the mean in Cell 1 is higher than the mean in Cell 2" is silly and simply makes no sense. Thus, hypotheses must be independent of each other, or orthogonal (the term orthogonality was first used by Yates, 1933).

Independent hypotheses in repeated measures. The general algorithm implemented will attempt to generate, for each effect, a set of independent (orthogonal) contrasts. In repeated measures ANOVA, these contrasts specify a set of hypotheses about differences between the levels of the repeated measures factor. However, if these differences are correlated across subjects, then the resulting contrasts are no longer independent. For example, in a study where we measured learning at three times during the experimental session, it may happen that the changes from time 1 to time 2 are negatively correlated with the changes from time 2 to time 3: subjects who learn most of the material between time 1 and time 2 improve less from time 2 to time 3. In fact, in most instances where a repeated measures ANOVA is used, one would probably suspect that the changes across levels are correlated across subjects. However, when this happens, the compound symmetry and sphericity assumptions have been violated, and independent contrasts cannot be computed.

Effects of violations and remedies. When the compound symmetry or sphericity assumptions have been violated, the univariate ANOVA table will give erroneous results. Before multivariate procedures were well understood, various approximations were introduced to compensate for the violations (e.g., Greenhouse & Geisser, 1959; Huynh & Feldt, 1970), and these techniques are still widely used.
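The Greenhouse and Geisser approximation rescales the degrees of freedom of the univariate F test by a factor, usually called epsilon, that is estimated from the covariance matrix of the repeated measures. The sketch below computes the commonly cited sample estimate of epsilon; it equals 1 when compound symmetry holds and cannot fall below 1/(k-1) for k repeated measures. The function names and the example data are hypothetical.

```python
def covariance_matrix(scores):
    """scores has one row per subject and one column per repeated measure."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in scores)
            / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def greenhouse_geisser_epsilon(cov):
    k = len(cov)
    mean_diag = sum(cov[i][i] for i in range(k)) / k
    grand_mean = sum(sum(row) for row in cov) / k ** 2
    row_means = [sum(row) / k for row in cov]
    sum_squares = sum(v * v for row in cov for v in row)
    numerator = (k * (mean_diag - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum_squares
        - 2 * k * sum(m * m for m in row_means)
        + k ** 2 * grand_mean ** 2
    )
    return numerator / denominator


# A compound symmetric matrix: equal variances and equal covariances.
eps_symmetric = greenhouse_geisser_epsilon([[4, 2, 2], [2, 4, 2], [2, 2, 4]])

# Unequal variances with zero covariances violate sphericity.
eps_unequal = greenhouse_geisser_epsilon([[1, 0, 0], [0, 4, 0], [0, 0, 9]])

# Hypothetical learning scores of five subjects at three times.
scores = [[10, 12, 15], [8, 11, 13], [12, 13, 19], [9, 13, 14], [11, 12, 18]]
eps_scores = greenhouse_geisser_epsilon(covariance_matrix(scores))
print(eps_symmetric, eps_unequal, eps_scores)
```

The further epsilon falls below 1, the more the numerator and denominator degrees of freedom of the univariate F test are reduced before the p-value is looked up.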

MANOVA approach to repeated measures. To summarize, the problem of compound symmetry and sphericity pertains to the fact that multiple contrasts involved in testing repeated measures effects (with more than two levels) are not independent of each other. However, they do not need to be independent of each other if we use multivariate criteria to simultaneously test the statistical significance of the two or more repeated measures contrasts. This "insight" is the reason why MANOVA methods are increasingly applied to test the significance of univariate repeated measures factors with more than two levels. We wholeheartedly endorse this approach because it simply bypasses the assumption of compound symmetry and sphericity altogether.

Cases when the MANOVA approach cannot be used. There are instances (designs) when the MANOVA approach cannot be applied; specifically, when there are few subjects in the design and many levels on the repeated measures factor, there may not be enough degrees of freedom to perform the multivariate analysis. For example, if we have 12 subjects and p = 4 repeated measures factors, each at k = 3 levels, then the four-way interaction would "consume" (k-1)**p = 2**4 = 16 degrees of freedom. However, we have only 12 subjects, so in this instance the multivariate test cannot be performed.

Differences in univariate and multivariate results. Anyone whose research involves extensive repeated measures designs has seen cases when the univariate approach to repeated measures ANOVA gives clearly different results from the multivariate approach. To repeat the point, this means that the differences between the levels of the respective repeated measures factors are in some way correlated across subjects. Sometimes, this insight by itself is of considerable interest.


Methods for Analysis of Variance

Several chapters in this textbook discuss methods for performing analysis of variance. Although many of the available statistics overlap in the different chapters, each is best suited for particular applications.

General ANCOVA/MANCOVA: This chapter includes discussions of full factorial designs, repeated measures designs, multivariate designs (MANOVA), designs with balanced nesting (designs can be unbalanced, i.e., have unequal n), methods for evaluating planned and post hoc comparisons, etc.

General Linear Models: This extremely comprehensive chapter discusses a complete implementation of the general linear model, and describes the sigma-restricted as well as the overparameterized approach. This chapter includes information on incomplete designs, complex analysis of covariance designs, nested designs (balanced or unbalanced), mixed model ANOVA designs (with random effects), and the efficient analysis of huge balanced ANOVA designs. It also contains descriptions of six types of Sums of Squares.

General Regression Models: This chapter discusses the between-subject designs and multivariate designs that are appropriate for stepwise regression, and explains how to perform stepwise and best-subset model building (for continuous as well as categorical predictors).

Mixed ANCOVA and Variance Components: This chapter includes discussions of experiments with random effects (mixed model ANOVA), estimating variance components for random effects, or large main effect designs (e.g., with factors with over 100 levels) with or without random effects, or large designs with many factors, when you do not need to estimate all interactions.

Experimental Design (DOE): This chapter includes discussions of standard experimental designs for industrial/manufacturing applications, including 2**(k-p) and 3**(k-p) designs, central composite and non-factorial designs, designs for mixtures, D and A optimal designs, and designs for arbitrarily constrained experimental regions.

Repeatability and Reproducibility Analysis (in the Process Analysis chapter): This section in the Process Analysis chapter includes a discussion of specialized designs for evaluating the reliability and precision of measurement systems; these designs usually include two or three random factors, and specialized statistics can be computed for evaluating the quality of a measurement system (typically in industrial/manufacturing applications).

Breakdown Tables (in the Basic Statistics chapter): This chapter includes discussions of experiments with only one factor (and many levels), or with multiple factors, when a complete ANOVA table is not required.


© Copyright StatSoft, Inc., 1984-2004

Variance Components and Mixed Model ANOVA/ANCOVA

• Basic Ideas

o Properties of Random Effects

• Estimation of Variance Components (Technical Overview)

o Estimating the Variation of Random Factors

o Estimating Components of Variation

o Testing the Significance of Variance Components

o Estimating the Population Intraclass Correlation

The Variance Components and Mixed Model ANOVA/ANCOVA chapter describes a comprehensive set of techniques for analyzing research designs that include random effects; however, these techniques are also well suited for analyzing large main effect designs (e.g., designs with over 200 levels per factor), designs with many factors where the higher order interactions are not of interest, and analyses involving case weights.

There are several chapters in this textbook that will discuss Analysis of Variance for factorial or specialized designs. For a discussion of these chapters and the types of designs for which they are best suited refer to the section on Methods for Analysis of Variance. Note, however, that the General Linear Models chapter describes how to analyze designs with any number and type of between effects and compute ANOVA-based variance component estimates for any effect in a mixed-model analysis.

Basic Ideas

Experimentation is sometimes mistakenly thought to involve only the manipulation of levels of the independent variables and the observation of subsequent responses on the dependent variables. Independent variables whose levels are determined or set by the experimenter are said to have fixed effects. There is a second class of effects, however, which is often of great interest to the researcher. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. Many independent variables of research interest are not fully amenable to experimental manipulation, but nevertheless can be studied by considering them to have random effects. For example, the genetic makeup of individual members of a species cannot at present be (fully) experimentally manipulated, yet it is of great interest to the geneticist to assess the genetic contribution to individual variation on outcomes such as health, behavioral characteristics, and the like. As another example, a manufacturer might want to estimate the components of variation in the characteristics of a product for a random sample of machines operated by a random sample of operators. The statistical analysis of random effects is accomplished by using the random effect model, if all of the independent variables are assumed to have random effects, or by using the mixed model, if some of the independent variables are assumed to have random effects and other independent variables are assumed to have fixed effects.

Properties of random effects. To illustrate some of the properties of random effects, suppose you collected data on the amount of insect damage done to different varieties of wheat. It is impractical to study insect damage for every possible variety of wheat, so to conduct the experiment, you randomly select four varieties of wheat to study. Plant damage is rated for up to a maximum of four plots per variety. Ratings are on a 0 (no damage) to 10 (great damage) scale. The following data for this example are presented in Milliken and Johnson (1992, p. 237).

DATA: wheat.sta 3v

VARIETY  PLOT  DAMAGE
A           1    3.90
A           2    4.05
A           3    4.25
B           4    3.60
B           5    4.20
B           6    4.05
B           7    3.85
C           8    4.15
C           9    4.60
C          10    4.15
C          11    4.40
D          12    3.35
D          13    3.80

To determine the components of variation in resistance to insect damage for Variety and Plot, an ANOVA can first be performed. Perhaps surprisingly, in the ANOVA, Variety can be treated as a fixed or as a random factor without influencing the results (provided that Type I Sums of squares are used and that Variety is always entered first in the model). The Spreadsheet below shows the ANOVA results of a mixed model analysis treating Variety as a fixed effect and ignoring Plot, i.e., treating the plot-to-plot variation as a measure of random error.

ANOVA Results: DAMAGE (wheat.sta)

Effect      Effect (F/R)  df Effect  MS Effect  df Error  MS Error  F         p
{1}VARIETY  Fixed         3          .270053    9         .056435   4.785196  .029275
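The entries of this table can be reproduced directly from the wheat.sta data with the usual one-way ANOVA arithmetic. The sketch below does so in plain Python; the function and variable names are hypothetical, and the p-value is omitted because it requires the F distribution, which is not part of the Python standard library.

```python
# DAMAGE ratings from wheat.sta, grouped by VARIETY.
damage = {
    "A": [3.90, 4.05, 4.25],
    "B": [3.60, 4.20, 4.05, 3.85],
    "C": [4.15, 4.60, 4.15, 4.40],
    "D": [3.35, 3.80],
}


def one_way_anova(groups):
    values = [v for g in groups.values() for v in g]
    grand_mean = sum(values) / len(values)
    group_means = {k: sum(g) / len(g) for k, g in groups.items()}

    # Between-groups (effect) and within-groups (error) sums of squares.
    ss_effect = sum(
        len(g) * (group_means[k] - grand_mean) ** 2 for k, g in groups.items()
    )
    ss_error = sum(
        (v - group_means[k]) ** 2 for k, g in groups.items() for v in g
    )

    df_effect = len(groups) - 1
    df_error = len(values) - len(groups)
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return df_effect, ms_effect, df_error, ms_error, ms_effect / ms_error


df_effect, ms_effect, df_error, ms_error, f_value = one_way_anova(damage)
print(df_effect, round(ms_effect, 6), df_error, round(ms_error, 6),
      round(f_value, 6))
```

The unequal number of plots per variety (3, 4, 4, and 2) does not complicate this computation, but it is the reason why the coefficients in the later variance component calculations are not whole numbers.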

Another way to perform the same mixed model analysis is to treat Variety as a fixed effect and Plot as a random effect. The Spreadsheet below shows the ANOVA results for this mixed model analysis.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta)
df error computed using Satterthwaite method

Effect      Effect (F/R)  df Effect  MS Effect  df Error  MS Error  F         p
{1}VARIETY  Fixed         3          .270053    9         .056435   4.785196  .029275
{2}PLOT     Random        9          .056435    -----     -----     -----     -----

The Spreadsheet below shows the ANOVA results for a random effect model treating Plot as a random effect nested within Variety, which is also treated as a random effect.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta)
df error computed using Satterthwaite method

Effect      Effect (F/R)  df Effect  MS Effect  df Error  MS Error  F         p
{1}VARIETY  Random        3          .270053    9         .056435   4.785196  .029275
{2}PLOT     Random        9          .056435    -----     -----     -----     -----

As can be seen, the tests of significance for the Variety effect are identical in all three analyses (and in fact, there are even more ways to produce the same result). When components of variance are estimated, however, the difference between the mixed model (treating Variety as fixed) and the random model (treating Variety as random) becomes apparent. The Spreadsheet below shows the variance component estimates for the mixed model treating Variety as a fixed effect.

Components of Variance (wheat.sta)
Mean Squares Type: 1

Source   DAMAGE
{2}PLOT  .056435
Error    0.000000

The Spreadsheet below shows the variance component estimates for the random effects model treating Variety and Plot as random effects.

Components of Variance (wheat.sta)
Mean Squares Type: 1

Source      DAMAGE
{1}VARIETY  .067186
{2}PLOT     .056435
Error       0.000000

As can be seen, the difference in the two sets of estimates is that a variance component is estimated for Variety only when it is considered to be a random effect. This reflects the basic distinction between fixed and random effects. The variation in the levels of random factors is assumed to be representative of the variation of the whole population of possible levels. Thus, variation in the levels of a random factor can be used to estimate the population variation. Even more importantly, covariation between the levels of a random factor and responses on a dependent variable can be used to estimate the population component of variance in the dependent variable attributable to the random factor. The variation in the levels of fixed factors is instead considered to be arbitrarily determined by the experimenter (i.e., the experimenter can make the levels of a fixed factor vary as little or as much as desired). Thus, the variation of a fixed factor cannot be used to estimate its population variance, nor can the population covariance with the dependent variable be meaningfully estimated. With this basic distinction between fixed effects and random effects in mind, we now can look more closely at the properties of variance components.


Estimation of Variance Components (Technical Overview)

The basic goal of variance component estimation is to estimate the population covariation between random factors and the dependent variable. Depending on the method used to estimate variance components, the population variances of the random factors can also be estimated, and significance tests can be performed to test whether the population covariation between the random factors and the dependent variable is nonzero.

Estimating the variation of random factors. The ANOVA method provides an integrative approach to estimating variance components, because ANOVA techniques can be used to estimate the variance of random factors, to estimate the components of variance in the dependent variable attributable to the random factors, and to test whether the variance components differ significantly from zero. The ANOVA method for estimating the variance of the random factors begins by constructing the Sums of squares and cross products (SSCP) matrix for the independent variables. The sums of squares and cross products for the random effects are then residualized on the fixed effects, leaving the random effects independent of the fixed effects, as required in the mixed model (see, for example, Searle, Casella, & McCulloch, 1992). The residualized Sums of squares and cross products for each random factor are then divided by their degrees of freedom to produce the coefficients in the Expected mean squares matrix. Nonzero off-diagonal coefficients for the random effects in this matrix indicate confounding, which must be taken into account when estimating the population variance for each factor. For the wheat.sta data, treating both Variety and Plot as random effects, the coefficients in the Expected mean squares matrix show that the two factors are at least somewhat confounded. The Expected mean squares Spreadsheet is shown below.

Expected Mean Squares (wheat.sta)
Mean Squares Type: 1

Source      Effect (F/R)  VARIETY   PLOT      Error
{1}VARIETY  Random        3.179487  1.000000  1.000000
{2}PLOT     Random                  1.000000  1.000000
Error                                         1.000000

The coefficients in the Expected mean squares matrix are used to estimate the variance components of the random effects by equating each observed Mean square to its expected value. For example, using Type I Sums of squares, the expected Mean square for Variety is 3.179487 times the variance component for Variety plus 1 times the variance component for Plot plus 1 times the variance component for Error; setting this expression equal to the observed Mean square for Variety and solving yields the estimate for Variety.
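Read as a system of equations in which each observed mean square is set equal to its expectation, these coefficients reproduce the Type I estimates reported for the wheat.sta data. The sketch below is illustrative only: the helper name is hypothetical, the coefficient for Variety is computed with the standard formula for an unbalanced one-way layout (which matches the 3.179487 shown above), and the Error component is taken to be zero, as reported in the Components of Variance Spreadsheet.

```python
def variety_coefficient(sizes):
    """Coefficient of the Variety component in its expected mean square."""
    n_total = sum(sizes)
    return (n_total - sum(s * s for s in sizes) / n_total) / (len(sizes) - 1)


plots_per_variety = [3, 4, 4, 2]   # varieties A, B, C, D in wheat.sta
ms_variety = 0.270053              # Mean square for Variety (ANOVA table)
ms_plot = 0.056435                 # Mean square for Plot (ANOVA table)

coef = variety_coefficient(plots_per_variety)

# Solve from the bottom row of the Expected mean squares matrix upward.
var_error = 0.0                    # assumed zero, as reported for Error
var_plot = ms_plot - var_error
var_variety = (ms_variety - var_plot - var_error) / coef

print(round(coef, 6), round(var_plot, 6), round(var_variety, 6))
```

With equal numbers of plots per variety the coefficient would simply be the number of plots per variety; the fractional value here reflects the unbalanced data.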

The ANOVA method provides an integrative approach to estimating variance components, but it is not without problems (i.e., ANOVA estimates of variance components are generally biased, and can be negative, even though variances, by definition, must be either zero or positive). An alternative to ANOVA estimation is provided by maximum likelihood estimation. Maximum likelihood methods for estimating variance components are based on quadratic forms, and typically, but not always, require iteration to find a solution. Perhaps the simplest form of maximum likelihood estimation is MIVQUE(0) estimation. MIVQUE(0) produces Minimum Variance Quadratic Unbiased Estimators (i.e., MIVQUE). In MIVQUE(0) estimation, there is no weighting of the random effects (thus the 0 [zero] after MIVQUE), so an iterative solution for estimating variance components is not required. MIVQUE(0) estimation begins by constructing the Quadratic sums of squares (SSQ) matrix. The elements for the random effects in the SSQ matrix can most simply be described as the sums of squares of the sums of squares and cross products for each random effect in the model (after residualization on the fixed effects). The elements of this matrix provide coefficients, similar to the elements of the Expected Mean Squares matrix, which are used to estimate the covariances among the random factors and the dependent variable. The SSQ matrix for the wheat.sta data is shown below. Note that the nonzero off-diagonal element for Variety and Plot again shows that the two random factors are at least somewhat confounded.

MIVQUE(0) Variance Component Estimation (wheat.sta)

SSQ Matrix

Source       VARIETY    PLOT       Error      DAMAGE
{1}VARIETY   31.90533    9.53846    9.53846   2.418964
{2}PLOT       9.53846   12.00000   12.00000   1.318077
Error         9.53846   12.00000   12.00000   1.318077

Restricted Maximum Likelihood (REML) and Maximum Likelihood (ML) variance component estimation methods are closely related to MIVQUE(0). In fact, in the program, REML and ML use MIVQUE(0) estimates as start values for an iterative solution for the variance components, so the elements of the SSQ matrix serve as initial estimates of the covariances among the random factors and the dependent variable for both REML and ML.


Estimating components of variation. For ANOVA methods for estimating variance components, a solution is found for the system of equations relating the estimated population variances and covariances among the random factors to the estimated population covariances between the random factors and the dependent variable. The solution then defines the variance components. The Spreadsheet below shows the Type I Sums of squares estimates of the variance components for the wheat.sta data.

Components of Variance (wheat.sta)

Mean Squares Type: 1

Source       DAMAGE
{1}VARIETY   0.067186
{2}PLOT      0.056435
Error        0.000000
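
As a rough check on these values, the ANOVA (method of moments) estimates can be reproduced by equating each observed mean square to its expected mean square and solving from the bottom up. The sketch below assumes the expected mean square coefficient of 3.179487 quoted for Variety, the mean squares reported for this design (0.270053 for Variety, 0.056435 for Plot), and an Error component of zero because plots are the basic unit of analysis. The function and variable names are illustrative only and are not part of the program.

```python
def anova_components(ms_variety, ms_plot, k_variety, var_error=0.0):
    """Solve the expected-mean-square equations for a Plot-within-Variety design.

    Assumed model (illustrative):
      E[MS(Plot)]    = var_plot + var_error
      E[MS(Variety)] = k_variety * var_variety + var_plot + var_error
    """
    var_plot = ms_plot - var_error
    var_variety = (ms_variety - var_plot - var_error) / k_variety
    return var_variety, var_plot, var_error


# Mean squares and coefficient for the wheat.sta example (Type I sums of squares).
var_variety, var_plot, var_error = anova_components(
    ms_variety=0.270053, ms_plot=0.056435, k_variety=3.179487
)
print(round(var_variety, 6), round(var_plot, 6), var_error)
```

Under these assumptions the result agrees with the tabled estimates to about five decimal places.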

MIVQUE(0) variance components are estimated by inverting the partition of the SSQ matrix that does not include the dependent variable (or finding the generalized inverse, for singular matrices), and postmultiplying the inverse by the dependent variable column vector. This amounts to solving the system of equations that relates the dependent variable to the random independent variables, taking into account the covariation among the independent variables. The MIVQUE(0) estimates for the wheat.sta data are listed in the Spreadsheet shown below.

MIVQUE(0) Variance Component Estimation (wheat.sta)

Variance Components

Source       DAMAGE
{1}VARIETY   0.056376
{2}PLOT      0.065028
Error        0.000000
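
The system of equations behind these estimates can be sketched directly from the SSQ matrix. In that matrix the Plot and Error rows are identical, so the random-effects partition is singular and a generalized inverse is needed. A generalized inverse is not unique; the sketch below makes the assumption that the redundant Error component is fixed at zero and then solves the remaining two equations by Cramer's rule. This choice reproduces the listed values, but it is not necessarily how the program proceeds, and the helper name solve_2x2 is hypothetical.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve the 2x2 linear system [[a11, a12], [a21, a22]] x = [b1, b2]."""
    det = a11 * a22 - a12 * a21
    x1 = (b1 * a22 - a12 * b2) / det
    x2 = (a11 * b2 - a21 * b1) / det
    return x1, x2


# SSQ matrix entries for the random effects (wheat.sta).
ssq_vv = 31.90533   # Variety with Variety
ssq_vp = 9.53846    # Variety with Plot
ssq_pp = 12.00000   # Plot with Plot

# DAMAGE column (dependent variable) for Variety and Plot.
dam_v, dam_p = 2.418964, 1.318077

# Assumption: the Error row duplicates the Plot row, so Error is set to 0.
variety, plot = solve_2x2(ssq_vv, ssq_vp, ssq_vp, ssq_pp, dam_v, dam_p)
error = 0.0
print(round(variety, 6), round(plot, 6), error)
```

A minimum-norm generalized inverse would instead split the Plot value evenly between Plot and Error, which is one reason the zero-Error convention is stated explicitly here.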

REML and ML variance components are estimated by iteratively optimizing the parameter estimates for the effects in the model. REML differs from ML in that the likelihood of the data is maximized only for the random effects, thus REML is a restricted solution. In both REML and ML estimation, an iterative solution is found for the weights for the random effects in the model that maximize the likelihood of the data. The program uses MIVQUE(0) estimates as the start values for both REML and ML estimation, so the relation between these three techniques is close indeed. The statistical theory underlying maximum likelihood variance component estimation techniques is an advanced topic (Searle, Casella, & McCulloch, 1992, is recommended as an authoritative and comprehensive source). Implementation of maximum likelihood estimation algorithms, furthermore, is difficult (see, for example, Hemmerle & Hartley, 1973, and Jennrich & Sampson, 1976, for descriptions of these algorithms), and faulty implementation can lead to variance component estimates that lie outside the parameter space, converge prematurely to nonoptimal solutions, or give nonsensical results. Milliken and Johnson (1992) noted all of these problems with the commercial software packages they used to estimate variance components.

The basic idea behind both REML and ML estimation is to find the set of weights for the random effects in the model that minimize the negative of the natural logarithm of the likelihood of the data (the likelihood of the data can vary from zero to one, so minimizing the negative of its natural logarithm amounts to maximizing the probability, or the likelihood, of the data). The logarithm of the REML likelihood and the REML variance component estimates for the wheat.sta data are listed in the last row of the Iteration history Spreadsheet shown below.

Iteration History (wheat.sta)

Variable: DAMAGE

Iter.   Log LL     Error     VARIETY
1       -2.30618   .057430   .068746
2       -2.25253   .057795   .073744
3       -2.25130   .056977   .072244
4       -2.25088   .057005   .073138
5       -2.25081   .057006   .073160
6       -2.25081   .057003   .073155
7       -2.25081   .057003   .073155

The logarithm of the ML likelihood and the ML estimates for the variance components for the wheat.sta data are listed in the last row of the Iteration history Spreadsheet shown below.

Iteration History (wheat.sta)

Variable: DAMAGE

Iter.   Log LL     Error     VARIETY
1       -2.53585   .057454   .048799
2       -2.48382   .057427   .048541
3       -2.48381   .057492   .048639
4       -2.48381   .057491   .048552
5       -2.48381   .057492   .048552
6       -2.48381   .057492   .048552

As can be seen, the estimates of the variance components for the different methods are quite similar. In general, components of variance using different estimation methods tend to agree fairly well (see, for example, Swallow & Monahan, 1984).


Testing the significance of variance components. When maximum likelihood estimation techniques are used, standard linear model significance testing techniques may not be applicable. ANOVA techniques such as decomposing sums of squares and testing the significance of effects by taking ratios of mean squares are appropriate for linear methods of estimation, but generally are not appropriate for quadratic methods of estimation. When ANOVA methods are used for estimation, standard significance testing techniques can be employed, with the exception that any confounding among random effects must be taken into account.

To test the significance of effects in mixed or random models, error terms must be constructed that contain all the same sources of random variation except for the variation of the respective effect of interest. This is done using Satterthwaite's method of denominator synthesis (Satterthwaite, 1946), which finds the linear combinations of sources of random variation that serve as appropriate error terms for testing the significance of the respective effect of interest. The Spreadsheet below shows the coefficients used to construct these linear combinations for testing the Variety and Plot effects.

Denominator Synthesis: Coefficients (MS Type: 1) (wheat.sta)

The synthesized MS Errors are linear combinations of the resp. MS effects

Effect       (F/R)    VARIETY   PLOT       Error
{1}VARIETY   Random             1.000000
{2}PLOT      Random                        1.000000

The coefficients show that the Mean square for Variety should be tested against the Mean square for Plot, and that the Mean square for Plot should be tested against the Mean square for Error. Referring back to the Expected mean squares Spreadsheet, it is clear that the denominator synthesis has identified appropriate error terms for testing the Variety and Plot effects. Although this is a simple example, in more complex analyses with various degrees of confounding among the random effects, the denominator synthesis can identify appropriate error terms for testing the random effects that would not be readily apparent.

To perform the tests of significance of the random effects, ratios of appropriate Mean squares are formed to compute F statistics and p levels for each effect. Note that in complex analyses the degrees of freedom for random effects can be fractional rather than integer values, indicating that fractions of sources of variation were used in synthesizing appropriate error terms for testing the random effects. The Spreadsheet displaying the results of the ANOVA for the Variety and Plot random effects is shown below. Note that for this simple design the results are identical to the results presented earlier in the Spreadsheet for the ANOVA treating Plot as a random effect nested within Variety.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta)

df error computed using Satterthwaite method

Effect       Effect (F/R)   df Effect   MS Effect   df Error   MS Error   F          p
{1}VARIETY   Fixed          3           .270053     9          .056435    4.785196   .029275
{2}PLOT      Random         9           .056435     -----      -----      -----      -----
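
The arithmetic behind the F statistic for Variety can be sketched as follows: the synthesized error term is a linear combination of mean squares (here simply 1 times the Mean square for Plot), and F is the ratio of the effect's mean square to that error term. The dictionary layout and function names are illustrative assumptions, not the program's interface, and the p level is not computed because that would require the F distribution.

```python
# Mean squares for the wheat.sta example.
mean_squares = {"VARIETY": 0.270053, "PLOT": 0.056435}

# Denominator synthesis coefficients: Variety is tested against 1 x MS(Plot).
synthesis = {"VARIETY": {"PLOT": 1.0}}


def synthesized_error(effect):
    """Linear combination of mean squares used as the error term for `effect`."""
    return sum(coef * mean_squares[src] for src, coef in synthesis[effect].items())


def f_ratio(effect):
    """F statistic: effect mean square divided by its synthesized error term."""
    return mean_squares[effect] / synthesized_error(effect)


print(round(f_ratio("VARIETY"), 3))
```

The result agrees with the tabled F to about four significant digits; the small difference comes from using rounded mean squares.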

As shown in the Spreadsheet, the Variety effect is found to be significant at p < .05, but as would be expected, the Plot effect cannot be tested for significance because plots served as the basic unit of analysis. If data on samples of plants taken within plots were available, a test of the significance of the Plot effect could be constructed.

Appropriate tests of significance for MIVQUE(0) variance component estimates generally cannot be constructed, except in special cases (see Searle, Casella, & McCulloch, 1992). Asymptotic (large sample) tests of significance of REML and ML variance component estimates, however, can be constructed for the parameter estimates from the final iteration of the solution. The Spreadsheet below shows the asymptotic (large sample) tests of significance for the REML estimates for the wheat.sta data.

Restricted Maximum Likelihood Estimates (wheat.sta)

Variable: DAMAGE

-2*Log(Likelihood)=4.50162399

Effect       Variance Comp.   Asympt. Std.Err.   Asympt. z   Asympt. p
{1}VARIETY   .073155          .078019            .937656     .348421
Error        .057003          .027132            2.100914    .035648

The Spreadsheet below shows the asymptotic (large sample) tests of significance for the ML estimates for the wheat.sta data.

Maximum Likelihood Estimates (wheat.sta)

Variable: DAMAGE

-2*Log(Likelihood)=4.96761616

Effect       Variance Comp.   Asympt. Std.Err.   Asympt. z   Asympt. p
{1}VARIETY   .048552          .050747            .956748     .338694
Error        .057492          .027598            2.083213    .037232

It should be emphasized that the asymptotic tests of significance for REML and ML variance component estimates are based on large sample sizes, which certainly is not the case for the wheat.sta data. For this data set, the tests of significance from both analyses agree in suggesting that the Variety variance component does not differ significantly from zero.
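
The asymptotic tests reported for the Variety component appear to be ordinary two-sided z tests: the estimate is divided by its asymptotic standard error and the result is referred to the standard normal distribution. The sketch below works under that assumption, using only the estimates and standard errors listed for REML and ML; the function name asymptotic_z_test is hypothetical.

```python
import math


def asymptotic_z_test(estimate, std_err):
    """Return (z, two-sided p) for an estimate under a normal approximation."""
    z = estimate / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Variety component and asymptotic standard error for each method (wheat.sta).
variety_estimates = [("REML", 0.073155, 0.078019), ("ML", 0.048552, 0.050747)]

for label, est, se in variety_estimates:
    z, p = asymptotic_z_test(est, se)
    print(label, round(z, 3), round(p, 3))
```

Both p values come out well above .05, in line with the conclusion that the Variety component does not differ significantly from zero in this small sample.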

For basic information on ANOVA in linear models, see also Elementary Concepts.


Estimating the population intraclass correlation. Note that if the variance component estimates for the random effects in the model are divided by the sum of all components (including the error component), the resulting proportions are population intraclass correlation coefficients for the respective effects.
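
As a small worked example, the calculation can be applied to the REML estimates for the wheat.sta data (Variety .073155, Error .057003). The function name is illustrative, and the choice of the REML estimates rather than the ANOVA, MIVQUE(0), or ML estimates is arbitrary.

```python
def intraclass_correlations(components):
    """Divide each variance component by the sum of all components."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# REML variance component estimates for the wheat.sta data.
reml = {"VARIETY": 0.073155, "Error": 0.057003}

icc = intraclass_correlations(reml)
print({name: round(value, 3) for name, value in icc.items()})
```

With these inputs a little over half of the total variation is attributed to Variety.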


© Copyright StatSoft, Inc., 1984-2004

STATISTICA is a trademark of StatSoft, Inc.


Association Rules

• Association Rules Introductory Overview

• Computational Procedures and Terminology

• Tabular Representation of Associations

• Graphical Representation of Associations

• Interpreting and Comparing Results

Association Rules Introductory Overview

The goal of the techniques described in this section is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and also research, from the analysis of consumer preferences or human resource management to the history of language. These techniques enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z." The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows you to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection.

How association rules work. The usefulness of this technique to address unique data mining problems is best illustrated in a simple example. Suppose you are collecting data at the check-out cash registers at a large book store. Each customer transaction is logged in a database, and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, you want to learn which books are likely to be purchased by a customer who you know already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest to the customer those additional titles. You may already be "familiar" with the results of these types of analyses, if you are a customer of various on-line (Web-based) retail businesses; many times when making a purchase on-line, the vendor will suggest similar items (to the ones purchased by you) at the time of "check-out", based on some rules such as "customers who buy book title A are also likely to purchase book title B," and so on.

Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables, can be used to analyze data of this kind. However, when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the simple "bookstore example" discussed earlier. First, the number of book titles is practically unlimited. In other words, if we made a table where each book title represented one dimension, and the purchase of that book (yes/no) formed the classes or categories for each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (i.e., cross-tabulation tables that are not sparse, not containing mostly zeros), but also determine the factorial degree of the tables that contain the important association rules.

To summarize, Association Rules will allow you to find rules of the kind If X then (likely) Y where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.


Computational Procedures and Terminology

Categorical or class variables. Categorical variables are single variables that contain codes or text values to denote distinct classes; for example, a variable Gender would have the categories Male and Female.

Multiple response variables. Multiple response variables usually consist of multiple variables (i.e., a list of variables) that can contain, for each observation, codes or text values describing a single "dimension" or transaction. A good example of a multiple response variable would be if a vendor recorded the purchases made by a customer in a single record, where each record could contain one or more items purchased, in arbitrary order. This is a typical format in which customer transaction data would be kept.

Multiple dichotomies. In this data format, each variable would represent one item or category, and the dichotomous data in each variable would indicate whether or not the respective item or category applies to the respective case. For example, suppose a vendor created a data spreadsheet where each column represented one of the products available for purchase. Each transaction (row of the data spreadsheet) would record whether or not the respective customer purchased that product, i.e., whether or not the respective transaction involved each item.
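
The relation between the multiple response format and the multiple dichotomies format can be illustrated with a short sketch that converts one into the other. The transactions and item names below are made up for illustration, and the function name to_dichotomies is hypothetical.

```python
def to_dichotomies(transactions):
    """Convert multiple-response records into one 0/1 column per distinct item."""
    items = sorted({item for record in transactions for item in record})
    rows = [[1 if item in record else 0 for item in items] for record in transactions]
    return items, rows


# Hypothetical multiple-response data: one record per transaction, items in arbitrary order.
transactions = [
    ["bread", "milk"],
    ["milk"],
    ["eggs", "bread", "milk"],
]

items, rows = to_dichotomies(transactions)
print(items)
for row in rows:
    print(row)
```

Each output row corresponds to one transaction and each column to one item, which is the layout described for multiple dichotomies.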

Association Rules: If Body then Head. The A-priori algorithm attempts to derive from the data association rules of the form: If "Body" then "Head", where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items; e.g., if (Car=Porsche and Age<20) then (Risk=High and Insurance=High); here the logical conjunction before the then would be the Body, and the logical conjunction following the then would be the Head of the association rule).

Initial Pass Through the Data: The Support Value. First the program will scan all variables to determine the unique codes or text values (items) found in the variables selected for the analysis. In this initial pass, the relative frequencies with which the individual codes or text values occur in each transaction will also be computed. The probability that a transaction contains a particular code or text value is called Support; the Support value is also computed in consecutive passes through the data, as the joint probability (relative frequency of co-occurrence) of pairs, triplets, etc. of codes or text values (items), i.e., separately for the Body and Head of each association rule.
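
A minimal sketch of the Support computation, assuming transactions are stored as lists of items: the support of an item (or of a set of items) is the fraction of transactions that contain all of them. The data and the function name support are illustrative only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for record in transactions if wanted <= set(record))
    return hits / len(transactions)


# Hypothetical transactions.
transactions = [
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "C"],
    ["B"],
]

print(support(["A"], transactions))       # single item
print(support(["A", "B"], transactions))  # joint support of a pair
```

The same function covers the later passes, since a pair or triplet is simply a larger item set.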

Second Pass Through the Data: The Confidence Value; Correlation Value. After the initial pass through the data, all items with a support value greater than some predefined minimum support value will be "remembered" for subsequent passes through the data. Specifically, the conditional probabilities will be computed for all pairs of codes or text values that have support values greater than the minimum support value. This conditional probability (that an observation or transaction that contains a code or text value X also contains a code or text value Y) is called the Confidence Value. In general (in later passes through the data) the confidence value denotes the conditional probability of the Head of the association rule, given the Body of the association rule.

In addition, the support value will be computed for each pair of codes or text values, and a Correlation value based on the support values. The correlation value for a pair of codes or text values {X, Y} is computed as the support value for that pair, divided by the square root of the product of the support values for X and Y. After the second pass through the data those pairs of codes or text values that (1) have a confidence value that is greater than some user-defined minimum confidence value, (2) have a support value that is greater than some user-defined minimum support value, and (3) have a correlation value that is greater than some minimum correlation value will be retained.
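
The confidence and correlation values defined above can be sketched as follows. Confidence is computed as the joint support divided by the support of the Body, and correlation as the joint support divided by the square root of the product of the two individual supports. The transactions, the threshold values, and the function names are assumptions made for illustration.

```python
import math


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for record in transactions if wanted <= set(record)) / len(transactions)


def confidence(body, head, transactions):
    """Conditional probability of the Head given the Body."""
    joint = support(set(body) | set(head), transactions)
    return joint / support(body, transactions)


def correlation(body, head, transactions):
    """Joint support divided by sqrt(support(Body) * support(Head))."""
    joint = support(set(body) | set(head), transactions)
    return joint / math.sqrt(support(body, transactions) * support(head, transactions))


def keep_rule(body, head, transactions, min_support, min_confidence, min_correlation):
    """Retain a rule only if all three values exceed their minimums."""
    joint = support(set(body) | set(head), transactions)
    return (joint > min_support
            and confidence(body, head, transactions) > min_confidence
            and correlation(body, head, transactions) > min_correlation)


# Hypothetical transactions and thresholds.
transactions = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["B"]]
print(confidence(["A"], ["B"], transactions))
print(correlation(["A"], ["B"], transactions))
print(keep_rule(["A"], ["B"], transactions, 0.3, 0.6, 0.6))
```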

Subsequent Passes Through The Data: Maximum Item Size in Body, Head. In subsequent steps, the data will be further scanned, computing support, confidence, and correlation values for pairs of codes or text values (associations between single codes or text values), triplets of codes or text values, and so on. To reiterate, in general, at each pass association rules will be derived of the general form if "Body" then "Head", where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items).

Unless the process stops because no further associations can be found that satisfy the minimum support, confidence, and correlation conditions, the process could continue to build very complex association rules (e.g., if X1 and X2 .. and X20 then Y1 and Y2 ... and Y20). To avoid excessive complexity, additionally, the user can specify the maximum number of codes or text values (items) in the Body and Head of the association rules; this value is referred to as the maximum item set size in the Body and Head of an association rule.
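
The level-by-level search with a cap on item set size can be sketched in a few lines. This is a simplified illustration rather than the program's algorithm: candidates of each size are built only from items that still belong to some retained set of the previous size, the search stops when no set of the current size exceeds the minimum support, and only support is checked (confidence and correlation are left out to keep the example short). All names and data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size):
    """Level-wise search for item sets whose support exceeds `min_support`."""
    records = [frozenset(t) for t in transactions]
    n = len(records)
    candidates = sorted({item for record in records for item in record})
    found = {}
    for size in range(1, max_size + 1):
        level = {}
        for cand in combinations(candidates, size):
            s = sum(1 for record in records if record.issuperset(cand)) / n
            if s > min_support:
                level[frozenset(cand)] = s
        if not level:
            break  # no further associations satisfy the minimum support
        found.update(level)
        # Only items that survived at this size can appear in larger sets.
        candidates = sorted({item for itemset in level for item in itemset})
    return found


# Hypothetical transactions; sets are capped at two items.
transactions = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["B"]]
result = frequent_itemsets(transactions, min_support=0.4, max_size=2)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Raising max_size lets the search continue to triplets and beyond, at the cost of more candidates per pass.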


Tabular Representation of Associations

Association rules are generated of the general form if Body then Head, where Body and Head stand for single codes or text values (items) or conjunctions of codes or text values (items; e.g., if (Car=Porsche and Age<20) then (Risk=High and Insurance=High)). The major statistics computed for the association rules are Support (relative frequency of the Body or Head of the rule), Confidence (conditional probability of the Head given the Body of the rule), and Correlation (support for Body and Head, divided by the square root of the product of the support for the Body and the support for the Head). These statistics can be summarized in a spreadsheet, as shown below.

This results spreadsheet shows an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare's "All's Well That Ends Well," after removing a few very frequent words like is, of, etc. The values for support, confidence, and correlation are expressed in percent.


Graphical Representation of Associations

As a result of applying Association Rules data mining techniques to large datasets, rules of the form if "Body" then "Head" will be derived, where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items; e.g., if (Car=Porsche and Age<20) then (Risk=High and Insurance=High)). These rules can be reviewed in textual format or tables, or in graphical format (see below).

Association Rules Networks, 2D. For example, consider the data that describe a (fictitious) survey of 100 patrons of sports bars and their preferences for watching various sports on television. This would be an example of simple categorical variables, where each variable represents one sport. For each sport, each respondent indicated how frequently s/he watched the respective type of sport on television. The association rules derived from these data could be summarized as follows:

In this graph, the support values for the Body and Head portions of each association rule are indicated by the sizes and colors of each circle. The thickness of each line indicates the confidence value (conditional probability of Head given Body) for the respective association rule; the sizes and colors of the circles in the center, above the Implies label, indicate the joint support (for the co-occurrences) of the respective Body and Head components of the respective association rules. Hence, in this graphical summary, the strongest support value was found for Swimming=Sometimes, which was associated with Gymnastic=Sometimes, Baseball=Sometimes, and Basketball=Sometimes. Incidentally, unlike simple frequency and crosstabulation tables, the absolute frequencies with which individual codes or text values (items) occur in the data are often not reflected in the association rules; instead, only those codes or text values (items) are retained that show sufficient values for support, confidence, and correlation, i.e., that co-occur with other codes or text values (items) with sufficient relative (co-)frequency.

The results that can be summarized in 2D Association Rules networks can be relatively simple, or complex, as illustrated in the network shown to the left.

This is an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare's "All's Well That Ends Well," after removing a few very frequent words like is, of, etc. Of course, the specific words and phrases removed during the data preparation phase of text (or data) mining projects will depend on the purpose of the research.

Association Rules Networks, 3D. Association rules can be graphically summarized in 2D Association Networks, as well as 3D Association Networks. Shown below are some (very clear) results from an analysis. Respondents in a survey were asked to list their (up to) 3 favorite fast-foods. The association rules derived from those data are summarized in a 3D Association Network display.

As in the 2D Association Network, the support values for the Body and Head portions of each association rule are indicated by the sizes and colors of each circle in the 2D plane. The thickness of each line indicates the confidence value (conditional probability of Head given Body) for the respective association rule; the sizes and colors of the "floating" circles plotted against the (vertical) z-axis indicate the joint support (for the co-occurrences) of the respective Body and Head components of the association rules. The plot position of each circle along the vertical z-axis indicates the respective confidence value. Hence, this particular graphical summary clearly shows two simple rules: Respondents who name Pizza as a preferred fast food also mention Hamburger, and vice versa.


Interpreting and Comparing Results

When comparing the results of applying association rules to those from simple frequency or cross-tabulation tables, you may notice that in some cases very high-frequency codes or text values (items) are not part of any association rule. This can sometimes be perplexing.

To illustrate how this pattern of findings can occur, consider this example: Suppose you analyzed data from a survey of insurance rates for different makes of automobiles in America. Simple tabulation would very likely show that many people drive automobiles manufactured by Ford, GM, and Chrysler; however, none of these makes may be associated with particular patterns in insurance rates, i.e., none of these brands may be involved in high-confidence, high-correlation association rules linking them to particular categories of insurance rates. However, when applying association rules methods, automobile makes which occur in the sample with relatively low frequency (e.g., Porsche) may be found to be associated with high insurance rates (allowing you to infer, for example, a rule that if Car=Porsche then Insurance=High). If you only reviewed a simple cross-tabulation table (make of car by insurance rate) this high-confidence association rule may well have gone unnoticed.
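
The pattern can be made concrete with a small, entirely hypothetical data set of 100 records (make of car, insurance rate). The counts below are invented to illustrate the point and do not come from any survey; the function name make_stats is likewise an assumption.

```python
# Hypothetical records: (make, insurance rate). Counts are invented for illustration.
records = (
    [("Ford", "High")] * 10 + [("Ford", "Low")] * 30
    + [("GM", "High")] * 9 + [("GM", "Low")] * 26
    + [("Chrysler", "High")] * 5 + [("Chrysler", "Low")] * 15
    + [("Porsche", "High")] * 5
)


def make_stats(records, make):
    """Return (relative frequency of the make, confidence of Insurance=High given the make)."""
    n_make = sum(1 for m, _ in records if m == make)
    n_high = sum(1 for m, rate in records if m == make and rate == "High")
    return n_make / len(records), n_high / n_make


for make in ["Ford", "GM", "Chrysler", "Porsche"]:
    freq, conf = make_stats(records, make)
    print(make, round(freq, 2), round(conf, 2))
```

In this made-up example Ford is the most frequent make but has low confidence for Insurance=High, while Porsche is rare yet every Porsche record carries a high rate, which is the kind of rule a simple frequency table would not highlight.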



Boosting Trees for Regression and Classification

• Boosting Trees for Regression and Classification Introductory Overview

• Gradient Boosting Trees

• The Problem of Overfitting; Stochastic Gradient Boosting

• Stochastic Gradient Boosting Trees and Classification

• Large Numbers of Categories

Boosting Trees for Regression and Classification Introductory Overview

The general computational approach of stochastic gradient boosting is also known by the names TreeNet (TM Salford Systems, Inc.) and MART (TM Jerill, Inc.). Over the past few years, this technique has emerged as one of the most powerful methods for predictive data mining. Some implementations of these powerful algorithms allow them to be used for regression as well as classification problems, with continuous and/or categorical predictors. Detailed technical descriptions of these methods can be found in Friedman (1999a, b) as well as Hastie, Tibshirani, & Friedman (2001).

Gradient Boosting Trees

The algorithm for Boosting Trees evolved from the application of boosting methods to regression trees. The general idea is to compute a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree. As described in the General Classification and Regression Trees Introductory Overview, this method will build binary trees, i.e., partition the data into two samples at each split node. Now suppose that you were to limit the complexities of the trees to 3 nodes only: a root node and two child nodes, i.e., a single split. Thus, at each step of the boosting trees algorithm, a simple (best) partitioning of the data is determined, and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next 3-node tree will then be fitted to those residuals, to find another partition that will further reduce the residual (error) variance for the data, given the preceding sequence of trees.

It can be shown that such "additive weighted expansions" of trees can eventually produce an excellent fit of the predicted values to the observed values, even if the specific nature of the relationships between the predictor variables and the dependent variable of interest is very complex (nonlinear in nature). Hence, the method of gradient boosting - fitting a weighted additive expansion of simple trees - represents a very general and powerful machine learning algorithm.
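
The idea of fitting successive 3-node trees to residuals can be sketched for a single continuous predictor. This is a minimal illustration under stated assumptions, not the implementation described in the cited references: each tree is a single split chosen by least squares, the weight applied to each tree is a fixed learning_rate (an assumed parameter), and the data are generated from a simple nonlinear function.

```python
def fit_stump(xs, residuals):
    """Best single split on x by least squares: (threshold, left mean, right mean)."""
    best = None
    values = sorted(set(xs))  # needs at least two distinct x values
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Additive expansion of stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump(xs, residuals)
        preds = [p + learning_rate * (lm if x <= thr else rm) for p, x in zip(preds, xs)]
    return preds


def mse(ys, preds):
    """Mean squared prediction error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Illustrative nonlinear data: y = x squared on a grid of 30 points.
xs = [i / 10.0 for i in range(30)]
ys = [x * x for x in xs]

for n in (1, 10, 50):
    print(n, round(mse(ys, boost(xs, ys, n)), 4))
```

The training error shrinks as trees are added, even though each individual tree can only describe a single split.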


The Problem of Overfitting; Stochastic Gradient Boosting

One of the major problems of all machine learning algorithms is to "know when to stop," i.e., how to prevent the learning algorithm from fitting esoteric aspects of the training data that are not likely to improve the predictive validity of the respective model. This issue is also known as the problem of overfitting. To reiterate, this is a general problem applicable to most machine learning algorithms used in predictive data mining. A general solution to this problem is to evaluate the quality of the fitted model by predicting observations in a test sample of data that have not been used before to estimate the respective model(s). In this manner, one hopes to gauge the predictive accuracy of the solution, and to detect when overfitting has occurred (or is starting to occur).

A similar approach is for each consecutive simple tree to be built for only a randomly selected subsample of the full data set. In other words, each consecutive tree is built for the prediction residuals (from all preceding trees) of an independently drawn random sample. The introduction of a certain degree of randomness into the analysis in this manner can serve as a powerful safeguard against overfitting (since each consecutive tree is built for a different sample of observations), and yield models (additive weighted expansions of simple trees) that generalize well to new observations, i.e., exhibit good predictive validity. This technique, i.e., performing consecutive boosting computations on independently drawn samples of observations, is known as stochastic gradient boosting.
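
The subsampling step can be sketched by drawing a fresh random subsample before each tree is fitted and then updating the predictions for all observations. As before, this is an illustration under assumptions rather than the referenced implementation: a single predictor, single-split trees, sampling without replacement, and the parameters subsample, learning_rate, and seed are all hypothetical choices.

```python
import random


def fit_stump(xs, residuals):
    """Best single split on x by least squares: (threshold, left mean, right mean)."""
    best = None
    values = sorted(set(xs))  # needs at least two distinct x values
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def stochastic_boost(xs, ys, n_trees, subsample=0.5, learning_rate=0.1, seed=0):
    """Each tree is fitted to the residuals of an independently drawn subsample."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)  # drawn without replacement
        sub_x = [xs[i] for i in idx]
        sub_r = [ys[i] - preds[i] for i in idx]
        thr, lm, rm = fit_stump(sub_x, sub_r)
        # The tree is fitted on the subsample but applied to every observation.
        preds = [p + learning_rate * (lm if x <= thr else rm) for p, x in zip(preds, xs)]
    return preds


def mse(ys, preds):
    """Mean squared prediction error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Illustrative data with distinct x values: y = x squared.
xs = [i / 10.0 for i in range(30)]
ys = [x * x for x in xs]
print(round(mse(ys, stochastic_boost(xs, ys, 100)), 4))
```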

Below is a plot of the prediction error function for the training data over successive trees and also an independently sampled testing data set at each stage.

With this graph, you can identify very quickly the point where the model (consisting of a certain number of successive trees) begins to overfit the data. Notice how the prediction error for the training data steadily decreases as more and more additive terms (trees) are added to the model. However, somewhere past 35 trees, the performance for independently sampled testing data actually begins to deteriorate, clearly indicating the point where the model begins to overfit the data.
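
Reading the stopping point off such a graph amounts to finding the number of trees at which the test error is lowest. The sketch below uses invented error values purely for illustration; they are not taken from the plot described above, and the function name is hypothetical.

```python
def best_number_of_trees(test_errors):
    """1-based number of trees at which the test-sample error is lowest."""
    best_index = min(range(len(test_errors)), key=lambda i: test_errors[i])
    return best_index + 1


# Hypothetical error curves after 1, 2, ..., 6 trees.
train_errors = [1.00, 0.80, 0.65, 0.55, 0.48, 0.43]  # keeps decreasing
test_errors = [1.05, 0.88, 0.78, 0.74, 0.76, 0.81]   # turns upward after 4 trees

print(best_number_of_trees(test_errors))
```

In this made-up example the training error keeps falling while the test error starts to rise after the fourth tree, so four trees would be retained.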

Stochastic Gradient Boosting Trees and Classification

So far, the discussion of boosting trees has exclusively focused on regression problems, i.e., on the prediction of a continuous dependent variable. The technique can easily be expanded to handle classification problems as well (this is described in detail in Friedman, 1999a, section 4.6; in particular, see Algorithm 6):

First, different boosting trees are built for (fitted to) each category or class of the categorical dependent variable, after creating a coded variable (vector) of values for each class with the values 1 or 0 to indicate whether or not an observation belongs to the respective class. In successive boosting steps, the algorithm will apply the logistic transformation (see also Nonlinear Estimation) to compute the residuals for subsequent boosting steps. To compute the final classification probabilities, the logistic transformation is again applied to the predictions for each 0/1 coded vector (class). This algorithm is described in detail in Friedman (1999a; see also Hastie, Tibshirani, and Friedman, 2001, for a description of this general procedure).
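
As a minimal sketch of this step, assume that the logistic transformation for several classes takes the symmetric multiple-logistic (softmax) form, and that the residual for each class is the 0/1 code minus the current class probability. The function names below are hypothetical and the scores are made-up values; this is an illustration, not a full implementation of the algorithm.

```python
import math


def class_probabilities(scores):
    """Logistic (softmax) transformation of the per-class boosted predictions."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_residuals(indicators, scores):
    """Residual per class: 0/1 class code minus the current class probability."""
    probs = class_probabilities(scores)
    return [y - p for y, p in zip(indicators, probs)]


# One observation that belongs to the first of three classes; all current
# predictions are zero, as they would be before the first boosting step.
indicators = [1, 0, 0]
scores = [0.0, 0.0, 0.0]
residuals = class_residuals(indicators, scores)
print(residuals)
```

The next tree for each class would then be fitted to that class's residuals, and the same transformation applied to the final predictions yields the classification probabilities.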

Large Numbers of Categories

Note that the procedure for applying this method to classification problems requires that separate sequences of (boosted) trees be built for each category or class. Hence, the computational effort generally becomes larger by a multiple of what it takes to solve a simple regression prediction problem (for a single continuous dependent variable). Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes; past that point, the computations performed may require an unreasonable amount of effort and time. (For example, a problem with 200 boosting steps and 100 categories or classes for the dependent variable would yield 200 * 100 = 20,000 individual trees!)

© Copyright StatSoft, Inc., 1984-2003

Process Analysis

• Sampling Plans

o General Purpose

o Computational Approach

o Means for H0 and H1

o Alpha and Beta Error Probabilities

o Fixed Sampling Plans

o Sequential Sampling Plans

o Summary

• Process (Machine) Capability Analysis

o Introductory Overview

o Computational Approach

o Process Capability Indices

o Process Performance vs. Process Capability

o Using Experiments to Improve Process Capability

o Testing the Normality Assumption

o Tolerance Limits

• Gage Repeatability and Reproducibility

o Introductory Overview

o Computational Approach

o Plots of Repeatability and Reproducibility

o Components of Variance

o Summary

• Non-Normal Distributions

o Introductory Overview

o Fitting Distributions by Moments

o Assessing the Fit: Quantile and Probability Plots

o Non-Normal Process Capability Indices (Percentile Method)

• Weibull and Reliability/Failure Time Analysis

o General Purpose

o The Weibull Distribution

o Censored Observations

o Two- and three-parameter Weibull Distribution

o Parameter Estimation

o Goodness of Fit Indices

o Interpreting Results

o Grouped Data

o Modified Failure Order for Multiple-Censored Data

o Weibull CDF, Reliability, and Hazard Functions

Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985); most process capability procedures (and indices) were only recently introduced to the US from Japan (Kane, 1986); however, they are discussed in three excellent recent hands-on books by Bhote (1988), Hart and Hart (1989), and Pyzdek (1989). Detailed discussions of these methods can also be found in Montgomery (1991).

Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991; referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991); a more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974).

Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990). In the following topics, we will briefly introduce the purpose and logic of each of these procedures. For more information on analyzing designs with random effects and for estimating components of variance, see the Variance Components chapter.

Sampling Plans

• General Purpose

• Computational Approach

• Means for H0 and H1

• Alpha and Beta Error Probabilities

• Fixed Sampling Plans

• Sequential Sampling Plans

• Summary

General Purpose

A common question that quality control engineers face is to determine how many items from a batch (e.g., shipment from a supplier) to inspect in order to ensure that the items (products) in that batch are of acceptable quality. For example, suppose we have a supplier of piston rings for small automotive engines that our company produces, and our goal is to establish a sampling procedure (of piston rings from the delivered batches) that ensures a specified quality. In principle, this problem is similar to that of on-line quality control discussed in Quality Control. In fact, you may want to read that section at this point to familiarize yourself with the issues involved in industrial statistical quality control.

Acceptance sampling. The procedures described here are useful whenever we need to decide whether or not a batch or lot of items complies with specifications, without having to inspect 100% of the items in the batch. Because of the nature of the problem -- whether or not to accept a batch -- these methods are also sometimes discussed under the heading of acceptance sampling.

Advantages over 100% inspection. An obvious advantage of acceptance sampling over 100% inspection of the batch or lot is that reviewing only a sample requires less time, effort, and money. In some cases, inspection of an item is destructive (e.g., stress testing of steel), and testing 100% would destroy the entire batch. Finally, from a managerial standpoint, rejecting an entire batch or shipment (based on acceptance sampling) from a supplier, rather than just a certain percent of defective items (based on 100% inspection) often provides a stronger incentive to the supplier to adhere to quality standards.

Computational Approach

In principle, the computational approach to the question of how large a sample to take is straightforward. Elementary Concepts discusses the concept of the sampling distribution. Briefly, if we were to take repeated samples of a particular size from a population of, for example, piston rings and compute their average diameters, then the distribution of those averages (means) would approach the normal distribution with a particular mean and standard deviation (or standard error; in sampling distributions the term standard error is preferred, in order to distinguish the variability of the means from the variability of the items in the population). Fortunately, we do not need to take repeated samples from the population in order to estimate the location (mean) and variability (standard error) of the sampling distribution. If we have a good idea (estimate) of what the variability (standard deviation or sigma) is in the population, then we can infer the sampling distribution of the mean. In principle, this information is sufficient to estimate the sample size that is needed in order to detect a certain change in quality (from target specifications). Without going into the details about the computational procedures involved, let us next review the particular information that the engineer must supply in order to estimate required sample sizes.

Means for H0 and H1

To formalize the inspection process of, for example, a shipment of piston rings, we can formulate two alternative hypotheses: First, we may hypothesize that the average piston ring diameters comply with specifications. This hypothesis is called the null hypothesis (H0). The second and alternative hypothesis (H1) is that the diameters of the piston rings delivered to us deviate from specifications by more than a certain amount. Note that we may specify these types of hypotheses not just for measurable variables such as diameters of piston rings, but also for attributes. For example, we may hypothesize (H1) that the number of defective parts in the batch exceeds a certain percentage. Intuitively, it should be clear that the larger the difference between H0 and H1, the smaller the sample necessary to detect this difference (see Elementary Concepts).

Alpha and Beta Error Probabilities

To return to the piston rings example, there are two types of mistakes that we can make when inspecting a batch of piston rings that has just arrived at our plant. First, we may erroneously reject H0, that is, reject the batch because we erroneously conclude that the piston ring diameters deviate from target specifications. The probability of committing this mistake is usually called the alpha error probability. The second mistake that we can make is to erroneously not reject H0 (accept the shipment of piston rings), when, in fact, the mean piston ring diameter deviates from the target specification by a certain amount. The probability of committing this mistake is usually called the beta error probability. Intuitively, the more certain we want to be, that is, the lower we set the alpha and beta error probabilities, the larger the sample will have to be; in fact, in order to be 100% certain, we would have to measure every single piston ring delivered to our company.
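
To make the relation between these quantities concrete, here is a minimal sketch of a sample size computation. It assumes a two-sided test of the mean with known sigma and uses the usual normal approximation, n = ((z(1-alpha/2) + z(1-beta)) * Sigma / Shift)^2; the function name and the numeric values are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, shift, alpha=0.05, beta=0.10):
    """Approximate n needed to detect a mean shift (two-sided test, known sigma)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power 1 - beta
    n = ((z_alpha + z_beta) * sigma / abs(shift)) ** 2
    return math.ceil(n)                            # round up to whole items


# Hypothetical piston ring values: sigma = 0.01 mm.
print(required_sample_size(sigma=0.01, shift=0.01))    # larger shift, smaller n
print(required_sample_size(sigma=0.01, shift=0.005))   # smaller shift, larger n
print(required_sample_size(sigma=0.01, shift=0.01, alpha=0.01, beta=0.01))
```

The three calls illustrate the two points made in the text: halving the shift to be detected roughly quadruples the sample size, and lowering the alpha and beta error probabilities increases it further.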

Fixed Sampling Plans

To construct a simple sampling plan, we would first decide on a sample size, based on the means under H0/H1 and the particular alpha and beta error probabilities. Then, we would take a single sample of this fixed size and, based on the mean in this sample, decide whether to accept or reject the batch. This procedure is referred to as a fixed sampling plan.

Operating characteristic (OC) curve. The power of the fixed sampling plan can be summarized via the operating characteristic curve. In that plot, the probability of rejecting H0 (and accepting H1) is plotted on the Y axis, as a function of an actual shift from the target (nominal) specification to the respective values shown on the X axis of the plot (see example below). This probability is, of course, one minus the beta error probability of erroneously rejecting H1 and accepting H0; this value is referred to as the power of the fixed sampling plan to detect deviations. Also indicated in this plot are the power functions for smaller sample sizes.
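
The power function plotted in such a curve can be sketched as follows, again assuming a two-sided test of the mean with known sigma (normal approximation). The function name and the numeric values are hypothetical.

```python
from statistics import NormalDist


def power_two_sided(shift, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean is off target by `shift`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Shift expressed in standard errors of the mean.
    standardized_shift = shift * n ** 0.5 / sigma
    return z.cdf(-z_crit + standardized_shift) + z.cdf(-z_crit - standardized_shift)


# Tabulate the curve for two hypothetical sample sizes (sigma = 0.01 mm).
for shift in (0.0, 0.0025, 0.005, 0.0075, 0.01):
    row = [round(power_two_sided(shift, 0.01, n), 3) for n in (5, 11)]
    print(shift, row)
```

At a shift of zero the power equals alpha, and for any given shift the smaller sample has lower power, which is what the additional curves for smaller sample sizes show.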

Sequential Sampling Plans

As an alternative to the fixed sampling plan, we could randomly choose individual piston rings and record their deviations from specification. As we continue to measure each piston ring, we could keep a running total of the sum of deviations from specification. Intuitively, if H1 is true, that is, if the average piston ring diameter in the batch is not on target, then we would expect to observe a slowly increasing or decreasing cumulative sum of deviations, depending on whether the average diameter in the batch is larger or smaller than the specification, respectively. It turns out that this kind of sequential sampling of individual items from the batch is a more sensitive procedure than taking a fixed sample. In practice, we continue sampling until we either accept or reject the batch.

Using a sequential sampling plan. Typically, we would produce a graph in which the cumulative deviations from specification (plotted on the Y-axis) are shown for successively sampled items (e.g., piston rings, plotted on the X-axis). Then two sets of lines are drawn in this graph to denote the "corridor" along which we will continue to draw samples, that is, as long as the cumulative sum of deviations from specifications stays within this corridor, we continue sampling.

If the cumulative sum of deviations steps outside the corridor, we stop sampling. If the cumulative sum moves above the upper line or below the lower line, we reject the batch. If the cumulative sum steps out of the corridor to the inside, that is, if it moves closer to the center line, we accept the batch (since this indicates zero deviation from specification). Note that the inside area starts only at a certain sample number; this indicates the minimum number of samples necessary to accept the batch (with the current error probability).
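
The stopping logic can be illustrated with Wald's sequential probability ratio test for a normal mean with known sigma. Note that this is a one-sided simplification chosen for illustration (H0: mean deviation of zero; H1: mean deviation equal to a specified positive shift), not the exact two-sided corridor plot described above; the function name and the numeric values are hypothetical.

```python
import math


def sprt_normal_mean(deviations, shift, sigma, alpha=0.05, beta=0.10):
    """One-sided Wald SPRT: returns (decision, number of items inspected)."""
    upper = math.log((1 - beta) / alpha)  # crossing this line rejects the batch
    lower = math.log(beta / (1 - alpha))  # crossing this line accepts the batch
    log_likelihood_ratio = 0.0
    for count, deviation in enumerate(deviations, start=1):
        # Increment of the log likelihood ratio for one observed deviation.
        log_likelihood_ratio += (shift / sigma ** 2) * (deviation - shift / 2)
        if log_likelihood_ratio >= upper:
            return "reject", count
        if log_likelihood_ratio <= lower:
            return "accept", count
    return "continue", len(deviations)  # still inside the corridor


# Hypothetical measurements (deviations from specification, in mm).
print(sprt_normal_mean([0.0] * 10, shift=0.01, sigma=0.01))
print(sprt_normal_mean([0.01] * 10, shift=0.01, sigma=0.01))
print(sprt_normal_mean([0.005] * 3, shift=0.01, sigma=0.01))
```

As in the plot, a decision to accept is possible only after a minimum number of items, because the cumulative statistic needs several observations to reach the acceptance line.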

Summary

To summarize, the idea of (acceptance) sampling is to use statistical "inference" to accept or reject an entire batch of items, based on the inspection of only relatively few items from that batch. The advantage of applying statistical reasoning to this decision is that we can be explicit about the probabilities of making a wrong decision.

Whenever possible, sequential sampling plans are preferable to fixed sampling plans because they are more powerful. In most cases, relative to the fixed sampling plan, using sequential plans requires fewer items to be inspected in order to arrive at a decision with the same degree of certainty.

Process (Machine) Capability Analysis

• Introductory Overview

• Computational Approach

• Process Capability Indices

• Process Performance vs. Process Capability

• Using Experiments to Improve Process Capability

• Testing the Normality Assumption

• Tolerance Limits

Introductory Overview

See also, Non-Normal Distributions.

Quality Control describes numerous methods for monitoring the quality of a production process. However, once a process is under control the question arises, "to what extent does the long-term performance of the process comply with engineering requirements or managerial goals?" For example, to return to our piston ring example, how many of the piston rings that we are using fall within the design specification limits? In more general terms, the question is, "how capable is our process (or supplier) in terms of producing items within the specification limits?" Most of the procedures and indices described here were only recently introduced to the US by Ford Motor Company (Kane, 1986). They allow us to summarize the process capability in terms of meaningful percentages and indices.

In this topic, the computation and interpretation of process capability indices will first be discussed for the normal distribution case. If the distribution of the quality characteristic of interest does not follow the normal distribution, modified capability indices can be computed based on the percentiles of a fitted non-normal distribution.

Order of business. Note that it makes little sense to examine the process capability if the process is not in control. If the means of successively taken samples fluctuate widely, or are clearly off the target specification, then those quality problems should be addressed first. Therefore, the first step towards a high-quality process is to bring the process under control, using the charting techniques available in Quality Control.

Computational Approach

Once a process is in control, we can ask the question concerning the process capability. Again, the approach to answering this question is based on "statistical" reasoning, and is actually quite similar to that presented earlier in the context of sampling plans. To return to the piston ring example, given a sample of a particular size, we can estimate the standard deviation of the process, that is, the resultant ring diameters. We can then draw a histogram of the distribution of the piston ring diameters. As we discussed earlier, if the distribution of the diameters is normal, then we can make inferences concerning the proportion of piston rings within specification limits.

(For non-normal distributions, see Percentile Method.) Let us now review some of the major indices that are commonly used to describe process capability.

Capability Analysis - Process Capability Indices

Process range. First, it is customary to establish the ± 3 sigma limits around the nominal specifications. Actually, the sigma limits should be the same as the ones used to bring the process under control using Shewhart control charts (see Quality Control). These limits denote the range of the process (i.e., process range). If we use the ± 3 sigma limits then, based on the normal distribution, we can estimate that approximately 99.73% of all piston rings fall within these limits.

Specification limits LSL, USL. Usually, engineering requirements dictate a range of acceptable values. In our example, it may have been determined that acceptable values for the piston ring diameters would be 74.0 ± .02 millimeters. Thus, the lower specification limit (LSL) for our process is 74.0 - 0.02 = 73.98; the upper specification limit (USL) is 74.0 + 0.02 = 74.02. The difference between USL and LSL is called the specification range.

Potential capability (Cp). This is the simplest and most straightforward indicator of process capability. It is defined as the ratio of the specification range to the process range; using ± 3 sigma limits we can express this index as:

Cp = (USL-LSL)/(6*Sigma)

Put into words, this ratio expresses the proportion of the range of the normal curve that falls within the engineering specification limits (provided that the mean is on target, that is, that the process is centered, see below).

Bhote (1988) reports that prior to the widespread use of statistical quality control techniques (prior to 1980), the normal quality of US manufacturing processes was approximately Cp = .67. This means that the specification range covers only about two thirds of the ± 3 sigma process range; for a centered, normally distributed process, the specification limits then lie at about ± 2 sigma, so that roughly 4.6% of all items (about 2.3% in each tail of the normal curve) fall outside specification limits. As of 1988, only about 30% of US processes were at or below this level of quality (see Bhote, 1988, p. 51). Ideally, of course, we would like this index to be greater than 1, that is, we would like to achieve a process capability so that no (or almost no) items fall outside specification limits. Interestingly, in the early 1980's the Japanese manufacturing industry adopted as their standard Cp = 1.33! The process capability required to manufacture high-tech products is usually even higher than this; Minolta has established a Cp index of 2.0 as their minimum standard (Bhote, 1988, p. 53), and as the standard for its suppliers. Note that high process capability usually implies lower, not higher costs, taking into account the costs due to poor quality. We will return to this point shortly.
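
To relate a Cp value to the expected proportion of items outside the specification limits, the following sketch assumes a normally distributed and perfectly centered process, in which case the specification limits lie at ± 3*Cp sigma from the mean. The function name is hypothetical.

```python
from statistics import NormalDist


def fraction_out_of_spec(cp):
    """Expected fraction outside LSL/USL for a centered, normal process."""
    # Both tails beyond +/- 3*Cp sigma are counted.
    return 2 * NormalDist().cdf(-3 * cp)


for cp in (0.67, 1.0, 1.33, 2.0):
    print(cp, fraction_out_of_spec(cp))
```

The proportion drops very quickly as Cp increases, which is why standards such as Cp = 1.33 or Cp = 2.0 correspond to almost no items outside the specification limits.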

Capability ratio (Cr). This index is equivalent to Cp; specifically, it is computed as 1/Cp (the inverse of Cp).

Lower/upper potential capability: Cpl, Cpu. A major shortcoming of the Cp (and Cr) index is that it may yield erroneous information if the process is not on target, that is, if it is not centered. We can express non-centering via the following quantities. First, upper and lower potential capability indices can be computed to reflect the deviation of the observed process mean from the LSL and USL. Assuming ± 3 sigma limits as the process range, we compute:

Cpl = (Mean - LSL)/(3*Sigma)

and

Cpu = (USL - Mean)/(3*Sigma)

Obviously, if these values are not identical to each other, then the process is not centered.

Non-centering correction (K). We can correct Cp for the effects of non-centering. Specifically, we can compute:

K=abs(D - Mean)/(1/2*(USL - LSL))

where

D = (USL+LSL)/2.

This correction factor expresses the non-centering (target specification minus mean, in absolute value) relative to half the specification range.

Demonstrated excellence (Cpk). Finally, we can adjust Cp for the effect of non-centering by computing:

Cpk = (1-K)*Cp

If the process is perfectly centered, then K is equal to zero, and Cpk is equal to Cp. However, as the process drifts from the target specification, K increases and Cpk becomes smaller than Cp.

Potential Capability II: Cpm. A recent modification (Chan, Cheng, & Spiring, 1988) to Cp is directed at adjusting the estimate of sigma for the effect of (random) non-centering. Specifically, we may compute the alternative sigma (Sigma2) as:

Sigma2 = {Σ(xi - TS)^2/(n-1)}^(1/2)

where:

Sigma2 is the alternative estimate of sigma

xi is the value of the i'th observation in the sample

TS is the target or nominal specification

n is the number of observations in the sample

We may then use this alternative estimate of sigma to compute Cp as before; however, we will refer to the resultant index as Cpm.
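
The indices reviewed in this section can be computed together, as in the following minimal sketch. The function name and the measurements are hypothetical (made-up diameters around the 74.0 ± .02 mm specification used above), and sigma is estimated here by the ordinary sample standard deviation.

```python
import math


def capability_indices(values, lsl, usl, target=None):
    """Compute Cp, Cr, Cpl, Cpu, K, Cpk, and Cpm from a sample of measurements."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    d = (usl + lsl) / 2                      # midpoint of the specification limits
    if target is None:
        target = d                           # assume the target is the midpoint

    cp = (usl - lsl) / (6 * sigma)
    cpl = (mean - lsl) / (3 * sigma)
    cpu = (usl - mean) / (3 * sigma)
    k = abs(d - mean) / (0.5 * (usl - lsl))  # non-centering correction
    # Alternative sigma: deviations taken from the target instead of the mean.
    sigma2 = math.sqrt(sum((x - target) ** 2 for x in values) / (n - 1))
    return {
        "Cp": cp,
        "Cr": 1 / cp,
        "Cpl": cpl,
        "Cpu": cpu,
        "K": k,
        "Cpk": (1 - k) * cp,
        "Cpm": (usl - lsl) / (6 * sigma2),
    }


# Made-up piston ring diameters (mm), slightly above the 74.0 target on average.
diameters = [74.001, 73.999, 74.003, 73.997, 74.005, 74.002, 73.998, 74.004]
indices = capability_indices(diameters, lsl=73.98, usl=74.02)
for name, value in indices.items():
    print(name, round(value, 3))
```

Because the sample mean is not exactly on target, Cpl and Cpu differ, K is greater than zero, and both Cpk and Cpm come out smaller than Cp.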

Process Performance vs. Process Capability

When monitoring a process via a quality control chart (e.g., the X-bar and R-chart; Quality Control) it is often useful to compute the capability indices for the process. Specifically, when the data set consists of multiple samples, such as data collected for the quality control chart, then one can compute two different indices of variability in the data. One is the regular standard deviation for all observations, ignoring the fact that the data consist of multiple samples; the other is to estimate the process's inherent variation from the within-sample variability. For example, when plotting X-bar and R-charts one may use the common estimator R-bar/d2 for the process sigma (e.g., see Duncan, 1974; Montgomery, 1985, 1991). Note however, that this estimator is only valid if the process is statistically stable. For a detailed discussion of the difference between the total process variation and the inherent variation refer to ASQC/AIAG reference manual (ASQC/AIAG, 1991, page 80).

When the total process variability is used in the standard capability computations, the resulting indices are usually referred to as process performance indices (as they describe the actual performance of the process), while indices computed from the inherent variation (within-sample sigma) are referred to as capability indices (since they describe the inherent capability of the process).
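
The two estimates of sigma can be contrasted in a short sketch. The function names and the data are hypothetical; the d2 values are the standard control chart constants for samples of size 2 through 5, and all samples are assumed to be of equal size.

```python
import math

# Standard d2 control chart constants for sample sizes 2 to 5.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}


def overall_sigma(samples):
    """Standard deviation of all observations, ignoring the sample structure."""
    values = [x for sample in samples for x in sample]
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def within_sigma(samples):
    """Inherent (within-sample) sigma estimated as R-bar/d2."""
    sample_size = len(samples[0])
    r_bar = sum(max(s) - min(s) for s in samples) / len(samples)
    return r_bar / D2[sample_size]


# Made-up data in which the sample means drift upward over time.
samples = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print("total sigma:", overall_sigma(samples))
print("within-sample sigma:", within_sigma(samples))
```

When the sample means drift, as in this example, the total sigma is much larger than the within-sample sigma, so the performance indices will be lower than the corresponding capability indices.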

Using Experiments to Improve Process Capability

As mentioned before, the higher the Cp index, the better the process -- and there is virtually no upper limit to this relationship. The issue of quality costs, that is, the losses due to poor quality, is discussed in detail in the context of Taguchi robust design methods (see Experimental Design). In general, higher quality usually results in lower costs overall; even though the costs of production may increase, the losses due to poor quality, for example, due to customer complaints, loss of market share, etc. are usually much greater. In practice, two or three well-designed experiments carried out over a few weeks can often achieve a Cp of 5 or higher. If you are not familiar with the use of designed experiments, but are concerned with the quality of a process, we strongly recommend that you review the methods detailed in Experimental Design.

Testing the Normality Assumption

The indices we have just reviewed are only meaningful if, in fact, the quality characteristic that is being measured is normally distributed. Specific tests of the normality assumption (the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests) are available; these tests are described in most statistics textbooks, and they are also discussed in greater detail in Nonparametrics and Distribution Fitting.

A visual check for normality is to examine the probability-probability and quantile-quantile plots for the normal distribution. For more information, see Process Analysis and Non-Normal Distributions.

Tolerance Limits

Before the introduction of process capability indices in the early 1980's, the common method for estimating the characteristics of a production process was to estimate and examine the tolerance limits of the process (see, for example, Hald, 1952). The logic of this procedure is as follows. Let us assume that the respective quality characteristic is normally distributed in the population of items produced; we can then estimate the lower and upper interval limits that will ensure with a certain level of confidence (probability) that a certain percent of the population is included in those limits. Put another way, given:

1. a specific sample size (n),

2. the process mean,

3. the process standard deviation (sigma),

4. a confidence level, and

5. the percent of the population that we want to be included in the interval,

we can compute the corresponding tolerance limits that will satisfy all these parameters. You can also compute parameter-free tolerance limits that are not based on the assumption of normality (Scheffe & Tukey, 1944, p. 217; Wilks, 1946, p. 93; see also Duncan, 1974, or Montgomery, 1985, 1991).
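
For the parameter-free case, a minimal sketch is given below. It assumes that the tolerance limits are simply the smallest and largest observations in a sample of size n, and uses the standard result that the confidence of covering at least a proportion p of the population is 1 - n*p^(n-1) + (n-1)*p^n. The function names are hypothetical.

```python
def coverage_confidence(n, proportion):
    """Confidence that [sample min, sample max] covers at least `proportion`."""
    return 1 - n * proportion ** (n - 1) + (n - 1) * proportion ** n


def min_sample_size(proportion, confidence):
    """Smallest n for which the min/max interval reaches the given confidence."""
    n = 2
    while coverage_confidence(n, proportion) < confidence:
        n += 1
    return n


# Sample size needed so that, with 95% confidence, the range of the sample
# includes at least 95% of the population.
print(min_sample_size(proportion=0.95, confidence=0.95))  # → 93
```

No assumption about the shape of the distribution is needed here, but the price is a fairly large sample compared with limits that rely on normality.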

See also, Non-Normal Distributions.

Gage Repeatability and Reproducibility

• Introductory Overview

• Computational Approach

• Plots of Repeatability and Reproducibility

• Components of Variance

• Summary

Introductory Overview

Gage repeatability and reproducibility analysis addresses the issue of precision of measurement. The purpose of repeatability and reproducibility experiments is to determine the proportion of measurement variability that is due to (1) the items or parts being measured (part-to-part variation), (2) the operator or appraiser of the gages (reproducibility), and (3) errors (unreliabilities) in the measurements over several trials by the same operators of the same parts (repeatability). In the ideal case, all variability in measurements will be due to the part-to-part variation, and only a negligible proportion of the variability will be due to operator reproducibility and trial-to-trial repeatability.

To return to the piston ring example, if we require detection of deviations from target specifications of the magnitude of .01 millimeters, then we obviously need to use gages of sufficient precision. The procedures described here allow the engineer to evaluate the precision of gages and different operators (users) of those gages, relative to the variability of the items in the population.

You can compute the standard indices of repeatability, reproducibility, and part-to-part variation, based either on ranges (as is still common in these types of experiments) or from the analysis of variance (ANOVA) table (as, for example, recommended in ASQC/AIAG, 1990, page 65). The ANOVA table will also contain an F test (statistical significance test) for the operator-by-part interaction, and report the estimated variances, standard deviations, and confidence intervals for the components of the ANOVA model.

Finally, you can compute the respective percentages of total variation, and report so-called percent-of-tolerance statistics. These measures are briefly discussed in the following sections of this introduction. Additional information can be found in Duncan (1974), Montgomery (1991), or the DataMyte Handbook (1992); step-by-step instructions and examples are also presented in the ASQC/AIAG Measurement systems analysis reference manual (1990) and the ASQC/AIAG Fundamental statistical process control reference manual (1991).

Note that there are several other statistical procedures which may be used to analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular the methods discussed in the Variance Components and Mixed Model ANOVA/ANCOVA chapter are very efficient for analyzing very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).

Computational Approach

One may think of each measurement as consisting of the following components:

1. a component due to the characteristics of the part or item being measured,

2. a component due to the reliability of the gage, and

3. a component due to the characteristics of the operator (user) of the gage.

The method of measurement (measurement system) is reproducible if different users of the gage come up with identical or very similar measurements. A measurement method is repeatable if repeated measurements of the same part produce identical results. Both of these characteristics -- repeatability and reproducibility -- will affect the precision of the measurement system.

We can design an experiment to estimate the magnitudes of each component, that is, the repeatability, reproducibility, and the variability between parts, and thus assess the precision of the measurement system. In essence, this procedure amounts to an analysis of variance (ANOVA) on an experimental design which includes as factors different parts, operators, and repeated measurements (trials). We can then estimate the corresponding variance components (the term was first used by Daniels, 1939) to assess the repeatability (variance due to differences across trials), reproducibility (variance due to differences across operators), and variability between parts (variance due to differences across parts). If you are not familiar with the general idea of ANOVA, you may want to refer to ANOVA/MANOVA. In fact, the extensive features provided there can also be used to analyze repeatability and reproducibility studies.

Plots of Repeatability and Reproducibility

There are several ways to summarize via graphs the findings from a repeatability and reproducibility experiment. For example, suppose we are manufacturing small kilns that are used for drying materials for other industrial production processes. The kilns should operate at a target temperature of around 100 degrees Celsius. In this study, 5 different engineers (operators) measured the same sample of 8 kilns (parts), three times each (three trials). We can plot the mean ratings of the 8 parts by operator. If the measurement system is reproducible, then the pattern of means across parts should be quite consistent across the 5 engineers who participated in the study.

R and S charts. Quality Control discusses in detail the idea of R (range) and S (sigma) plots for controlling process variability. We can apply those ideas here and produce a plot of ranges (or sigmas) by operators or by parts; these plots will allow us to identify outliers among operators or parts. If one operator produced particularly wide ranges of measurements, we may want to find out why that particular person had problems producing reliable measurements (e.g., perhaps he or she failed to understand the instructions for using the measurement gage).

Analogously, producing an R chart by parts may allow us to identify parts that are particularly difficult to measure reliably; again, inspecting that particular part may give us some insights into the weaknesses in our measurement system.

Repeatability and reproducibility summary plot. The summary plot shows the individual measurements by each operator; specifically, the measurements are shown in terms of deviations from the respective average rating for the respective part. Each trial is represented by a point, and the different measurement trials for each operator for each part are connected by a vertical line. Boxes drawn around the measurements give us a general idea of a particular operator's bias (see graph below).

Components of Variance (see also the Variance Components chapter)

Percent of Process Variation and Tolerance. The Percent Tolerance allows you to evaluate the performance of the measurement system with regard to the overall process variation, and the respective tolerance range. You can specify the tolerance range (Total tolerance for parts) and the Number of sigma intervals. The latter value is used in the computations to define the range (spread) of the respective (repeatability, reproducibility, part-to-part, etc.) variability. Specifically, the default value (5.15) defines 5.15 times the respective sigma estimate as the respective range of values; if the data are normally distributed, then this range defines 99% of the area under the normal curve, that is, the range that will include 99% of all values (or reproducibility/repeatability errors) due to the respective source of variation.

Percent of process variation. This value reports the variability due to different sources relative to the total variability (range) in the measurements.

Analysis of Variance. Rather than computing variance components estimates based on ranges, a more accurate method for computing these estimates is based on the ANOVA mean squares (see Duncan, 1974; ASQC/AIAG, 1990).

One may treat the three factors in the R & R experiment (Operator, Parts, Trials) as random factors in a three-way ANOVA model (see also General ANOVA/MANOVA). For details concerning the different models that are typically considered, refer to ASQC/AIAG (1990, pages 92-95), or to Duncan (1974, pages 716-734). Customarily, it is assumed that all interaction effects by the trial factor are non-significant. This assumption seems reasonable, since, for example, it is difficult to imagine how the measurement of some parts will be systematically different in successive trials, in particular when parts and trials are randomized.

However, the Operator by Parts interaction may be important. For example, it is conceivable that certain less experienced operators will be more prone to particular biases, and hence will arrive at systematically different measurements for particular parts. If so, then one would expect a significant two-way interaction (again, refer to General ANOVA/MANOVA if you are not familiar with ANOVA terminology).

When the two-way interaction is statistically significant, one can separately estimate the variance components due to operator variability and due to the operator-by-parts variability.

In the case of significant interactions, the combined repeatability and reproducibility variability is defined as the sum of three components: repeatability (gage error), operator variability, and the operator-by-part variability.

If the Operator by Part interaction is not statistically significant, a simpler additive model without interactions can be used.

Summary

To summarize, the purpose of the repeatability and reproducibility procedures is to allow the quality control engineer to assess the precision of the measurement system (gages) used in the quality control process. Obviously, if the measurement system is not repeatable (large variability across trials) or reproducible (large variability across operators) relative to the variability between parts, then the measurement system is not sufficiently precise to be used in the quality control efforts. For example, it should not be used in charts produced via Quality Control, or product capability analyses and acceptance sampling procedures via Process Analysis.

Non-Normal Distributions

• Introductory Overview

• Fitting Distributions by Moments

• Assessing the Fit: Quantile and Probability Plots

• Non-Normal Process Capability Indices (Percentile Method)

Introductory Overview

General Purpose. The concept of process capability is described in detail in the Process Capability Overview. To reiterate, when judging the quality of a (e.g., production) process it is useful to estimate the proportion of items produced that fall outside a predefined acceptable specification range. For example, the so-called Cp index is computed as:

Cp = (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, and USL and LSL are the upper and lower specification limits, respectively. If the distribution of the respective quality characteristic or variable (e.g., size of piston rings) is normal, and the process is perfectly centered (i.e., the mean is equal to the design center), then this index can be interpreted as the proportion of the range of the standard normal curve (the process width) that falls within the engineering specification limits. If the process is not centered, an adjusted index Cpk is used instead.
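As a brief illustration, the two indices can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the Cpk formula used is the standard normal-theory definition (the smaller of the two one-sided indices), which the text mentions but does not write out.

```python
def cp(usl, lsl, sigma):
    """Potential capability: specification range divided by the 6-sigma process width."""
    return (usl - lsl) / (6.0 * sigma)

def cpk(usl, lsl, mean, sigma):
    """Capability adjusted for centering (assumed standard definition)."""
    upper = (usl - mean) / (3.0 * sigma)
    lower = (mean - lsl) / (3.0 * sigma)
    return min(upper, lower)

# Illustrative values only.
centered = cpk(usl=12.0, lsl=6.0, mean=9.0, sigma=1.0)   # equals Cp when centered
shifted = cpk(usl=12.0, lsl=6.0, mean=10.0, sigma=1.0)   # smaller than Cp when off-center
```

When the process mean sits at the design center the two indices coincide; any shift of the mean lowers Cpk while leaving Cp unchanged.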

Non-Normal Distributions. You can fit non-normal distributions to the observed histogram, and compute capability indices based on the respective fitted non-normal distribution (via the percentile method). In addition, instead of computing capability indices by fitting specific distributions, you can compute capability indices based on two different general families of distributions -- the Johnson distributions (Johnson, 1965; see also Hahn and Shapiro, 1967) and Pearson distributions (Johnson, Nixon, Amos, and Pearson, 1963; Gruska, Mirkhani, and Lamberson, 1989; Pearson and Hartley, 1972) -- which allow the user to approximate a wide variety of continuous distributions. For all distributions, the user can also compute the table of expected frequencies, the expected number of observations beyond specifications, and quantile-quantile and probability-probability plots. The specific method for computing process capability indices from these distributions is described in Clements (1989).

Quantile-quantile plots and probability-probability plots. There are various methods for assessing the quality of respective fit to the observed data. In addition to the table of observed and expected frequencies for different intervals, and the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests, you can compute quantile and probability plots for all distributions. These scatterplots are constructed so that if the observed values follow the respective distribution, then the points will form a straight line in the plot. These plots are described further below.

Fitting Distributions by Moments

In addition to the specific continuous distributions described above, you can fit general "families" of distributions -- the so-called Johnson and Pearson curves -- with the goal of matching the first four moments of the observed distribution.

General approach. The shapes of most continuous distributions can be sufficiently summarized in the first four moments. Put another way, if one fits to a histogram of observed data a distribution that has the same mean (first moment), variance (second moment), skewness (third moment) and kurtosis (fourth moment) as the observed data, then one can usually approximate the overall shape of the distribution very well. Once a distribution has been fitted, one can then calculate the expected percentile values under the (standardized) fitted curve, and estimate the proportion of items produced by the process that fall within the specification limits.
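The four quantities to be matched can be computed from the data as sketched below. The function name is hypothetical, and the sketch assumes the simple population (divide by n) forms of the moments and reports kurtosis as excess kurtosis (0 for the normal distribution); software packages may apply small-sample corrections.

```python
def first_four_moments(data):
    """Return mean, variance, skewness, and excess kurtosis of the observations."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance (second central moment)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0                # excess kurtosis
    return mean, m2, skewness, kurtosis

mean, variance, skewness, kurtosis = first_four_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```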

Johnson curves. Johnson (1949) described a system of frequency curves that represents transformations of the standard normal curve (see Hahn and Shapiro, 1967, for details). By applying these transformations to a standard normal variable, a wide variety of non-normal distributions can be approximated, including distributions which are bounded on either one or both sides (e.g., U-shaped distributions). The advantage of this approach is that once a particular Johnson curve has been fit, the normal integral can be used to compute the expected percentage points under the respective curve. Methods for fitting Johnson curves, so as to approximate the first four moments of an empirical distribution, are described in detail in Hahn and Shapiro (1967, pages 199-220) and Hill, Hill, and Holder (1976).
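To illustrate how the normal integral is used once a curve has been fit, the sketch below evaluates the unbounded (S_U) member of the Johnson system, whose transformation is z = gamma + delta * asinh((x - xi) / lambda). The parameter names and values are illustrative assumptions; fitting the parameters to the first four moments is a separate step that is not shown.

```python
import math
from statistics import NormalDist

def johnson_su_cdf(x, gamma, delta, xi, lam):
    """P(X <= x) for a Johnson S_U curve, obtained from the standard normal integral."""
    z = gamma + delta * math.asinh((x - xi) / lam)
    return NormalDist().cdf(z)

def johnson_su_percentile(p, gamma, delta, xi, lam):
    """Value below which a proportion p of the fitted curve lies (inverse transformation)."""
    z = NormalDist().inv_cdf(p)
    return xi + lam * math.sinh((z - gamma) / delta)

# Hypothetical parameter values, chosen only for illustration.
params = dict(gamma=-0.5, delta=1.2, xi=10.0, lam=2.0)
x_upper = johnson_su_percentile(0.99865, **params)
```

Because the transformed variable is standard normal, percentage points reduce to a lookup of the normal integral followed by the inverse transformation.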

Pearson curves. Another system of distributions was proposed by Karl Pearson (e.g., see Hahn and Shapiro, 1967, pages 220-224). The system consists of seven solutions (of 12 originally enumerated by Pearson) to a differential equation which also approximate a wide range of distributions of different shapes. Gruska, Mirkhani, and Lamberson (1989) describe in detail how the different Pearson curves can be fit to an empirical distribution. A method for computing specific Pearson percentiles is also described in Davis and Stephens (1983).

Assessing the Fit: Quantile and Probability Plots

For each distribution, you can compute the table of expected and observed frequencies and the respective Chi-square goodness-of-fit test, as well as the Kolmogorov-Smirnov d test. However, the best way to assess the quality of the fit of a theoretical distribution to an observed distribution is to review the plot of the observed distribution against the theoretical fitted distribution. There are two standard types of plots used for this purpose: quantile-quantile plots and probability-probability plots.

Quantile-quantile plots. In quantile-quantile plots (or Q-Q plots for short), the observed values of a variable are plotted against the theoretical quantiles. To produce a Q-Q plot, you first sort the n observed data points into ascending order, so that:

x1 ≤ x2 ≤ ... ≤ xn

These observed values are plotted against one axis of the graph; on the other axis the plot will show:

F^-1((i - radj)/(n + nadj))

where i is the rank of the respective observation, radj and nadj are adjustment factors (≤ 0.5), and F^-1 denotes the inverse of the probability integral for the respective standardized distribution. The resulting plot (see example below) is a scatterplot of the observed values against the (standardized) expected values, given the respective distribution. Note that, in addition to the inverse probability integral value, you can also show the respective cumulative probability values on the opposite axis, that is, the plot will show not only the standardized values for the theoretical distribution, but also the respective p-values.

A good fit of the theoretical distribution to the observed values would be indicated by this plot if the plotted values fall onto a straight line. Note that the adjustment factors radj and nadj ensure that the p-value for the inverse probability integral will fall between 0 and 1, but not including 0 and 1 (see Chambers, Cleveland, Kleiner, and Tukey, 1983).
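The construction of the Q-Q plot coordinates can be sketched as follows for the standard normal distribution. The function name is hypothetical, and the adjustment values 0.375 and 0.25 are assumed defaults chosen only because both are below 0.5; other choices are equally admissible.

```python
from statistics import NormalDist

def qq_points(data, r_adj=0.375, n_adj=0.25):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    xs = sorted(data)                      # observed values in ascending order
    n = len(xs)
    z = NormalDist()                       # standardized theoretical distribution
    points = []
    for i, x in enumerate(xs, start=1):    # i is the rank of the observation
        p = (i - r_adj) / (n + n_adj)      # strictly between 0 and 1
        points.append((z.inv_cdf(p), x))
    return points

points = qq_points([3.0, 1.0, 2.0])
```

If the observations follow the theoretical distribution, the returned pairs fall close to a straight line when plotted.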

Probability-probability plots. In probability-probability plots (or P-P plots for short) the observed cumulative distribution function is plotted against the theoretical cumulative distribution function. As in the Q-Q plot, the values of the respective variable are first sorted into ascending order. The i'th observation is plotted against one axis as i/n (i.e., the observed cumulative distribution function), and against the other axis as F(x(i)), where F(x(i)) stands for the value of the theoretical cumulative distribution function for the respective observation x(i). If the theoretical cumulative distribution approximates the observed distribution well, then all points in this plot should fall onto the diagonal line (as in the graph below).
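A corresponding sketch for the P-P plot is given below. It assumes, for illustration only, that the theoretical distribution is a normal curve fitted by the sample mean and standard deviation; the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def pp_points(data):
    """Return (theoretical CDF, observed CDF) pairs for a normal P-P plot."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))   # assumed fitted distribution
    # Observed CDF for the i'th ordered value is i/n; theoretical CDF is F(x(i)).
    return [(fitted.cdf(x), i / n) for i, x in enumerate(xs, start=1)]

pp = pp_points([1.0, 2.0, 3.0, 4.0, 5.0])
```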

Non-Normal Process Capability Indices (Percentile Method)

As described earlier, process capability indices are generally computed to evaluate the quality of a process, that is, to estimate the relative range of the items manufactured by the process (process width) with regard to the engineering specifications. For the standard, normal-distribution-based, process capability indices, the process width is typically defined as 6 times sigma, that is, as plus/minus 3 times the estimated process standard deviation. For the standard normal curve, these limits (zl = -3 and zu = +3) translate into the 0.135 percentile and 99.865 percentile, respectively. In the non-normal case, the 3 times sigma limits as well as the mean (zM = 0.0) can be replaced by the corresponding standard values, given the same percentiles, under the non-normal curve. This procedure is described in detail by Clements (1989).

Process capability indices. Shown below are the formulas for the non-normal process capability indices:

Cp = (USL-LSL)/(Up-Lp)

CpL = (M-LSL)/(M-Lp)

CpU = (USL-M)/(Up-M)

Cpk = Min(CpU, CpL)

In these equations, M represents the 50'th percentile value for the respective fitted distribution, and Up and Lp are the 99.865 and .135 percentile values, respectively, if the computations are based on a process width of ±3 times sigma. Note that the values for Up and Lp may be different, if the process width is defined by different sigma limits (e.g., ±2 times sigma).
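The percentile-based indices can be sketched as below. The function name and the numerical values are illustrative assumptions. For demonstration the percentiles are taken from a normal curve, in which case the indices reduce (approximately) to the usual normal-theory values; for a fitted non-normal curve the same three percentiles would be supplied instead.

```python
from statistics import NormalDist

def percentile_capability(usl, lsl, lp, m, up):
    """Capability indices from the lower, median, and upper percentile values."""
    cp = (usl - lsl) / (up - lp)
    cpl = (m - lsl) / (m - lp)
    cpu = (usl - m) / (up - m)
    return {"Cp": cp, "CpL": cpl, "CpU": cpu, "Cpk": min(cpu, cpl)}

# Illustrative fitted distribution: normal with mean 10 and sigma 1.
fitted = NormalDist(mu=10.0, sigma=1.0)
lp = fitted.inv_cdf(0.00135)   # 0.135 percentile, about mean - 3 sigma
m = fitted.inv_cdf(0.5)        # 50th percentile
up = fitted.inv_cdf(0.99865)   # 99.865 percentile, about mean + 3 sigma
indices = percentile_capability(usl=14.0, lsl=7.0, lp=lp, m=m, up=up)
```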

Weibull and Reliability/Failure Time Analysis

• General Purpose

• The Weibull Distribution

• Censored Observations

• Two- and three-parameter Weibull Distribution

• Parameter Estimation

• Goodness of Fit Indices

• Interpreting Results

• Grouped Data

• Modified Failure Order for Multiple-Censored Data

• Weibull CDF, reliability, and hazard functions

A key aspect of product quality is product reliability. A number of specialized techniques have been developed to quantify reliability and to estimate the "life expectancy" of a product. Standard references and textbooks describing these techniques include Lawless (1982), Nelson (1990), Lee (1980, 1992), and Dodson (1994); the relevant functions of the Weibull distribution (hazard, CDF, reliability) are also described in the Weibull CDF, reliability, and hazard functions section. Note that very similar statistical procedures are used in the analysis of survival data (see also the description of Survival Analysis), and, for example, the descriptions in Lee's book (Lee, 1992) are primarily addressed to biomedical research applications. An excellent overview with many examples of engineering applications is provided by Dodson (1994).

General Purpose

The reliability of a product or component constitutes an important aspect of product quality. Of particular interest is the quantification of a product's reliability, so that one can derive estimates of the product's expected useful life. For example, suppose you are flying a small single engine aircraft. It would be very useful (in fact vital) information to know what the probability of engine failure is at different stages of the engine's "life" (e.g., after 500 hours of operation, 1000 hours of operation, etc.). Given a good estimate of the engine's reliability, and the confidence limits of this estimate, one can then make a rational decision about when to swap or overhaul the engine.

The Weibull Distribution

A useful general distribution for describing failure time data is the Weibull distribution (see also Weibull CDF, reliability, and hazard functions). The distribution is named after the Swedish professor Waloddi Weibull, who demonstrated the appropriateness of this distribution for modeling a wide variety of different data sets (see also Hahn and Shapiro, 1967; for example, the Weibull distribution has been used to model the life times of electronic components, relays, ball bearings, or even some businesses).

Hazard function and the bathtub curve. It is often meaningful to consider the function that describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). This function is called the hazard function (or, sometimes, also the conditional failure, intensity, or force of mortality function), and is generally defined as:

h(t) = f(t)/(1-F(t))

where h(t) stands for the hazard function (of time t), and f(t) and F(t) are the probability density and cumulative distribution functions, respectively. The hazard (conditional failure) function for most machines (components, devices) can best be described in terms of the "bathtub" curve: Very early during the life of a machine, the rate of failure is relatively high (so-called Infant Mortality Failures); after all components settle, and the electronic parts are burned in, the failure rate is relatively constant and low. Then, after some time of operation, the failure rate again begins to increase (so-called Wear-out Failures), until all components or devices will have failed.

For example, new automobiles often suffer several small failures right after they were purchased. Once these have been "ironed out," a (hopefully) long relatively trouble-free period of operation will follow. Then, as the car reaches a particular age, it becomes more prone to breakdowns, until finally, after 20 years and 250000 miles, practically all cars will have failed. A typical bathtub hazard function is shown below.

The Weibull distribution is flexible enough for modeling the key stages of this typical bathtub-shaped hazard function. Shown below are the hazard functions for shape parameters c=.5, c=1, c=2, and c=5.

Clearly, the early ("infant mortality") "phase" of the bathtub can be approximated by a Weibull hazard function with shape parameter c<1; the constant hazard phase of the bathtub can be modeled with a shape parameter c=1, and the final ("wear-out") stage of the bathtub with c>1.
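The three regimes can be verified with a short sketch of the Weibull hazard function, h(t) = (c/b) * ((t - theta)/b)^(c-1), which is given in the Weibull CDF, reliability, and hazard functions section. The function name and the parameter values are illustrative assumptions; the function is only defined for t greater than the location parameter theta.

```python
def weibull_hazard(t, b, c, theta=0.0):
    """Weibull hazard at time t for scale b, shape c, and location theta (t > theta)."""
    return (c / b) * ((t - theta) / b) ** (c - 1)

# Illustrative values: scale b = 2, evaluated at an early and a later time.
early_decreasing = weibull_hazard(1.0, b=2.0, c=0.5) > weibull_hazard(4.0, b=2.0, c=0.5)
constant = weibull_hazard(1.0, b=2.0, c=1.0) == weibull_hazard(4.0, b=2.0, c=1.0)
late_increasing = weibull_hazard(1.0, b=2.0, c=2.0) < weibull_hazard(4.0, b=2.0, c=2.0)
```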

Cumulative distribution and reliability functions. Once a Weibull distribution (with a particular set of parameters) has been fit to the data, a number of additional important indices and measures can be estimated. For example, you can compute the cumulative distribution function (commonly denoted as F(t)) for the fitted distribution, along with the standard errors for this function. Thus, you can determine the percentiles of the cumulative survival (and failure) distribution, and, for example, predict the time at which a predetermined percentage of components can be expected to have failed.

The reliability function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the reliability function is also sometimes referred to as the survivorship or survival function (since it describes the probability of not failing or of surviving until a certain time t; e.g., see Lee, 1992). Shown below is the reliability function for the Weibull distribution, for different shape parameters.

For shape parameters less than 1, the reliability decreases sharply very early in the respective product's life, and then slowly thereafter. For shape parameters greater than 1, the initial drop in reliability is small, and then the reliability drops relatively sharply at some point later in time. The point where all curves intersect is called the characteristic life: regardless of the shape parameter, 63.2 percent of the population will have failed at or before this point (i.e., R(t) = 1-0.632 = .368). This point in time is also equal to the respective scale parameter b of the two-parameter Weibull distribution (with location parameter θ = 0; otherwise it is equal to b + θ).
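That the characteristic life does not depend on the shape parameter can be confirmed directly from the reliability function R(t) = exp(-((t - theta)/b)^c). The sketch below uses a hypothetical function name and illustrative parameter values.

```python
import math

def weibull_reliability(t, b, c, theta=0.0):
    """Probability of surviving beyond time t for a Weibull distribution."""
    return math.exp(-(((t - theta) / b) ** c))

# Reliability evaluated at t = b (theta = 0) for several shape parameters.
at_characteristic_life = [weibull_reliability(500.0, b=500.0, c=c) for c in (0.5, 1.0, 2.0, 5.0)]
```

Each value equals exp(-1), or about .368, because the term (t - theta)/b is exactly 1 at the characteristic life and 1 raised to any power is 1.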

The formulas for the Weibull cumulative distribution, reliability, and hazard functions are shown in the Weibull CDF, reliability, and hazard functions section.

Censored Observations

In most studies of product reliability, not all items in the study will fail. In other words, by the end of the study the researcher only knows that a certain number of items have not failed for a particular amount of time, but has no knowledge of the exact failure times (i.e., "when the items would have failed"). Those types of data are called censored observations. The issue of censoring, and several methods for analyzing censored data sets, are also described in great detail in the context of Survival Analysis. Censoring can occur in many different ways.

Type I and II censoring. So-called Type I censoring describes the situation when a test is terminated at a particular point in time, so that the remaining items are only known not to have failed up to that time (e.g., we start with 100 light bulbs, and terminate the experiment after a certain amount of time). In this case, the censoring time is often fixed, and the number of items failing is a random variable. In Type II censoring the experiment would be continued until a fixed proportion of items have failed (e.g., we stop the experiment after exactly 50 light bulbs have failed). In this case, the number of items failing is fixed, and time is the random variable.
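The difference between the two schemes can be sketched by applying each to the same set of hypothetical failure times. The function names are assumptions made for illustration, and each item is recorded as a (time, failed) pair.

```python
def type_i_censor(failure_times, stop_time):
    """Test stops at a fixed time; items still running are censored at stop_time."""
    return [(min(t, stop_time), t <= stop_time) for t in sorted(failure_times)]

def type_ii_censor(failure_times, n_failures):
    """Test stops at the n_failures'th failure; remaining items are censored at that time."""
    ts = sorted(failure_times)
    stop_time = ts[n_failures - 1]
    return [(t, True) if k < n_failures else (stop_time, False)
            for k, t in enumerate(ts)]

times = [5, 12, 20, 33, 41]                  # hypothetical true failure times
by_time = type_i_censor(times, stop_time=25)     # number of failures is random
by_count = type_ii_censor(times, n_failures=2)   # stopping time is random
```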

Left and right censoring. An additional distinction can be made to reflect the "side" of the time dimension at which censoring occurs. In the examples described above, the censoring always occurred on the right side (right censoring), because the researcher knows when exactly the experiment started, and the censoring always occurs on the right side of the time continuum. Alternatively, it is conceivable that the censoring occurs on the left side (left censoring). For example, in biomedical research one may know that a patient entered the hospital at a particular date, and that s/he survived for a certain amount of time thereafter; however, the researcher does not know when exactly the symptoms of the disease first occurred or were diagnosed.

Single and multiple censoring. Finally, there are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring). To return to the light bulb example, if the experiment is terminated at a particular point in time, then a single point of censoring exists, and the data set is said to be single-censored. However, in biomedical research multiple censoring often exists, for example, when patients are discharged from a hospital after different amounts (times) of treatment, and the researcher knows that the patient survived up to those (differential) points of censoring.

The methods described in this section are applicable primarily to right censoring, and single- as well as multiple-censored data.

Two- and three-parameter Weibull distribution

The Weibull distribution is bounded on the left side. If you look at the probability density function, you can see that the term x-θ must be greater than 0. In most cases, the location parameter θ (theta) is known (usually 0): it identifies the smallest possible failure time. However, sometimes the probability of failure of an item is 0 (zero) for some time after a study begins, and in that case it may be necessary to estimate a location parameter that is greater than 0. There are several methods for estimating the location parameter of the three-parameter Weibull distribution. To identify situations when the location parameter is greater than 0, Dodson (1994) recommends looking for downward or upward sloping tails on a probability plot (see below), as well as large (>6) values for the shape parameter after fitting the two-parameter Weibull distribution, which may indicate a non-zero location parameter.

Parameter Estimation

Maximum likelihood estimation. Standard iterative function minimization methods can be used to compute maximum likelihood parameter estimates for the two- and three-parameter Weibull distribution. The specific methods for estimating the parameters are described in Dodson (1994); a detailed description of a Newton-Raphson iterative method for estimating the maximum likelihood parameters for the two-parameter distribution is provided in Keats and Lawrence (1997).

The estimation of the location parameter for the three-parameter Weibull distribution poses a number of special problems, which are detailed in Lawless (1982). Specifically, when the shape parameter is less than 1, then a maximum likelihood solution does not exist for the parameters. In other instances, the likelihood function may contain more than one maximum (i.e., multiple local maxima). In the latter case, Lawless basically recommends using the smallest failure time (or a value that is a little bit less) as the estimate of the location parameter.

Nonparametric (rank-based) probability plots. One can derive a descriptive estimate of the cumulative distribution function (regardless of distribution) by first rank-ordering the observations, and then computing any of the following expressions:

Median rank:

F(t) = (j-0.3)/(n+0.4)

Mean rank:

F(t) = j/(n+1)

White's plotting position:

F(t) = (j-3/8)/(n+1/4)

where j denotes the failure order (rank; for multiple-censored data a weighted average ordered failure is computed; see Dodson, p. 21), and n is the total number of observations. One can then construct the following plot.

Note that the horizontal Time axis is scaled logarithmically; on the vertical axis the quantity log(log(100/(100-F(t)))) is plotted, with F(t) expressed in percent (a probability scale is shown on the left y-axis). From this plot the parameters of the two-parameter Weibull distribution can be estimated; specifically, the shape parameter is equal to the slope of the linear fit-line, and the scale parameter can be estimated as exp(-intercept/slope).
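The estimation from the probability plot can be sketched as follows. The sketch assumes complete (uncensored) data, uses the median rank as the plotting position, works with F(t) as a proportion rather than a percentage, and fits the line by ordinary least squares; the function names and the synthetic data are illustrative assumptions.

```python
import math

def median_rank(j, n):
    """Median rank plotting position for the j'th ordered failure out of n."""
    return (j - 0.3) / (n + 0.4)

def fit_line(xs, ys):
    """Ordinary least squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def weibull_from_probability_plot(times):
    """Estimate (shape, scale) of a two-parameter Weibull from complete failure times."""
    ts = sorted(times)
    n = len(ts)
    xs = [math.log(t) for t in ts]                       # logarithmic time axis
    ys = [math.log(-math.log(1.0 - median_rank(j, n)))   # log(log(1/(1-F)))
          for j in range(1, n + 1)]
    slope, intercept = fit_line(xs, ys)
    return slope, math.exp(-intercept / slope)

# Synthetic data lying exactly on a Weibull line with shape 2 and scale 100.
n = 10
synthetic = [100.0 * (-math.log(1.0 - median_rank(j, n))) ** 0.5 for j in range(1, n + 1)]
shape, scale = weibull_from_probability_plot(synthetic)
```

Because the synthetic times were generated from the plotting positions themselves, the fit recovers the generating parameters; real data will scatter around the line.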

Estimating the location parameter from probability plots. It is apparent in the plot shown above that the regression line provides a good fit to the data. When the location parameter is misspecified (e.g., not equal to zero), then the linear fit is worse as compared to the case when it is appropriately specified. Therefore, one can compute the probability plot for several values of the location parameter, and observe the quality of the fit. These computations are summarized in the following plot.

Here the common R-square measure (correlation squared) is used to express the quality of the linear fit in the probability plot, for different values of the location parameter shown on the horizontal x axis (this plot is based on the example data set in Dodson, 1994, Table 2.9). This plot is often very useful when the maximum likelihood estimation procedure for the three-parameter Weibull distribution fails, because it shows whether or not a unique (single) optimum value for the location parameter exists (as in the plot shown above).
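The scan over candidate location parameters can be sketched as below, again assuming complete data and median ranks. The function names, the candidate grid, and the synthetic data are illustrative assumptions; every candidate must be smaller than the smallest observed failure time so that the logarithm is defined.

```python
import math

def plot_r_squared(times, theta):
    """R-square of the Weibull probability plot after subtracting theta from the times."""
    ts = sorted(times)
    n = len(ts)
    xs = [math.log(t - theta) for t in ts]
    ys = [math.log(-math.log(1.0 - (j - 0.3) / (n + 0.4))) for j in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def best_location(times, candidates):
    """Candidate location parameter giving the most linear probability plot."""
    return max(candidates, key=lambda theta: plot_r_squared(times, theta))

# Synthetic data generated with location 50, scale 100, and shape 2.
n = 10
shifted_times = [50.0 + 100.0 * (-math.log(1.0 - (j - 0.3) / (n + 0.4))) ** 0.5
                 for j in range(1, n + 1)]
best = best_location(shifted_times, [0.0, 25.0, 50.0, 70.0])
```

Plotting the R-square values against the candidates, rather than only taking the maximum, shows whether the optimum is unique.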

Hazard plotting. Another method for estimating the parameters for the two-parameter Weibull distribution is via hazard plotting (as discussed earlier, the hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time). This method is very similar to the probability plotting method. First plot the logarithm of the cumulative hazard function against the logarithm of the survival times; then fit a linear regression line and compute the slope and intercept of that line. As in probability plotting, the shape parameter can then be estimated as the slope of the regression line, and the scale parameter as exp(-intercept/slope). See Dodson (1994) for additional details; see also Weibull CDF, reliability, and hazard functions.
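A sketch of hazard plotting is given below. For the Weibull distribution the cumulative hazard is ((t - theta)/b)^c, so its logarithm is linear in the logarithm of time, and the regression is therefore carried out on the log-log scale. The cumulative hazard is estimated here by summing 1/(number of items still at risk) at each failure, which is an assumption made for illustration (the text does not specify the estimator); the function names are hypothetical.

```python
import math

def cumulative_hazard(observations):
    """observations: (time, failed) pairs. Returns (time, cumulative hazard) per failure."""
    obs = sorted(observations, key=lambda o: (o[0], not o[1]))  # failures before ties
    n = len(obs)
    total, points = 0.0, []
    for k, (t, failed) in enumerate(obs):
        if failed:
            total += 1.0 / (n - k)      # n - k items are still at risk
            points.append((t, total))
    return points

def weibull_from_hazard_plot(observations):
    """Estimate (shape, scale) from a line fitted to log cumulative hazard vs. log time."""
    pts = cumulative_hazard(observations)
    xs = [math.log(t) for t, _ in pts]
    ys = [math.log(h) for _, h in pts]
    n = len(pts)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, math.exp(-intercept / slope)

# Hypothetical right-censored sample: the item at time 20 did not fail.
sample = [(10.0, True), (20.0, False), (30.0, True)]
hazard_points = cumulative_hazard(sample)
hz_shape, hz_scale = weibull_from_hazard_plot(sample)
```

Censored items contribute no hazard increment of their own, but they still count as being at risk for earlier failures.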

Method of moments. This method -- to approximate the moments of the observed distribution by choosing the appropriate parameters for the Weibull distribution -- is also widely described in the literature. In fact, this general method is used for fitting the Johnson curves (a general family of non-normal distributions) to the data, in order to compute non-normal process capability indices (see Fitting Distributions by Moments). However, the method is not suited for censored data sets, and is therefore not very useful for the analysis of failure time data.

Comparing the estimation methods. Dodson (1994) reports the result of a Monte Carlo simulation study, comparing the different methods of estimation. In general, the maximum likelihood estimates proved to be best for large sample sizes (e.g., n>15), while probability plotting and hazard plotting appeared to produce better (more accurate) estimates for smaller samples.

A note of caution regarding maximum likelihood based confidence limits. Many software programs will compute confidence intervals for maximum likelihood estimates, and for the reliability function based on the standard errors of the maximum likelihood estimates. Dodson (1994) cautions against the interpretation of confidence limits computed from maximum likelihood estimates, or more precisely, estimates that involve the information matrix for the estimated parameters. When the shape parameter is less than 2, the variance estimates computed for maximum likelihood estimates lack accuracy, and it is advisable to compute the various results graphs based on nonparametric confidence limits as well.

Goodness of Fit Indices

A number of different tests have been proposed for evaluating the quality of the fit of the Weibull distribution to the observed data. These tests are discussed and compared in detail in Lawless (1982).

Hollander-Proschan. This test compares the theoretical reliability function to the Kaplan-Meier estimate. The actual computations for this test are somewhat complex, and you may refer to Dodson (1994, Chapter 4) for a detailed description of the computational formulas. The Hollander-Proschan test is applicable to complete, single-censored, and multiple-censored data sets; however, Dodson (1994) cautions that the test may sometimes indicate a poor fit when the data are heavily single-censored. The Hollander-Proschan C statistic can be tested against the normal distribution (z).

Mann-Scheuer-Fertig. This test, proposed by Mann, Scheuer, and Fertig (1973), is described in detail in, for example, Dodson (1994) or Lawless (1982). The null hypothesis for this test is that the population follows the Weibull distribution with the estimated parameters. Nelson (1982) reports this test to have reasonably good power, and this test can be applied to Type II censored data. For computational details refer to Dodson (1994) or Lawless (1982); the critical values for the test statistic have been computed based on Monte Carlo studies, and have been tabulated for n (sample sizes) between 3 and 25.

Anderson-Darling. The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. However, this test is only applicable to complete data sets (without censored observations). The critical values for the Anderson-Darling statistic have been tabulated (see, for example, Dodson, 1994, Table 4.4) for sample sizes between 10 and 40; this test is not computed for n less than 10 or greater than 40.

Interpreting Results

Once a satisfactory fit of the Weibull distribution to the observed failure time data has been obtained, there are a number of different plots and tables that are of interest to understand the reliability of the item under investigation. If a good fit for the Weibull cannot be established, distribution-free reliability estimates (and graphs) should be reviewed to determine the shape of the reliability function.

Reliability plots. This plot will show the estimated reliability function along with the confidence limits.

Note that nonparametric (distribution-free) estimates and their standard errors can also be computed and plotted.

Hazard plots. As mentioned earlier, the hazard function describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). The plot of hazard as a function of time gives valuable information about the conditional failure probability.

Percentiles of the reliability function. Based on the fitted Weibull distribution, one can compute the percentiles of the reliability (survival) function, along with the confidence limits for these estimates (for maximum likelihood parameter estimates). These estimates are particularly valuable for determining the percentages of items that can be expected to have failed at particular points in time.
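Inverting the Weibull cumulative distribution function gives the time by which a given proportion of items is expected to have failed: t = theta + b * (-log(1 - p))^(1/c). The sketch below uses a hypothetical function name and illustrative parameter values, and it does not compute the confidence limits mentioned above.

```python
import math

def weibull_failure_time(p, b, c, theta=0.0):
    """Time by which a proportion p of items is expected to have failed."""
    return theta + b * (-math.log(1.0 - p)) ** (1.0 / c)

# Time by which 10 percent of items fail, for illustrative scale 1000 and shape 2.
b10_life = weibull_failure_time(0.10, b=1000.0, c=2.0)
```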

Grouped Data

In some cases, failure time data are presented in grouped form. Specifically, instead of having available the precise failure time for each observation, only aggregate information is available about the number of items that failed or were censored in a particular time interval. Such life-table data input is also described in the context of the Survival Analysis chapter. There are two general approaches for fitting the Weibull distribution to grouped data.

First, one can treat the tabulated data as if they were continuous. In other words, one can "expand" the tabulated values into continuous data by assuming (1) that each observation in a given time interval failed exactly at the interval mid-point (interpolating out "half a step" for the last interval), and (2) that censoring occurred after the failures in each interval (in other words, censored observations are sorted after the observed failures). Lawless (1982) advises that this method is usually satisfactory if the class intervals are relatively narrow.
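The expansion of a life table into individual observations can be sketched as follows. The function name and table layout are assumptions made for illustration; the sketch places censored observations at the interval mid-point as well (sorted after the failures), and it assumes every interval has a finite end point, so the "half a step" rule for an open-ended last interval is not implemented.

```python
def expand_life_table(intervals):
    """intervals: (start, end, n_failed, n_censored). Returns (time, failed) pairs."""
    expanded = []
    for start, end, n_failed, n_censored in intervals:
        mid = (start + end) / 2.0
        expanded.extend([(mid, True)] * n_failed)      # failures at the mid-point
        expanded.extend([(mid, False)] * n_censored)   # censored items sorted after them
    return expanded

# Hypothetical table: two failures and one censored item in 0-100, one failure in 100-200.
expanded = expand_life_table([(0, 100, 2, 1), (100, 200, 1, 0)])
```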

Alternatively, you may treat the data explicitly as a tabulated life table, and use a weighted least squares algorithm (based on Gehan and Siddiqui, 1973; see also Lee, 1992) to fit the Weibull distribution (Lawless, 1982, also describes methods for computing maximum likelihood parameter estimates from grouped data).

Modified Failure Order for Multiple-Censored Data

For multiple-censored data, a weighted average ordered failure is calculated for each failure after the first censored data point. These failure orders are then used to compute the median rank, to estimate the cumulative distribution function.

The increment Ij used to obtain the modified failure order of the j'th failure is computed as (see Dodson, 1994):

Ij = ((n+1)-Op)/(1+c)

where:

Ij is the increment for the j'th failure

n is the total number of data points

Op is the failure order of the previous observation (and Oj = Op + Ij)

c is the number of data points remaining in the data set, including the current data point

The median rank is then computed as:

F(t) = (Oj - 0.3)/(n + 0.4)

where Oj = Op + Ij denotes the modified failure order, and n is the total number of observations.
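As a rough illustration of the calculation described above, the following Python sketch computes modified failure orders and median ranks for a small, made-up data set. The function name, the input layout (a list of True/False failure flags, already sorted by time), and the example data are assumptions made for illustration only; the median rank is applied to the modified order Oj = Op + Ij, following the definitions given above, and Op is taken to be 0 before the first failure.

```python
def modified_median_ranks(is_failure):
    """Return (modified_orders, median_ranks) for observations sorted by time.

    is_failure: list of booleans, True for a failure, False for a censored item.
    Hypothetical helper for illustration only.
    """
    n = len(is_failure)
    orders, ranks = [], []
    previous_order = 0.0  # Op: failure order of the previous failure
    for position, failed in enumerate(is_failure):
        if not failed:
            continue  # censored points receive no failure order
        remaining = n - position  # c: points remaining, including this one
        increment = ((n + 1) - previous_order) / (1 + remaining)  # Ij
        order = previous_order + increment  # Oj = Op + Ij
        orders.append(order)
        ranks.append((order - 0.3) / (n + 0.4))  # median rank
        previous_order = order
    return orders, ranks


# Made-up example: failure, censored, failure, failure
orders, ranks = modified_median_ranks([True, False, True, True])
```

Note that before the first censored observation the increment is exactly 1, so the modified orders coincide with the ordinary failure ranks.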

Weibull CDF, Reliability, and Hazard

Density function. The Weibull distribution (Weibull, 1939, 1951; see also Lieblein, 1955) has density function (for positive parameters b, c, and θ):

f(x) = (c/b) * [(x-θ)/b]**(c-1) * exp{-[(x-θ)/b]**c}

θ < x, b > 0, c > 0

where

b is the scale parameter of the distribution

c is the shape parameter of the distribution

θ is the location parameter of the distribution

e is the base of the natural logarithm, sometimes called Euler's e (2.71...)

Cumulative distribution function (CDF). The Weibull distribution has the cumulative distribution function (for positive parameters b, c, and θ):

F(x) = 1 - exp{-[(x-θ)/b]**c}

using the same notation and symbols as described above for the density function.

Reliability function. The Weibull reliability function is the complement of the cumulative distribution function:

R(x) = 1 - F(x)

Hazard function. The hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time. The Weibull distribution has the hazard function (for positive parameters b, c, and θ):

h(t) = f(t)/R(t) = [c*(t-θ)**(c-1)] / b**c

using the same notation and symbols as described above for the density and reliability functions.

Cumulative hazard function. The Weibull distribution has the cumulative hazard function (for positive parameters b, c, and θ):

H(t) = (t-θ)**c / b**c

using the same notation and symbols as described above for the density and reliability functions.
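The relations among these functions can be checked numerically. The following Python sketch is only an illustration: the function names and the example parameter values are assumptions, and the cumulative hazard is taken to be H(t) = -ln R(t) = [(t - theta)/b]**c, where theta stands for the location parameter.

```python
import math


def weibull_pdf(x, b, c, theta=0.0):
    """Density f(x) for scale b, shape c, and location theta (requires x > theta)."""
    z = (x - theta) / b
    return (c / b) * z ** (c - 1) * math.exp(-(z ** c))


def weibull_cdf(x, b, c, theta=0.0):
    """Cumulative distribution function F(x)."""
    z = (x - theta) / b
    return 1.0 - math.exp(-(z ** c))


def weibull_reliability(x, b, c, theta=0.0):
    """Reliability (survival) function R(x) = 1 - F(x)."""
    return 1.0 - weibull_cdf(x, b, c, theta)


def weibull_hazard(x, b, c, theta=0.0):
    """Hazard h(x) = f(x)/R(x) = c * (x - theta)**(c-1) / b**c."""
    return c * (x - theta) ** (c - 1) / b ** c


def weibull_cum_hazard(x, b, c, theta=0.0):
    """Cumulative hazard H(x) = -ln R(x) = ((x - theta)/b)**c."""
    return ((x - theta) / b) ** c


# Illustrative values only: scale 3, shape 1.5, location 0, evaluated at x = 2
h = weibull_hazard(2.0, 3.0, 1.5)
ratio = weibull_pdf(2.0, 3.0, 1.5) / weibull_reliability(2.0, 3.0, 1.5)
```

For any admissible parameter values, h and ratio should agree up to rounding error, which is a convenient sanity check on the formulas.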

© Copyright StatSoft, Inc., 1984-2003

Quality Control Charts

• General Purpose

• General Approach

• Establishing Control Limits

• Common Types of Charts

• Short Run Control Charts

o Short Run Charts for Variables

o Short Run Charts for Attributes

• Unequal Sample Sizes

• Control Charts for Variables vs. Charts for Attributes

• Control Charts for Individual Observations

• Out-of-Control Process: Runs Tests

• Operating Characteristic (OC) Curves

• Process Capability Indices

• Other Specialized Control Charts

General Purpose

In all production processes, we need to monitor the extent to which our products meet specifications. In the most general terms, there are two "enemies" of product quality: (1) deviations from target specifications, and (2) excessive variability around target specifications. During the earlier stages of developing the production process, designed experiments are often used to optimize these two quality characteristics (see Experimental Design); the methods provided in Quality Control are on-line or in-process quality control procedures to monitor an on-going production process. For detailed descriptions of these charts and extensive annotated examples, see Buffa (1972), Duncan (1974), Grant and Leavenworth (1980), Juran (1962), Juran and Gryna (1970), Montgomery (1985, 1991), Shirland (1993), or Vaughn (1974). Two recent excellent introductory texts with a "how-to" approach are Hart & Hart (1989) and Pyzdek (1989); two recent German language texts on this subject are Rinne and Mittag (1995) and Mittag (1993).

General Approach

The general approach to on-line quality control is straightforward: We simply extract samples of a certain size from the ongoing production process. We then produce line charts of the variability in those samples, and consider their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, then we declare the process to be out of control and take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart who is generally credited as being the first to introduce these methods; see Shewhart, 1931).

Interpreting the chart. The most standard display actually contains two charts (and two histograms); one is called an X-bar chart, the other is called an R chart.

In both line charts, the horizontal axis represents the different samples; the vertical axis for the X-bar chart represents the means for the characteristic of interest; the vertical axis for the R chart represents the ranges. For example, suppose we wanted to control the diameter of piston rings that we are producing. The center line in the X-bar chart would represent the desired standard size (e.g., diameter in millimeters) of the rings, while the center line in the R chart would represent the acceptable (within-specification) range of the rings within samples; thus, this latter chart is a chart of the variability of the process (the larger the variability, the larger the range). In addition to the center line, a typical chart includes two additional horizontal lines to represent the upper and lower control limits (UCL, LCL, respectively); we will return to those lines shortly. Typically, the individual points in the chart, representing the samples, are connected by a line. If this line moves outside the upper or lower control limits or exhibits systematic patterns across consecutive samples (see Runs Tests), then a quality problem may potentially exist.

Establishing Control Limits

Even though one could arbitrarily determine when to declare a process out of control (that is, outside the UCL-LCL range), it is common practice to apply statistical principles to do so. Elementary Concepts discusses the concept of the sampling distribution, and the characteristics of the normal distribution. The method for constructing the upper and lower control limits is a straightforward application of the principles described there.

Example. Suppose we want to control the mean of a variable, such as the size of piston rings. Under the assumption that the mean (and variance) of the process does not change, the successive sample means will be distributed normally around the actual mean. Moreover, without going into details regarding the derivation of this formula, we also know (because of the central limit theorem, and thus approximate normal distribution of the means; see, for example, Hoyer and Ellis, 1996) that the distribution of sample means will have a standard deviation of Sigma (the standard deviation of individual data points or measurements) over the square root of n (the sample size). It follows that approximately 95% of the sample means will fall within the limits of the process mean ± 1.96 * Sigma/Square Root(n) (refer to Elementary Concepts for a discussion of the characteristics of the normal distribution and the central limit theorem). In practice, it is common to replace the 1.96 with 3 (so that the interval will include approximately 99.7% of the sample means), and to define the upper and lower control limits as plus and minus 3 sigma limits, respectively.
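The arithmetic behind these limits can be sketched in a few lines of Python. The function names and the piston ring values (target 74.0 mm, Sigma 0.01 mm, samples of 5 rings) are made up for illustration; the coverage calculation assumes normally distributed sample means.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def xbar_control_limits(center, sigma, n, k=3.0):
    """Return (LCL, UCL) = center -/+ k * sigma / sqrt(n)."""
    half_width = k * sigma / math.sqrt(n)
    return center - half_width, center + half_width


# Illustrative values only: target diameter 74.0 mm, Sigma 0.01 mm, samples of 5 rings
lcl, ucl = xbar_control_limits(74.0, 0.01, 5)
coverage_196 = normal_coverage(1.96)  # close to 0.95
coverage_3 = normal_coverage(3.0)     # close to 0.997
```

Passing k=1.96 instead of the default k=3 gives the narrower 95% limits mentioned in the text.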

General case. The general principle for establishing control limits just described applies to all control charts. After deciding on the characteristic we want to control, for example, the standard deviation, we estimate the expected variability of the respective characteristic in samples of the size we are about to take. Those estimates are then used to establish the control limits on the chart.

Common Types of Charts

The types of charts are often classified according to the type of quality characteristic that they are supposed to monitor: there are quality control charts for variables and control charts for attributes. Specifically, the following charts are commonly constructed for controlling variables:

• X-bar chart. In this chart the sample means are plotted in order to control the mean value of a variable (e.g., size of piston rings, strength of materials, etc.).

• R chart. In this chart, the sample ranges are plotted in order to control the variability of a variable.

• S chart. In this chart, the sample standard deviations are plotted in order to control the variability of a variable.

• S**2 chart. In this chart, the sample variances are plotted in order to control the variability of a variable.

For controlling quality characteristics that represent attributes of the product, the following charts are commonly constructed:

• C chart. In this chart (see example below), we plot the number of defectives (per batch, per day, per machine, per 100 feet of pipe, etc.). This chart assumes that defects of the quality attribute are rare, and the control limits in this chart are computed based on the Poisson distribution (distribution of rare events).

• U chart. In this chart we plot the rate of defectives, that is, the number of defectives divided by the number of units inspected (the n; e.g., feet of pipe, number of batches). Unlike the C chart, this chart does not require a constant number of units, and it can be used, for example, when the batches (samples) are of different sizes.

• Np chart. In this chart, we plot the number of defectives (per batch, per day, per machine) as in the C chart. However, the control limits in this chart are not based on the distribution of rare events, but rather on the binomial distribution. Therefore, this chart should be used if the occurrence of defectives is not rare (e.g., they occur in more than 5% of the units inspected). For example, we may use this chart to control the number of units produced with minor flaws.

• P chart. In this chart, we plot the percent of defectives (per batch, per day, per machine, etc.) as in the U chart. However, the control limits in this chart are not based on the distribution of rare events but rather on the binomial distribution (of proportions). Therefore, this chart is most applicable to situations where the occurrence of defectives is not rare (e.g., we expect the percent of defectives to be more than 5% of the total number of units produced).

All of these charts can be adapted for short production runs (short run charts), and for multiple process streams.
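To make the difference between the Poisson-based and the binomial-based limits concrete, here is a minimal Python sketch. It assumes the customary 3 sigma limits, with the standard deviation of a Poisson count taken as the square root of its mean and the standard deviation of a binomial proportion taken as the square root of p*(1-p)/n; the function names and example numbers are illustrative assumptions.

```python
import math


def c_chart_limits(c_bar, k=3.0):
    """Poisson-based limits for counts: c_bar -/+ k * sqrt(c_bar), floored at 0."""
    spread = k * math.sqrt(c_bar)
    return max(0.0, c_bar - spread), c_bar + spread


def p_chart_limits(p_bar, n, k=3.0):
    """Binomial-based limits for proportions, clipped to the interval [0, 1]."""
    spread = k * math.sqrt(p_bar * (1.0 - p_bar) / n)
    return max(0.0, p_bar - spread), min(1.0, p_bar + spread)


# Illustrative values only: an average of 4 defects per batch,
# and an average of 10% defectives in samples of 100 units
c_limits = c_chart_limits(4.0)
p_limits = p_chart_limits(0.10, 100)
```

The lower limit is floored at zero because a negative count or proportion is not meaningful.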

Short Run Charts

The short run control chart, or control chart for short production runs, plots observations of variables or attributes for multiple parts on the same chart. Short run control charts were developed to address the requirement that several dozen measurements of a process must be collected before control limits are calculated. Meeting this requirement is often difficult for operations that produce a limited number of a particular part during a production run.

For example, a paper mill may produce only three or four (huge) rolls of a particular kind of paper (i.e., part) and then shift production to another kind of paper. But if variables, such as paper thickness, or attributes, such as blemishes, are monitored for several dozen rolls of paper of, say, a dozen different kinds, control limits for thickness and blemishes could be calculated for the transformed (within the short production run) variable values of interest. Specifically, these transformations will rescale the variable values of interest such that they are of compatible magnitudes across the different short production runs (or parts). The control limits computed for those transformed values could then be applied in monitoring thickness, and blemishes, regardless of the types of paper (parts) being produced. Statistical process control procedures could be used to determine if the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement.

For additional discussions of short run charts refer to Bothe (1988), Johnson (1987), or Montgomery (1991).

Short Run Charts for Variables

Nominal chart, target chart. There are several different types of short run charts. The most basic are the nominal short run chart, and the target short run chart. In these charts, the measurements for each part are transformed by subtracting a part-specific constant. These constants can either be the nominal values for the respective parts (nominal short run chart), or they can be target values computed from the (historical) means for each part (Target X-bar and R chart). For example, the diameters of piston bores for different engine blocks produced in a factory can only be meaningfully compared (for determining the consistency of bore sizes) if the mean differences between bore diameters for different sized engines are first removed. The nominal or target short run chart makes such comparisons possible. Note that for the nominal or target chart it is assumed that the variability across parts is identical, so that control limits based on a common estimate of the process sigma are applicable.

Standardized short run chart. If the variability of the process for different parts cannot be assumed to be identical, then a further transformation is necessary before the sample means for different parts can be plotted in the same chart. Specifically, in the standardized short run chart the plot points are further transformed by dividing the deviations of sample means from part means (or nominal or target values for parts) by part-specific constants that are proportional to the variability for the respective parts. For example, for the short run X-bar and R chart, the plot points (that are shown in the X-bar chart) are computed by first subtracting from each sample mean a part specific constant (e.g., the respective part mean, or nominal value for the respective part), and then dividing the difference by another constant, for example, by the average range for the respective chart. These transformations will result in comparable scales for the sample means for different parts.
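The two transformations just described amount to a subtraction, optionally followed by a division. The following Python sketch shows both; the function names, the part values, and the choice of scaling constant are assumptions made for illustration.

```python
def nominal_plot_point(sample_mean, part_nominal):
    """Nominal (or target) short run chart: deviation from the part-specific constant."""
    return sample_mean - part_nominal


def standardized_plot_point(sample_mean, part_center, part_scale):
    """Standardized short run chart: deviation divided by a part-specific scale,
    for example the average range for the respective part."""
    return (sample_mean - part_center) / part_scale


# Illustrative values only: two bore sizes measured on the same chart
small_bore = nominal_plot_point(50.2, 50.0)   # about 0.2 above nominal
large_bore = nominal_plot_point(89.9, 90.0)   # about 0.1 below nominal
scaled = standardized_plot_point(50.2, 50.0, 0.4)
```

After the transformation, points from both parts share a center line of zero and can be plotted against common limits.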

Short Run Charts for Attributes

For attribute control charts (C, U, Np, or P charts), the estimate of the variability of the process (proportion, rate, etc.) is a function of the process average (average proportion, rate, etc.; for example, the standard deviation of a proportion p is equal to the square root of p*(1- p)/n). Hence, only standardized short run charts are available for attributes. For example, in the short run P chart, the plot points are computed by first subtracting from the respective sample p values the average part p's, and then dividing by the standard deviation of the average p's.
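As a sketch of the short run P chart computation, the Python function below standardizes a sample proportion using the binomial standard deviation mentioned above. The function name and the example numbers are hypothetical.

```python
import math


def short_run_p_point(sample_p, part_p_bar, n):
    """Standardized plot point: (p - p_bar) / sqrt(p_bar * (1 - p_bar) / n)."""
    standard_deviation = math.sqrt(part_p_bar * (1.0 - part_p_bar) / n)
    return (sample_p - part_p_bar) / standard_deviation


# Illustrative values only: 16% defectives in a sample of 100,
# for a part that averages 10% defectives
z = short_run_p_point(0.16, 0.10, 100)  # roughly 2 standard deviations above average
```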

Unequal Sample Sizes

When the samples plotted in the control chart are not of equal size, then the control limits around the center line (target specification) cannot be represented by a straight line. For example, to return to the formula Sigma/Square Root(n) presented earlier for computing control limits for the X-bar chart, it is obvious that unequal n's will lead to different control limits for different sample sizes. There are three ways of dealing with this situation.

Average sample size. If one wants to maintain the straight-line control limits (e.g., to make the chart easier to read and easier to use in presentations), then one can compute the average n per sample across all samples, and establish the control limits based on the average sample size. This procedure is not "exact"; however, as long as the sample sizes are reasonably similar to each other, it is quite adequate.

Variable control limits. Alternatively, one may compute different control limits for each sample, based on the respective sample sizes. This procedure will lead to variable control limits, and result in step-chart like control lines in the plot. This procedure ensures that the correct control limits are computed for each sample. However, one loses the simplicity of straight-line control limits.

Stabilized (normalized) chart. The best of two worlds (straight line control limits that are accurate) can be accomplished by standardizing the quantity to be controlled (mean, proportion, etc.) according to units of sigma. The control limits can then be expressed in straight lines, while the location of the sample points in the plot depends not only on the characteristic to be controlled, but also on the respective sample n's. The disadvantage of this procedure is that the values on the vertical (Y) axis in the control chart are in terms of sigma rather than the original units of measurement, and therefore, those numbers cannot be taken at face value (e.g., a sample with a value of 3 is 3 times sigma away from specifications; in order to express the value of this sample in terms of the original units of measurement, we need to perform some computations to convert this number back).
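The three approaches can be compared side by side for an X-bar chart. The Python sketch below assumes a known center line and process Sigma and the usual 3 sigma limits; the function names and example values are illustrative assumptions.

```python
import math


def average_n_limits(center, sigma, sizes, k=3.0):
    """Straight-line limits based on the average sample size."""
    n_bar = sum(sizes) / len(sizes)
    half_width = k * sigma / math.sqrt(n_bar)
    return center - half_width, center + half_width


def variable_limits(center, sigma, sizes, k=3.0):
    """One (LCL, UCL) pair per sample, based on each sample's own size."""
    return [(center - k * sigma / math.sqrt(n), center + k * sigma / math.sqrt(n))
            for n in sizes]


def stabilized_points(sample_means, center, sigma, sizes):
    """Sample means expressed in units of their own standard error,
    so that the limits become the constants -3 and +3."""
    return [(m - center) / (sigma / math.sqrt(n))
            for m, n in zip(sample_means, sizes)]


# Illustrative values only
step_limits = variable_limits(10.0, 2.0, [4, 16])
z_values = stabilized_points([11.0, 11.0], 10.0, 2.0, [4, 16])
```

In this example the same sample mean of 11.0 lies 1 standard error above the center line for n = 4 but 2 standard errors above it for n = 16, which is why the stabilized values cannot be read in the original units.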

Control Charts for Variables vs. Charts for Attributes

Sometimes, the quality control engineer has a choice between variable control charts and attribute control charts.

Advantages of attribute control charts. Attribute control charts have the advantage of allowing for quick summaries of various aspects of the quality of a product, that is, the engineer may simply classify products as acceptable or unacceptable, based on various quality criteria. Thus, attribute charts sometimes bypass the need for expensive, precise devices and time-consuming measurement procedures. Also, this type of chart tends to be more easily understood by managers unfamiliar with quality control procedures; therefore, it may provide more persuasive (to management) evidence of quality problems.

Advantages of variable control charts. Variable control charts are more sensitive than attribute control charts (see Montgomery, 1985, p. 203). Therefore, variable control charts may alert us to quality problems before any actual "unacceptables" (as detected by the attribute chart) will occur. Montgomery (1985) calls the variable control charts leading indicators of trouble that will sound an alarm before the number of rejects (scrap) increases in the production process.

Control Chart for Individual Observations

Variable control charts can be constructed for individual observations taken from the production line, rather than samples of observations. This is sometimes necessary when testing samples of multiple observations would be too expensive, inconvenient, or impossible. For example, the number of customer complaints or product returns may only be available on a monthly basis; yet, one would like to chart those numbers to detect quality problems. Another common application of these charts occurs in cases when automated testing devices inspect every single unit that is produced. In that case, one is often primarily interested in detecting small shifts in the product quality (for example, gradual deterioration of quality due to machine wear). The CUSUM, MA, and EWMA charts of cumulative sums and weighted averages discussed below may be most applicable in those situations.

Out-Of-Control Process: Runs Tests

As mentioned earlier in the introduction, when a sample point (e.g., mean in an X-bar chart) falls outside the control lines, one has reason to believe that the process may no longer be in control. In addition, one should look for systematic patterns of points (e.g., means) across samples, because such patterns may indicate that the process average has shifted. These tests are also sometimes referred to as AT&T runs rules (see AT&T, 1959) or tests for special causes (e.g., see Nelson, 1984, 1985; Grant and Leavenworth, 1980; Shirland, 1993). The term special or assignable causes as opposed to chance or common causes was used by Shewhart to distinguish between a process that is in control, with variation due to random (chance) causes only, from a process that is out of control, with variation that is due to some non-chance or special (assignable) factors (cf. Montgomery, 1991, p. 102).

Like the sigma control limits discussed earlier, the runs rules are based on "statistical" reasoning. For example, the probability of any sample mean in an X-bar control chart falling above the center line is equal to 0.5, provided (1) that the process is in control (i.e., that the center line value is equal to the population mean), (2) that consecutive sample means are independent (i.e., not auto-correlated), and (3) that the distribution of means follows the normal distribution. Simply stated, under those conditions there is a 50-50 chance that a mean will fall above or below the center line. Thus, the probability that two consecutive means will fall above the center line is equal to 0.5 times 0.5 = 0.25.

Accordingly, the probability that 9 consecutive samples (or a run of 9 samples) will fall on a given side of the center line is equal to 0.5**9 = .00195 (or about .0039 for a run on either side). Note that this is of the same order as the probability with which a sample mean can be expected to fall outside the 3 times sigma limits (approximately .0027, given the normal distribution, and a process in control). Therefore, one could look for 9 consecutive sample means on the same side of the center line as another indication of an out-of-control condition. Refer to Duncan (1974) for details concerning the "statistical" interpretation of the other (more complex) tests.
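These probabilities are easy to verify. The short Python sketch below assumes independent, normally distributed sample means from a process that is in control; the variable names are illustrative.

```python
import math

# Probability that 9 consecutive means fall on one given side of the center line
p_run_one_side = 0.5 ** 9

# Probability that 9 consecutive means fall on the same side, above or below
p_run_either_side = 2 * p_run_one_side

# Probability that a single mean falls outside the 3 times sigma limits
p_outside_3_sigma = 1.0 - math.erf(3.0 / math.sqrt(2.0))
```

All three values are a fraction of one percent, which is why a run of 9 is treated as a signal comparable to a single point outside the control limits.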

Zone A, B, C. Customarily, to define the runs tests, the area above and below the chart center line is divided into three "zones."

By default, Zone A is defined as the area between 2 and 3 times sigma above and below the center line; Zone B is defined as the area between 1 and 2 times sigma, and Zone C is defined as the area between the center line and 1 times sigma.

9 points in Zone C or beyond (on one side of central line). If this test is positive (i.e., if this pattern is detected), then the process average has probably changed. Note that it is assumed that the distribution of the respective quality characteristic in the plot is symmetrical around the mean. This is, for example, not the case for R charts, S charts, or most attribute charts. However, this is still a useful test to alert the quality control engineer to potential shifts in the process. For example, successive samples with less-than-average variability may be worth investigating, since they may provide hints on how to decrease the variation in the process.

6 points in a row steadily increasing or decreasing. This test signals a drift in the process average. Often, such drift can be the result of tool wear, deteriorating maintenance, improvement in skill, etc. (Nelson, 1985).

14 points in a row alternating up and down. If this test is positive, it indicates that two systematically alternating causes are producing different results. For example, one may be using two alternating suppliers, or monitoring the quality for two different (alternating) shifts.

2 out of 3 points in a row in Zone A or beyond. This test provides an "early warning" of a process shift. Note that the probability of a false-positive (test is positive but process is in control) for this test in X-bar charts is approximately 2%.

4 out of 5 points in a row in Zone B or beyond. Like the previous test, this test may be considered to be an "early warning indicator" of a potential process shift. The false-positive error rate for this test is also about 2%.

15 points in a row in Zone C (above and below the center line). This test indicates a smaller variability than is expected (based on the current control limits).

8 points in a row in Zone B, A, or beyond, on either side of the center line (without points in Zone C). This test indicates that different samples are affected by different factors, resulting in a bimodal distribution of means. This may happen, for example, if different samples in an X-bar chart were produced by one of two different machines, where one produces above average parts, and the other below average parts.
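Three of the tests listed above can be sketched in Python as follows. This is only one possible way to code them: the function names are hypothetical, each function returns the positions (counting from 0) at which the respective pattern is completed, and points lying exactly on the center line are assumed to break a run.

```python
def run_on_one_side(values, center, run=9):
    """Positions where `run` consecutive points lie on the same side of the center line."""
    hits, streak, last_side = [], 0, 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            streak += 1
        else:
            streak = 1 if side != 0 else 0
        last_side = side
        if streak >= run:
            hits.append(i)
    return hits


def steady_trend(values, run=6):
    """Positions where `run` points in a row are steadily increasing or decreasing."""
    hits, streak, last_direction = [], 1, 0
    for i in range(1, len(values)):
        direction = (values[i] > values[i - 1]) - (values[i] < values[i - 1])
        if direction != 0 and direction == last_direction:
            streak += 1
        else:
            streak = 2 if direction != 0 else 1
        last_direction = direction
        if streak >= run:
            hits.append(i)
    return hits


def two_of_three_in_zone_a(values, center, sigma):
    """Positions where 2 of the last 3 points lie beyond 2 sigma on the same side."""
    hits = []
    for i in range(2, len(values)):
        window = values[i - 2:i + 1]
        above = sum(v > center + 2 * sigma for v in window)
        below = sum(v < center - 2 * sigma for v in window)
        if above >= 2 or below >= 2:
            hits.append(i)
    return hits


# Illustrative data only: a slow upward drift in six consecutive sample means
trend_hits = steady_trend([1, 2, 3, 4, 5, 6])
```

Here sigma denotes the standard deviation of the plotted points (for an X-bar chart, Sigma divided by the square root of n), so that the zone boundaries line up with the control limits.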

Operating Characteristic (OC) Curves

A common supplementary plot to standard quality control charts is the so-called operating characteristic or OC curve (see example below). One question that comes to mind when using standard variable or attribute charts is how sensitive is the current quality control procedure? Put in more specific terms, how likely is it that you will not find a sample (e.g., mean in an X-bar chart) outside the control limits (i.e., accept the production process as "in control"), when, in fact, it has shifted by a certain amount? This probability is usually referred to as the β (beta) error probability, that is, the probability of erroneously accepting a process (mean, mean proportion, mean rate defectives, etc.) as being "in control." Note that operating characteristic curves pertain to the false-acceptance probability using the sample-outside-of-control-limits criterion only, and not the runs tests described earlier.

Operating characteristic curves are extremely useful for exploring the power of our quality control procedure. The actual decision concerning sample sizes should depend not only on the cost of implementing the plan (e.g., cost per item sampled), but also on the costs resulting from not detecting quality problems. The OC curve allows the engineer to estimate the probabilities of not detecting shifts of certain sizes in the production quality.
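For an X-bar chart, points on the OC curve can be computed from the normal distribution. The Python sketch below is an illustration under stated assumptions: the sample means are normally distributed, the process Sigma is known and unchanged, the shift is expressed in units of Sigma, and the limits are the usual 3 sigma limits. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def xbar_beta(shift_in_sigma, n, k=3.0):
    """Probability that a sample mean stays inside the k sigma limits
    after the process mean has shifted by shift_in_sigma * Sigma."""
    d = shift_in_sigma * math.sqrt(n)
    return normal_cdf(k - d) - normal_cdf(-k - d)


# OC curve for samples of size 5, evaluated at a few illustrative shifts
oc_curve = [(shift, xbar_beta(shift, n=5)) for shift in (0.0, 0.5, 1.0, 1.5, 2.0)]
```

Recomputing the curve for several sample sizes shows how larger samples lower the probability of missing a shift of a given size, which is the trade-off against sampling cost discussed above.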

Process Capability Indices

For variable control charts, it is often desired to include so-called process capability indices in the summary graph. In short, process capability indices express (as a ratio) the proportion of parts or items produced by the current process that fall within user-specified limits (e.g., engineering tolerances).

For example, the so-called Cp index is computed as:

Cp = (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, and USL and LSL are the upper and lower specification (engineering) limits, respectively. If the distribution of the respective quality characteristic or variable (e.g., size of piston rings) is normal, and the process is perfectly centered (i.e., the mean is equal to the design center), then this index can be interpreted as the proportion of the range of the standard normal curve (the process width) that falls within the engineering specification limits. If the process is not centered, an adjusted index Cpk is used instead. For a "capable" process, the Cp index should be greater than 1, that is, the distance between the specification limits should be larger than 6 times sigma, so that over 99% of all items or parts produced could be expected to fall inside the acceptable engineering specifications. For a detailed discussion of this and other indices, refer to Process Analysis.
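A minimal Python sketch of these indices follows. The Cp function implements the formula given above; the Cpk function uses a commonly cited definition (the distance from the process mean to the nearer specification limit, divided by 3 times sigma), which is an assumption here because the adjusted index is not defined in this section. The function names and example values are illustrative.

```python
def cp_index(usl, lsl, sigma):
    """Cp = (USL - LSL) / (6 * sigma)."""
    return (usl - lsl) / (6.0 * sigma)


def cpk_index(usl, lsl, mean, sigma):
    """Assumed definition: min(USL - mean, mean - LSL) / (3 * sigma)."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)


# Illustrative values only: specification 10.0 +/- 0.6, sigma 0.1
cp = cp_index(10.6, 9.4, 0.1)                  # about 2
cpk_centered = cpk_index(10.6, 9.4, 10.0, 0.1)  # about 2 when the process is centered
cpk_shifted = cpk_index(10.6, 9.4, 10.3, 0.1)   # about 1 when the mean has moved to 10.3
```

Under this definition the two indices agree for a centered process, and Cpk drops as the mean moves toward either specification limit.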

Other Specialized Control Charts

The types of control charts mentioned so far are the "workhorses" of quality control, and they are probably the most widely used methods. However, with the advent of inexpensive desktop computing, procedures requiring more computational effort have become increasingly popular.

X-bar Charts For Non-Normal Data. The control limits for standard X-bar charts are constructed based on the assumption that the sample means are approximately normally distributed. Thus, the underlying individual observations do not have to be normally distributed, since, as the sample size increases, the distribution of the means will become approximately normal (see the discussion of the central limit theorem in Elementary Concepts; however, note that for R, S, and S**2 charts, it is assumed that the individual observations are normally distributed). Shewhart (1931) in his original work experimented with various non-normal distributions for individual observations, and evaluated the resulting distributions of means for samples of size four. He concluded that, indeed, the standard normal distribution-based control limits for the means are appropriate, as long as the underlying distribution of observations is approximately normal. (See also Hoyer and Ellis, 1996, for an introduction and discussion of the distributional assumptions for quality control charting.)

However, as Ryan (1989) points out, when the distribution of observations is highly skewed and the sample sizes are small, then the resulting standard control limits may produce a large number of false alarms (increased alpha error rate), as well as a larger number of false negative ("process-is-in-control") readings (increased beta-error rate). You can compute control limits (as well as process capability indices) for X-bar charts based on so-called Johnson curves (Johnson, 1949), which can approximate the skewness and kurtosis of a large range of non-normal distributions (see also Fitting Distributions by Moments, in Process Analysis). These non-normal X-bar charts are useful when the distribution of means across the samples is clearly skewed, or otherwise non-normal.

Hotelling T**2 Chart. When there are multiple related quality characteristics (recorded in several variables), we can produce a simultaneous plot (see example below) for all means based on Hotelling's multivariate T**2 statistic (first proposed by Hotelling, 1947).

Cumulative Sum (CUSUM) Chart. The CUSUM chart was first introduced by Page (1954); the mathematical principles involved in its construction are discussed in Ewan (1963), Johnson (1961), and Johnson and Leone (1962).

If one plots the cumulative sum of deviations of successive sample means from a target specification, even minor, permanent shifts in the process mean will eventually lead to a sizable cumulative sum of deviations. Thus, this chart is particularly well-suited for detecting such small permanent shifts that may go undetected when using the X-bar chart. For example, if, due to machine wear, a process slowly "slides" out of control to produce results above target specifications, this plot would show a steadily increasing (or decreasing) cumulative sum of deviations from specification.
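The plotted quantity is simply a running total of deviations from target. The following Python sketch illustrates this; the function name and the sample means are made up.

```python
def cusum(sample_means, target):
    """Cumulative sums of the deviations of successive sample means from target."""
    total = 0.0
    sums = []
    for mean in sample_means:
        total += mean - target
        sums.append(total)
    return sums


# Illustrative values only: means that sit slightly above a target of 10.0
drifting = cusum([10.1, 10.1, 10.2, 10.1, 10.2], 10.0)
```

Each individual deviation in this example is small, but the cumulative sum grows with every sample, which is what makes a small permanent shift visible.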

To establish control limits in such plots, Barnard (1959) proposed the so-called V-mask, which is plotted after the last sample (on the right). The V-mask can be thought of as the upper and lower control limits for the cumulative sums. However, rather than being parallel to the center line, these lines converge at a particular angle to the right, producing the appearance of a V rotated on its side. If the line representing the cumulative sum crosses either one of the two lines, the process is out of control.

Moving Average (MA) Chart. To return to the piston ring example, suppose we are mostly interested in detecting small trends across successive sample means. For example, we may be particularly concerned about machine wear, leading to a slow but constant deterioration of quality (i.e., deviation from specification). The CUSUM chart described above is one way to monitor such trends, and to detect small permanent shifts in the process average. Another way is to use some weighting scheme that summarizes the means of several successive samples; moving such a weighted mean across the samples will produce a moving average chart (as shown in the following graph).
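As a simple illustration, the Python sketch below uses equal weights, which is only one possible weighting scheme and is an assumption here. The function name and the span parameter (the number of successive sample means averaged) are hypothetical; for the first few samples, where fewer than span means are available, the average is taken over the means observed so far.

```python
def moving_average(sample_means, span):
    """Equally weighted average of the current and the preceding span - 1 sample means."""
    averages = []
    for i in range(len(sample_means)):
        window = sample_means[max(0, i - span + 1):i + 1]
        averages.append(sum(window) / len(window))
    return averages


# Illustrative values only
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], span=2)
```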

Exponentially-weighted Moving Average (EWMA) Chart. The idea of moving averages of successive (adjacent) samples can be generalized. In principle, in order to detect a trend we need to weight successive samples to form a moving average; however, instead of a simple arithmetic moving average, we could compute a geometric moving average (this chart is also called the Geometric Moving Average chart; see Montgomery, 1985, 1991).

Specifically, we could compute each data point for the plot as:

z_t = λ*x-bar_t + (1-λ)*z_(t-1)

In this formula, each point z_t is computed as λ (lambda) times the respective mean x-bar_t, plus one minus λ times the previous (computed) point in the plot, z_(t-1). The parameter λ (lambda) here should assume values greater than 0 and less than 1. Without going into detail (see Montgomery, 1985, p. 239), this method of averaging specifies that the weight of historically "old" sample means decreases geometrically as one continues to draw samples. The interpretation of this chart is much like that of the moving average chart, and it allows us to detect small shifts in the means, and, therefore, in the quality of the production process.
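The recursion can be written in a few lines of Python. In this sketch the function name is hypothetical, lam stands for the weighting parameter lambda, and the starting value (the point preceding the first sample) must be supplied by the user; using the target or an overall mean for it is a common choice but an assumption here.

```python
def ewma(sample_means, lam, start):
    """Exponentially weighted moving averages: z = lam * mean + (1 - lam) * previous z."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must be greater than 0 and less than 1")
    z = start
    points = []
    for mean in sample_means:
        z = lam * mean + (1.0 - lam) * z
        points.append(z)
    return points


# Illustrative values only: two sample means, lambda = 0.5, starting value 10.0
plot_points = ewma([10.0, 12.0], lam=0.5, start=10.0)
```

Smaller values of lam give more weight to the history of the process and therefore produce a smoother line.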

Regression Control Charts. Sometimes we want to monitor the relationship between two aspects of our production process. For example, a post office may want to monitor the number of worker-hours that are spent to process a certain amount of mail. These two variables should roughly be linearly correlated with each other, and the relationship can probably be described in terms of the well-known Pearson product-moment correlation coefficient r. This statistic is also described in Basic Statistics. The regression control chart contains a regression line that summarizes the linear relationship between the two variables of interest. The individual data points are also shown in the same graph. Around the regression line we establish a confidence interval within which we would expect a certain proportion (e.g., 95%) of samples to fall. Outliers in this plot may indicate samples where, for some reason, the common relationship between the two variables of interest does not hold.
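A minimal Python sketch of the idea follows. It fits a least squares line and flags points whose residuals lie more than `k` residual standard deviations from the line. This constant-width band is a simplification assumed here for illustration, not an exact confidence interval, and the function name and data are hypothetical.

```python
import math


def regression_control_limits(x, y, k=2.0):
    """Fit y = intercept + slope * x and flag points far from the line.

    Returns (slope, intercept, s, outliers), where s is the residual
    standard deviation and outliers lists the indices of points whose
    residual exceeds k * s in absolute value.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    outliers = [i for i, r in enumerate(residuals) if abs(r) > k * s]
    return slope, intercept, s, outliers


# Hypothetical data: mail volume (x) and worker-hours (y); one office
# (index 5) uses far more hours than the common relationship predicts.
volume = list(range(10))
hours = [2 * v for v in volume]
hours[5] += 20
print(regression_control_limits(volume, hours))
```

With `k=2.0` the band roughly corresponds to the 95% range mentioned above when the residuals are approximately normal.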

Applications. There are many useful applications for the regression control chart. For example, professional auditors may use this chart to identify retail outlets with a greater than expected number of cash transactions given the overall volume of sales, or grocery stores with a greater than expected number of coupons redeemed, given the total sales. In both instances, outliers in the regression control charts (e.g., too many cash transactions; too many coupons redeemed) may deserve closer scrutiny.

Pareto Chart Analysis. Quality problems are rarely spread evenly across the different aspects of the production process or different plants. Rather, a few "bad apples" often account for the majority of problems. This principle has come to be known as the Pareto principle, which basically states that quality losses are mal-distributed in such a way that a small percentage of possible causes are responsible for the majority of the quality problems. For example, a relatively small number of "dirty" cars are probably responsible for the majority of air pollution; the majority of losses in most companies result from the failure of only one or two products. To illustrate the "bad apples", one plots the Pareto chart, which simply amounts to a histogram showing the distribution of the quality loss (e.g., dollar loss) across some meaningful categories; usually, the categories are sorted into descending order of importance (frequency, dollar amounts, etc.). Very often, this chart provides useful guidance as to where to direct quality improvement efforts.
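The ordering behind a Pareto chart can be sketched in a few lines of Python: sort the categories by loss in descending order and accumulate the percentage of the total. The function name `pareto_table` and the loss figures are hypothetical and serve only to illustrate the computation.

```python
def pareto_table(losses):
    """Return (category, loss, cumulative percent) rows, largest loss first."""
    total = sum(losses.values())
    rows = []
    cumulative = 0.0
    for category, loss in sorted(losses.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += loss
        rows.append((category, loss, 100.0 * cumulative / total))
    return rows


# Hypothetical dollar losses by cause (illustrative values only).
losses = {"scratches": 1200, "wrong size": 8300, "packaging": 400, "paint": 2100}
for category, loss, cum_pct in pareto_table(losses):
    print(f"{category:12s} {loss:6d} {cum_pct:6.1f}%")
```

The cumulative column shows how quickly the first few categories account for most of the total loss, which is the pattern the Pareto principle describes.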

© Copyright StatSoft, Inc., 1984-2003

Distribution Tables

Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables, such as those presented below, has the advantage of showing many values simultaneously and thus enables the user to examine and quickly explore ranges of probabilities.

• Z Table

• t Table

• Chi-Square Table

• F Tables for:

o alpha=.10

o alpha=.05

o alpha=.025

o alpha=.01

Note that all table values were calculated using the distribution facilities in STATISTICA BASIC, and they were verified against other published tables.

Standard Normal (Z) Table

The Standard Normal distribution is used in various hypothesis tests including tests on single means, the difference between two means, and tests on proportions. The Standard Normal distribution has a mean of 0 and a standard deviation of 1. The animation above shows various (left) tail areas for this distribution. For more information on the Normal Distribution as it is used in statistical testing, see the chapter on Elementary Concepts. See also, the Normal Distribution.

As shown in the illustration below, the values inside the given table represent the areas under the standard normal curve between 0 and the respective z-score. For example, to determine the area under the curve between 0 and 2.36, look in the intersecting cell for the row labeled 2.3 and the column labeled 0.06. The area under the curve is .4909. To determine the area between 0 and a negative value, look in the intersecting cell of the row and column whose labels sum to the absolute value of the number in question. For example, the area under the curve between -1.3 and 0 is equal to the area under the curve between 0 and 1.3, so look at the cell in the 1.3 row and the 0.00 column (the area is 0.4032).
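The two lookups above can be reproduced with a short Python sketch that uses the error function from the standard library. The function name `area_0_to_z` is an assumption for this example; the identity used is that the area between 0 and z equals one half of erf(|z| / sqrt(2)).

```python
import math


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z.

    The curve is symmetric, so the sign of z is ignored.
    """
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


print(round(area_0_to_z(2.36), 4))   # 0.4909, row 2.3, column 0.06
print(round(area_0_to_z(-1.3), 4))   # 0.4032, row 1.3, column 0.00
```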

Area between 0 and z

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359

0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753

0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141

0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517

0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879

0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224

0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549

0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852

0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133

0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389

1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621

1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830

1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015

1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177

1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319

1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441

1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545

1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633

1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706

1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767

2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817

2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857

2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890

2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916

2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936

2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952

2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964

2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974

2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981

2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986

3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990

Student's t Table

The shape of the Student's t distribution is determined by the degrees of freedom. As shown in the animation above, its shape changes as the degrees of freedom increase. For more information on how this distribution is used in hypothesis testing, see t-test for independent samples and t-test for dependent samples in the chapter on Basic Statistics and Tables. See also, Student's t Distribution. As indicated by the chart below, the areas given at the top of this table are the right tail areas for the t-value inside the table. To determine the 0.05 critical value from the t-distribution with 6 degrees of freedom, look in the 0.05 column at the 6 row: t(.05,6) = 1.943180.
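For readers who want to check a table entry numerically, the following Python sketch recovers a critical value from the t density by numerical integration (Simpson's rule) and bisection. It relies only on the standard library; the function names, the number of integration steps, and the bracketing strategy are assumptions chosen for illustration, and a statistics library would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_right_tail(t, df, steps=2000):
    """Area to the right of t (t >= 0), using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(p, df):
    """Value of t whose right tail area is p (0 < p < 0.5), by bisection."""
    lo, hi = 0.0, 1.0
    while t_right_tail(hi, df) > p:  # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(t_critical(0.05, 6), 4))  # compare with the table entry 1.943180
```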

t table with right tail probabilities

df\p 0.40 0.25 0.10 0.05 0.025 0.01 0.005 0.0005

1 0.324920 1.000000 3.077684 6.313752 12.70620 31.82052 63.65674 636.6192

2 0.288675 0.816497 1.885618 2.919986 4.30265 6.96456 9.92484 31.5991

3 0.276671 0.764892 1.637744 2.353363 3.18245 4.54070 5.84091 12.9240

4 0.270722 0.740697 1.533206 2.131847 2.77645 3.74695 4.60409 8.6103

5 0.267181 0.726687 1.475884 2.015048 2.57058 3.36493 4.03214 6.8688

6 0.264835 0.717558 1.439756 1.943180 2.44691 3.14267 3.70743 5.9588

7 0.263167 0.711142 1.414924 1.894579 2.36462 2.99795 3.49948 5.4079

8 0.261921 0.706387 1.396815 1.859548 2.30600 2.89646 3.35539 5.0413

9 0.260955 0.702722 1.383029 1.833113 2.26216 2.82144 3.24984 4.7809

10 0.260185 0.699812 1.372184 1.812461 2.22814 2.76377 3.16927 4.5869

11 0.259556 0.697445 1.363430 1.795885 2.20099 2.71808 3.10581 4.4370

12 0.259033 0.695483 1.356217 1.782288 2.17881 2.68100 3.05454 4.3178

13 0.258591 0.693829 1.350171 1.770933 2.16037 2.65031 3.01228 4.2208

14 0.258213 0.692417 1.345030 1.761310 2.14479 2.62449 2.97684 4.1405

15 0.257885 0.691197 1.340606 1.753050 2.13145 2.60248 2.94671 4.0728

16 0.257599 0.690132 1.336757 1.745884 2.11991 2.58349 2.92078 4.0150

17 0.257347 0.689195 1.333379 1.739607 2.10982 2.56693 2.89823 3.9651

18 0.257123 0.688364 1.330391 1.734064 2.10092 2.55238 2.87844 3.9216

19 0.256923 0.687621 1.327728 1.729133 2.09302 2.53948 2.86093 3.8834

20 0.256743 0.686954 1.325341 1.724718 2.08596 2.52798 2.84534 3.8495

21 0.256580 0.686352 1.323188 1.720743 2.07961 2.51765 2.83136 3.8193

22 0.256432 0.685805 1.321237 1.717144 2.07387 2.50832 2.81876 3.7921

23 0.256297 0.685306 1.319460 1.713872 2.06866 2.49987 2.80734 3.7676

24 0.256173 0.684850 1.317836 1.710882 2.06390 2.49216 2.79694 3.7454

25 0.256060 0.684430 1.316345 1.708141 2.05954 2.48511 2.78744 3.7251

26 0.255955 0.684043 1.314972 1.705618 2.05553 2.47863 2.77871 3.7066

27 0.255858 0.683685 1.313703 1.703288 2.05183 2.47266 2.77068 3.6896

28 0.255768 0.683353 1.312527 1.701131 2.04841 2.46714 2.76326 3.6739

29 0.255684 0.683044 1.311434 1.699127 2.04523 2.46202 2.75639 3.6594

30 0.255605 0.682756 1.310415 1.697261 2.04227 2.45726 2.75000 3.6460

inf 0.253347 0.674490 1.281552 1.644854 1.95996 2.32635 2.57583 3.2905

Chi-Square Table

Like the Student's t-Distribution, the Chi-square distribution's shape is determined by its degrees of freedom. The animation above shows the shape of the Chi-square distribution as the degrees of freedom increase (1, 2, 5, 10, 25 and 50). For examples of hypothesis tests that use the Chi-square distribution, see Statistics in crosstabulation tables in the Basic Statistics and Tables chapter as well as the Nonlinear Estimation chapter. See also, Chi-square Distribution. As shown in the illustration below, the values inside this table are critical values of the Chi-square distribution with the corresponding degrees of freedom. To determine the value from a Chi-square distribution (with specific degrees of freedom) which has a given area above it, go to the given area column and the desired degrees of freedom row. For example, the .25 critical value for a Chi-square with 4 degrees of freedom is 5.38527. This means that the area to the right of 5.38527 in a Chi-square distribution with 4 degrees of freedom is .25.
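The worked example can be checked with a small Python sketch. For an even number of degrees of freedom, df = 2m, the right tail area has the closed form exp(-x/2) times the sum of (x/2)^k / k! for k = 0, ..., m-1. The sketch below covers only that special case, and the function name is an assumption made for illustration; odd degrees of freedom require the incomplete gamma function, which is not shown here.

```python
import math


def chi2_right_tail_even_df(x, df):
    """Area to the right of x under a Chi-square curve with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles only positive, even df")
    half = x / 2
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


print(round(chi2_right_tail_even_df(5.38527, 4), 4))  # 0.25, as in the example
```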

Right tail areas for the Chi-square Distribution

df\area .995 .990 .975 .950 .900 .750 .500 .250 .100 .050 .025 .010 .005

1 0.00004 0.00016 0.00098 0.00393 0.01579 0.10153 0.45494 1.32330 2.70554 3.84146 5.02389 6.63490 7.87944

2 0.01003 0.02010 0.05064 0.10259 0.21072 0.57536 1.38629 2.77259 4.60517 5.99146 7.37776 9.21034 10.59663

3 0.07172 0.11483 0.21580 0.35185 0.58437 1.21253 2.36597 4.10834 6.25139 7.81473 9.34840 11.34487 12.83816

4 0.20699 0.29711 0.48442 0.71072 1.06362 1.92256 3.35669 5.38527 7.77944 9.48773 11.14329 13.27670 14.86026

5 0.41174 0.55430 0.83121 1.14548 1.61031 2.67460 4.35146 6.62568 9.23636 11.07050 12.83250 15.08627 16.74960

6 0.67573 0.87209 1.23734 1.63538 2.20413 3.45460 5.34812 7.84080 10.64464 12.59159 14.44938 16.81189 18.54758

7 0.98926 1.23904 1.68987 2.16735 2.83311 4.25485 6.34581 9.03715 12.01704 14.06714 16.01276 18.47531 20.27774

8 1.34441 1.64650 2.17973 2.73264 3.48954 5.07064 7.34412 10.21885 13.36157 15.50731 17.53455 20.09024 21.95495

9 1.73493 2.08790 2.70039 3.32511 4.16816 5.89883 8.34283 11.38875 14.68366 16.91898 19.02277 21.66599 23.58935

10 2.15586 2.55821 3.24697 3.94030 4.86518 6.73720 9.34182 12.54886 15.98718 18.30704 20.48318 23.20925 25.18818

11 2.60322 3.05348 3.81575 4.57481 5.57778 7.58414 10.34100 13.70069 17.27501 19.67514 21.92005 24.72497 26.75685

12 3.07382 3.57057 4.40379 5.22603 6.30380 8.43842 11.34032 14.84540 18.54935 21.02607 23.33666 26.21697 28.29952

13 3.56503 4.10692 5.00875 5.89186 7.04150 9.29907 12.33976 15.98391 19.81193 22.36203 24.73560 27.68825 29.81947

14 4.07467 4.66043 5.62873 6.57063 7.78953 10.16531 13.33927 17.11693 21.06414 23.68479 26.11895 29.14124 31.31935

15 4.60092 5.22935 6.26214 7.26094 8.54676 11.03654 14.33886 18.24509 22.30713 24.99579 27.48839 30.57791 32.80132

16 5.14221 5.81221 6.90766 7.96165 9.31224 11.91222 15.33850 19.36886 23.54183 26.29623 28.84535 31.99993 34.26719

17 5.69722 6.40776 7.56419 8.67176 10.08519 12.79193 16.33818 20.48868 24.76904 27.58711 30.19101 33.40866 35.71847

18 6.26480 7.01491 8.23075 9.39046 10.86494 13.67529 17.33790 21.60489 25.98942 28.86930 31.52638 34.80531 37.15645

19 6.84397 7.63273 8.90652 10.11701 11.65091 14.56200 18.33765 22.71781 27.20357 30.14353 32.85233 36.19087 38.58226

20 7.43384 8.26040 9.59078 10.85081 12.44261 15.45177 19.33743 23.82769 28.41198 31.41043 34.16961 37.56623 39.99685

21 8.03365 8.89720 10.28290 11.59131 13.23960 16.34438 20.33723 24.93478 29.61509 32.67057 35.47888 38.93217 41.40106

22 8.64272 9.54249 10.98232 12.33801 14.04149 17.23962 21.33704 26.03927 30.81328 33.92444 36.78071 40.28936 42.79565

23 9.26042 10.19572 11.68855 13.09051 14.84796 18.13730 22.33688 27.14134 32.00690 35.17246 38.07563 41.63840 44.18128

24 9.88623 10.85636 12.40115 13.84843 15.65868 19.03725 23.33673 28.24115 33.19624 36.41503 39.36408 42.97982 45.55851

25 10.51965 11.52398 13.11972 14.61141 16.47341 19.93934 24.33659 29.33885 34.38159 37.65248 40.64647 44.31410 46.92789

26 11.16024 12.19815 13.84390 15.37916 17.29188 20.84343 25.33646 30.43457 35.56317 38.88514 41.92317 45.64168 48.28988

27 11.80759 12.87850 14.57338 16.15140 18.11390 21.74940 26.33634 31.52841 36.74122 40.11327 43.19451 46.96294 49.64492

28 12.46134 13.56471 15.30786 16.92788 18.93924 22.65716 27.33623 32.62049 37.91592 41.33714 44.46079 48.27824 50.99338

29 13.12115 14.25645 16.04707 17.70837 19.76774 23.56659 28.33613 33.71091 39.08747 42.55697 45.72229 49.58788 52.33562

30 13.78672 14.95346 16.79077 18.49266 20.59923 24.47761 29.33603 34.79974 40.25602 43.77297 46.97924 50.89218 53.67196

F Distribution Tables

The F distribution is a right-skewed distribution used most commonly in Analysis of Variance (see ANOVA/MANOVA). The F distribution is the distribution of the ratio of two independent Chi-square variables, each divided by its degrees of freedom, and a specific F distribution is denoted by the degrees of freedom for the numerator Chi-square and the degrees of freedom for the denominator Chi-square. An example of the F(10,10) distribution is shown in the animation above. When referencing the F distribution, the numerator degrees of freedom are always given first, as switching the order of degrees of freedom changes the distribution (e.g., F(10,12) does not equal F(12,10)). For the four F tables below, the rows represent denominator degrees of freedom and the columns represent numerator degrees of freedom. The right tail area is given in the name of the table. For example, to determine the .05 critical value for an F distribution with 10 and 12 degrees of freedom, look in the 10 column (numerator) and 12 row (denominator) of the F Table for alpha=.05. F(.05, 10, 12) = 2.7534.
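As a numerical cross-check of the example, the following Python sketch integrates the F density with Simpson's rule to obtain the right tail area beyond a given value. The function names and the number of integration steps are assumptions for illustration, and the sketch is restricted to more than 2 numerator degrees of freedom so that the density is zero at the origin.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 (numerator) and d2 (denominator) df."""
    if x <= 0:
        return 0.0  # valid at x = 0 only because this sketch assumes d1 > 2
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_right_tail(f, d1, d2, steps=2000):
    """Area to the right of f, using Simpson's rule on [0, f]."""
    if d1 <= 2:
        raise ValueError("this sketch assumes more than 2 numerator df")
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3


# Compare with the right tail area of .05 named in the example.
print(round(f_right_tail(2.7534, 10, 12), 4))
# Swapping the degrees of freedom gives a different tail area.
print(round(f_right_tail(2.7534, 12, 10), 4))
```

The second call illustrates why the order of the degrees of freedom matters: the same cutoff leaves a different area in the tail of F(12,10) than in the tail of F(10,12).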

F Table for alpha=.10

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 INF

1 39.86346 49.50000 53.59324 55.83296 57.24008 58.20442 58.90595 59.43898 59.85759 60.19498 60.70521 61.22034 61.74029 62.00205 62.26497 62.52905 62.79428 63.06064 63.32812

2 8.52632 9.00000 9.16179 9.24342 9.29263 9.32553 9.34908 9.36677 9.38054 9.39157 9.40813 9.42471 9.44131 9.44962 9.45793 9.46624 9.47456 9.48289 9.49122

3 5.53832 5.46238 5.39077 5.34264 5.30916 5.28473 5.26619 5.25167 5.24000 5.23041 5.21562 5.20031 5.18448 5.17636 5.16811 5.15972 5.15119 5.14251 5.13370

4 4.54477 4.32456 4.19086 4.10725 4.05058 4.00975 3.97897 3.95494 3.93567 3.91988 3.89553 3.87036 3.84434 3.83099 3.81742 3.80361 3.78957 3.77527 3.76073

5 4.06042 3.77972 3.61948 3.52020 3.45298 3.40451 3.36790 3.33928 3.31628 3.29740 3.26824 3.23801 3.20665 3.19052 3.17408 3.15732 3.14023 3.12279 3.10500

6 3.77595 3.46330 3.28876 3.18076 3.10751 3.05455 3.01446 2.98304 2.95774 2.93693 2.90472 2.87122 2.83634 2.81834 2.79996 2.78117 2.76195 2.74229 2.72216

7 3.58943 3.25744 3.07407 2.96053 2.88334 2.82739 2.78493 2.75158 2.72468 2.70251 2.66811 2.63223 2.59473 2.57533 2.55546 2.53510 2.51422 2.49279 2.47079

8 3.45792 3.11312 2.92380 2.80643 2.72645 2.66833 2.62413 2.58935 2.56124 2.53804 2.50196 2.46422 2.42464 2.40410 2.38302 2.36136 2.33910 2.31618 2.29257

9 3.36030 3.00645 2.81286 2.69268 2.61061 2.55086 2.50531 2.46941 2.44034 2.41632 2.37888 2.33962 2.29832 2.27683 2.25472 2.23196 2.20849 2.18427 2.15923

10 3.28502 2.92447 2.72767 2.60534 2.52164 2.46058 2.41397 2.37715 2.34731 2.32260 2.28405 2.24351 2.20074 2.17843 2.15543 2.13169 2.10716 2.08176 2.05542

11 3.22520 2.85951 2.66023 2.53619 2.45118 2.38907 2.34157 2.30400 2.27350 2.24823 2.20873 2.16709 2.12305 2.10001 2.07621 2.05161 2.02612 1.99965 1.97211

12 3.17655 2.80680 2.60552 2.48010 2.39402 2.33102 2.28278 2.24457 2.21352 2.18776 2.14744 2.10485 2.05968 2.03599 2.01149 1.98610 1.95973 1.93228 1.90361

13 3.13621 2.76317 2.56027 2.43371 2.34672 2.28298 2.23410 2.19535 2.16382 2.13763 2.09659 2.05316 2.00698 1.98272 1.95757 1.93147 1.90429 1.87591 1.84620

14 3.10221 2.72647 2.52222 2.39469 2.30694 2.24256 2.19313 2.15390 2.12195 2.09540 2.05371 2.00953 1.96245 1.93766 1.91193 1.88516 1.85723 1.82800 1.79728

15 3.07319 2.69517 2.48979 2.36143 2.27302 2.20808 2.15818 2.11853 2.08621 2.05932 2.01707 1.97222 1.92431 1.89904 1.87277 1.84539 1.81676 1.78672 1.75505

16 3.04811 2.66817 2.46181 2.33274 2.24376 2.17833 2.12800 2.08798 2.05533 2.02815 1.98539 1.93992 1.89127 1.86556 1.83879 1.81084 1.78156 1.75075 1.71817

17 3.02623 2.64464 2.43743 2.30775 2.21825 2.15239 2.10169 2.06134 2.02839 2.00094 1.95772 1.91169 1.86236 1.83624 1.80901 1.78053 1.75063 1.71909 1.68564

18 3.00698 2.62395 2.41601 2.28577 2.19583 2.12958 2.07854 2.03789 2.00467 1.97698 1.93334 1.88681 1.83685 1.81035 1.78269 1.75371 1.72322 1.69099 1.65671

19 2.98990 2.60561 2.39702 2.26630 2.17596 2.10936 2.05802 2.01710 1.98364 1.95573 1.91170 1.86471 1.81416 1.78731 1.75924 1.72979 1.69876 1.66587 1.63077

20 2.97465 2.58925 2.38009 2.24893 2.15823 2.09132 2.03970 1.99853 1.96485 1.93674 1.89236 1.84494 1.79384 1.76667 1.73822 1.70833 1.67678 1.64326 1.60738

21 2.96096 2.57457 2.36489 2.23334 2.14231 2.07512 2.02325 1.98186 1.94797 1.91967 1.87497 1.82715 1.77555 1.74807 1.71927 1.68896 1.65691 1.62278 1.58615

22 2.94858 2.56131 2.35117 2.21927 2.12794 2.06050 2.00840 1.96680 1.93273 1.90425 1.85925 1.81106 1.75899 1.73122 1.70208 1.67138 1.63885 1.60415 1.56678

23 2.93736 2.54929 2.33873 2.20651 2.11491 2.04723 1.99492 1.95312 1.91888 1.89025 1.84497 1.79643 1.74392 1.71588 1.68643 1.65535 1.62237 1.58711 1.54903

24 2.92712 2.53833 2.32739 2.19488 2.10303 2.03513 1.98263 1.94066 1.90625 1.87748 1.83194 1.78308 1.73015 1.70185 1.67210 1.64067 1.60726 1.57146 1.53270

25 2.91774 2.52831 2.31702 2.18424 2.09216 2.02406 1.97138 1.92925 1.89469 1.86578 1.82000 1.77083 1.71752 1.68898 1.65895 1.62718 1.59335 1.55703 1.51760

26 2.90913 2.51910 2.30749 2.17447 2.08218 2.01389 1.96104 1.91876 1.88407 1.85503 1.80902 1.75957 1.70589 1.67712 1.64682 1.61472 1.58050 1.54368 1.50360

27 2.90119 2.51061 2.29871 2.16546 2.07298 2.00452 1.95151 1.90909 1.87427 1.84511 1.79889 1.74917 1.69514 1.66616 1.63560 1.60320 1.56859 1.53129 1.49057

28 2.89385 2.50276 2.29060 2.15714 2.06447 1.99585 1.94270 1.90014 1.86520 1.83593 1.78951 1.73954 1.68519 1.65600 1.62519 1.59250 1.55753 1.51976 1.47841

29 2.88703 2.49548 2.28307 2.14941 2.05658 1.98781 1.93452 1.89184 1.85679 1.82741 1.78081 1.73060 1.67593 1.64655 1.61551 1.58253 1.54721 1.50899 1.46704

30 2.88069 2.48872 2.27607 2.14223 2.04925 1.98033 1.92692 1.88412 1.84896 1.81949 1.77270 1.72227 1.66731 1.63774 1.60648 1.57323 1.53757 1.49891 1.45636

40 2.83535 2.44037 2.22609 2.09095 1.99682 1.92688 1.87252 1.82886 1.79290 1.76269 1.71456 1.66241 1.60515 1.57411 1.54108 1.50562 1.46716 1.42476 1.37691

60 2.79107 2.39325 2.17741 2.04099 1.94571 1.87472 1.81939 1.77483 1.73802 1.70701 1.65743 1.60337 1.54349 1.51072 1.47554 1.43734 1.39520 1.34757 1.29146

120 2.74781 2.34734 2.12999 1.99230 1.89587 1.82381 1.76748 1.72196 1.68425 1.65238 1.60120 1.54500 1.48207 1.44723 1.40938 1.36760 1.32034 1.26457 1.19256

inf 2.70554 2.30259 2.08380 1.94486 1.84727 1.77411 1.71672 1.67020 1.63152 1.59872 1.54578 1.48714 1.42060 1.38318 1.34187 1.29513 1.23995 1.16860 1.00000

F Table for alpha=.05

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 INF

1 161.4476 199.5000 215.7073 224.5832 230.1619 233.9860 236.7684 238.8827 240.5433 241.8817 243.9060 245.9499 248.0131 249.0518 250.0951 251.1432 252.1957 253.2529 254.3144

2 18.5128 19.0000 19.1643 19.2468 19.2964 19.3295 19.3532 19.3710 19.3848 19.3959 19.4125 19.4291 19.4458 19.4541 19.4624 19.4707 19.4791 19.4874 19.4957

3 10.1280 9.5521 9.2766 9.1172 9.0135 8.9406 8.8867 8.8452 8.8123 8.7855 8.7446 8.7029 8.6602 8.6385 8.6166 8.5944 8.5720 8.5494 8.5264

4 7.7086 6.9443 6.5914 6.3882 6.2561 6.1631 6.0942 6.0410 5.9988 5.9644 5.9117 5.8578 5.8025 5.7744 5.7459 5.7170 5.6877 5.6581 5.6281

5 6.6079 5.7861 5.4095 5.1922 5.0503 4.9503 4.8759 4.8183 4.7725 4.7351 4.6777 4.6188 4.5581 4.5272 4.4957 4.4638 4.4314 4.3985 4.3650

6 5.9874 5.1433 4.7571 4.5337 4.3874 4.2839 4.2067 4.1468 4.0990 4.0600 3.9999 3.9381 3.8742 3.8415 3.8082 3.7743 3.7398 3.7047 3.6689

7 5.5914 4.7374 4.3468 4.1203 3.9715 3.8660 3.7870 3.7257 3.6767 3.6365 3.5747 3.5107 3.4445 3.4105 3.3758 3.3404 3.3043 3.2674 3.2298

8 5.3177 4.4590 4.0662 3.8379 3.6875 3.5806 3.5005 3.4381 3.3881 3.3472 3.2839 3.2184 3.1503 3.1152 3.0794 3.0428 3.0053 2.9669 2.9276

9 5.1174 4.2565 3.8625 3.6331 3.4817 3.3738 3.2927 3.2296 3.1789 3.1373 3.0729 3.0061 2.9365 2.9005 2.8637 2.8259 2.7872 2.7475 2.7067

10 4.9646 4.1028 3.7083 3.4780 3.3258 3.2172 3.1355 3.0717 3.0204 2.9782 2.9130 2.8450 2.7740 2.7372 2.6996 2.6609 2.6211 2.5801 2.5379

11 4.8443 3.9823 3.5874 3.3567 3.2039 3.0946 3.0123 2.9480 2.8962 2.8536 2.7876 2.7186 2.6464 2.6090 2.5705 2.5309 2.4901 2.4480 2.4045

12 4.7472 3.8853 3.4903 3.2592 3.1059 2.9961 2.9134 2.8486 2.7964 2.7534 2.6866 2.6169 2.5436 2.5055 2.4663 2.4259 2.3842 2.3410 2.2962

13 4.6672 3.8056 3.4105 3.1791 3.0254 2.9153 2.8321 2.7669 2.7144 2.6710 2.6037 2.5331 2.4589 2.4202 2.3803 2.3392 2.2966 2.2524 2.2064

14 4.6001 3.7389 3.3439 3.1122 2.9582 2.8477 2.7642 2.6987 2.6458 2.6022 2.5342 2.4630 2.3879 2.3487 2.3082 2.2664 2.2229 2.1778 2.1307

15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7905 2.7066 2.6408 2.5876 2.5437 2.4753 2.4034 2.3275 2.2878 2.2468 2.2043 2.1601 2.1141 2.0658

16 4.4940 3.6337 3.2389 3.0069 2.8524 2.7413 2.6572 2.5911 2.5377 2.4935 2.4247 2.3522 2.2756 2.2354 2.1938 2.1507 2.1058 2.0589 2.0096

17 4.4513 3.5915 3.1968 2.9647 2.8100 2.6987 2.6143 2.5480 2.4943 2.4499 2.3807 2.3077 2.2304 2.1898 2.1477 2.1040 2.0584 2.0107 1.9604

18 4.4139 3.5546 3.1599 2.9277 2.7729 2.6613 2.5767 2.5102 2.4563 2.4117 2.3421 2.2686 2.1906 2.1497 2.1071 2.0629 2.0166 1.9681 1.9168

19 4.3807 3.5219 3.1274 2.8951 2.7401 2.6283 2.5435 2.4768 2.4227 2.3779 2.3080 2.2341 2.1555 2.1141 2.0712 2.0264 1.9795 1.9302 1.8780

20 4.3512 3.4928 3.0984 2.8661 2.7109 2.5990 2.5140 2.4471 2.3928 2.3479 2.2776 2.2033 2.1242 2.0825 2.0391 1.9938 1.9464 1.8963 1.8432

21 4.3248 3.4668 3.0725 2.8401 2.6848 2.5727 2.4876 2.4205 2.3660 2.3210 2.2504 2.1757 2.0960 2.0540 2.0102 1.9645 1.9165 1.8657 1.8117

22 4.3009 3.4434 3.0491 2.8167 2.6613 2.5491 2.4638 2.3965 2.3419 2.2967 2.2258 2.1508 2.0707 2.0283 1.9842 1.9380 1.8894 1.8380 1.7831

23 4.2793 3.4221 3.0280 2.7955 2.6400 2.5277 2.4422 2.3748 2.3201 2.2747 2.2036 2.1282 2.0476 2.0050 1.9605 1.9139 1.8648 1.8128 1.7570

24 4.2597 3.4028 3.0088 2.7763 2.6207 2.5082 2.4226 2.3551 2.3002 2.2547 2.1834 2.1077 2.0267 1.9838 1.9390 1.8920 1.8424 1.7896 1.7330

25 4.2417 3.3852 2.9912 2.7587 2.6030 2.4904 2.4047 2.3371 2.2821 2.2365 2.1649 2.0889 2.0075 1.9643 1.9192 1.8718 1.8217 1.7684 1.7110

26 4.2252 3.3690 2.9752 2.7426 2.5868 2.4741 2.3883 2.3205 2.2655 2.2197 2.1479 2.0716 1.9898 1.9464 1.9010 1.8533 1.8027 1.7488 1.6906

27 4.2100 3.3541 2.9604 2.7278 2.5719 2.4591 2.3732 2.3053 2.2501 2.2043 2.1323 2.0558 1.9736 1.9299 1.8842 1.8361 1.7851 1.7306 1.6717

28 4.1960 3.3404 2.9467 2.7141 2.5581 2.4453 2.3593 2.2913 2.2360 2.1900 2.1179 2.0411 1.9586 1.9147 1.8687 1.8203 1.7689 1.7138 1.6541

29 4.1830 3.3277 2.9340 2.7014 2.5454 2.4324 2.3463 2.2783 2.2229 2.1768 2.1045 2.0275 1.9446 1.9005 1.8543 1.8055 1.7537 1.6981 1.6376

30 4.1709 3.3158 2.9223 2.6896 2.5336 2.4205 2.3343 2.2662 2.2107 2.1646 2.0921 2.0148 1.9317 1.8874 1.8409 1.7918 1.7396 1.6835 1.6223

40 4.0847 3.2317 2.8387 2.6060 2.4495 2.3359 2.2490 2.1802 2.1240 2.0772 2.0035 1.9245 1.8389 1.7929 1.7444 1.6928 1.6373 1.5766 1.5089

60 4.0012 3.1504 2.7581 2.5252 2.3683 2.2541 2.1665 2.0970 2.0401 1.9926 1.9174 1.8364 1.7480 1.7001 1.6491 1.5943 1.5343 1.4673 1.3893

120 3.9201 3.0718 2.6802 2.4472 2.2899 2.1750 2.0868 2.0164 1.9588 1.9105 1.8337 1.7505 1.6587 1.6084 1.5543 1.4952 1.4290 1.3519 1.2539

inf 3.8415 2.9957 2.6049 2.3719 2.2141 2.0986 2.0096 1.9384 1.8799 1.8307 1.7522 1.6664 1.5705 1.5173 1.4591 1.3940 1.3180 1.2214 1.0000

F Table for alpha=.025

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 INF

1 647.7890 799.5000 864.1630 899.5833 921.8479 937.1111 948.2169 956.6562 963.2846 968.6274 976.7079 984.8668 993.1028 997.2492 1001.414 1005.598 1009.800 1014.020 1018.258

2 38.5063 39.0000 39.1655 39.2484 39.2982 39.3315 39.3552 39.3730 39.3869 39.3980 39.4146 39.4313 39.4479 39.4562 39.465 39.473 39.481 39.490 39.498

3 17.4434 16.0441 15.4392 15.1010 14.8848 14.7347 14.6244 14.5399 14.4731 14.4189 14.3366 14.2527 14.1674 14.1241 14.081 14.037 13.992 13.947 13.902

4 12.2179 10.6491 9.9792 9.6045 9.3645 9.1973 9.0741 8.9796 8.9047 8.8439 8.7512 8.6565 8.5599 8.5109 8.461 8.411 8.360 8.309 8.257

5 10.0070 8.4336 7.7636 7.3879 7.1464 6.9777 6.8531 6.7572 6.6811 6.6192 6.5245 6.4277 6.3286 6.2780 6.227 6.175 6.123 6.069 6.015

6 8.8131 7.2599 6.5988 6.2272 5.9876 5.8198 5.6955 5.5996 5.5234 5.4613 5.3662 5.2687 5.1684 5.1172 5.065 5.012 4.959 4.904 4.849

7 8.0727 6.5415 5.8898 5.5226 5.2852 5.1186 4.9949 4.8993 4.8232 4.7611 4.6658 4.5678 4.4667 4.4150 4.362 4.309 4.254 4.199 4.142

8 7.5709 6.0595 5.4160 5.0526 4.8173 4.6517 4.5286 4.4333 4.3572 4.2951 4.1997 4.1012 3.9995 3.9472 3.894 3.840 3.784 3.728 3.670

9 7.2093 5.7147 5.0781 4.7181 4.4844 4.3197 4.1970 4.1020 4.0260 3.9639 3.8682 3.7694 3.6669 3.6142 3.560 3.505 3.449 3.392 3.333

10 6.9367 5.4564 4.8256 4.4683 4.2361 4.0721 3.9498 3.8549 3.7790 3.7168 3.6209 3.5217 3.4185 3.3654 3.311 3.255 3.198 3.140 3.080

11 6.7241 5.2559 4.6300 4.2751 4.0440 3.8807 3.7586 3.6638 3.5879 3.5257 3.4296 3.3299 3.2261 3.1725 3.118 3.061 3.004 2.944 2.883

12 6.5538 5.0959 4.4742 4.1212 3.8911 3.7283 3.6065 3.5118 3.4358 3.3736 3.2773 3.1772 3.0728 3.0187 2.963 2.906 2.848 2.787 2.725

13 6.4143 4.9653 4.3472 3.9959 3.7667 3.6043 3.4827 3.3880 3.3120 3.2497 3.1532 3.0527 2.9477 2.8932 2.837 2.780 2.720 2.659 2.595

14 6.2979 4.8567 4.2417 3.8919 3.6634 3.5014 3.3799 3.2853 3.2093 3.1469 3.0502 2.9493 2.8437 2.7888 2.732 2.674 2.614 2.552 2.487

15 6.1995 4.7650 4.1528 3.8043 3.5764 3.4147 3.2934 3.1987 3.1227 3.0602 2.9633 2.8621 2.7559 2.7006 2.644 2.585 2.524 2.461 2.395

16 6.1151 4.6867 4.0768 3.7294 3.5021 3.3406 3.2194 3.1248 3.0488 2.9862 2.8890 2.7875 2.6808 2.6252 2.568 2.509 2.447 2.383 2.316

17 6.0420 4.6189 4.0112 3.6648 3.4379 3.2767 3.1556 3.0610 2.9849 2.9222 2.8249 2.7230 2.6158 2.5598 2.502 2.442 2.380 2.315 2.247

18 5.9781 4.5597 3.9539 3.6083 3.3820 3.2209 3.0999 3.0053 2.9291 2.8664 2.7689 2.6667 2.5590 2.5027 2.445 2.384 2.321 2.256 2.187

19 5.9216 4.5075 3.9034 3.5587 3.3327 3.1718 3.0509 2.9563 2.8801 2.8172 2.7196 2.6171 2.5089 2.4523 2.394 2.333 2.270 2.203 2.133

20 5.8715 4.4613 3.8587 3.5147 3.2891 3.1283 3.0074 2.9128 2.8365 2.7737 2.6758 2.5731 2.4645 2.4076 2.349 2.287 2.223 2.156 2.085

21 5.8266 4.4199 3.8188 3.4754 3.2501 3.0895 2.9686 2.8740 2.7977 2.7348 2.6368 2.5338 2.4247 2.3675 2.308 2.246 2.182 2.114 2.042

22 5.7863 4.3828 3.7829 3.4401 3.2151 3.0546 2.9338 2.8392 2.7628 2.6998 2.6017 2.4984 2.3890 2.3315 2.272 2.210 2.145 2.076 2.003

23 5.7498 4.3492 3.7505 3.4083 3.1835 3.0232 2.9023 2.8077 2.7313 2.6682 2.5699 2.4665 2.3567 2.2989 2.239 2.176 2.111 2.041 1.968

24 5.7166 4.3187 3.7211 3.3794 3.1548 2.9946 2.8738 2.7791 2.7027 2.6396 2.5411 2.4374 2.3273 2.2693 2.209 2.146 2.080 2.010 1.935

25 5.6864 4.2909 3.6943 3.3530 3.1287 2.9685 2.8478 2.7531 2.6766 2.6135 2.5149 2.4110 2.3005 2.2422 2.182 2.118 2.052 1.981 1.906

26 5.6586 4.2655 3.6697 3.3289 3.1048 2.9447 2.8240 2.7293 2.6528 2.5896 2.4908 2.3867 2.2759 2.2174 2.157 2.093 2.026 1.954 1.878

27 5.6331 4.2421 3.6472 3.3067 3.0828 2.9228 2.8021 2.7074 2.6309 2.5676 2.4688 2.3644 2.2533 2.1946 2.133 2.069 2.002 1.930 1.853

28 5.6096 4.2205 3.6264 3.2863 3.0626 2.9027 2.7820 2.6872 2.6106 2.5473 2.4484 2.3438 2.2324 2.1735 2.112 2.048 1.980 1.907 1.829

29 5.5878 4.2006 3.6072 3.2674 3.0438 2.8840 2.7633 2.6686 2.5919 2.5286 2.4295 2.3248 2.2131 2.1540 2.092 2.028 1.959 1.886 1.807

30 5.5675 4.1821 3.5894 3.2499 3.0265 2.8667 2.7460 2.6513 2.5746 2.5112 2.4120 2.3072 2.1952 2.1359 2.074 2.009 1.940 1.866 1.787

40 5.4239 4.0510 3.4633 3.1261 2.9037 2.7444 2.6238 2.5289 2.4519 2.3882 2.2882 2.1819 2.0677 2.0069 1.943 1.875 1.803 1.724 1.637

60 5.2856 3.9253 3.3425 3.0077 2.7863 2.6274 2.5068 2.4117 2.3344 2.2702 2.1692 2.0613 1.9445 1.8817 1.815 1.744 1.667 1.581 1.482

120 5.1523 3.8046 3.2269 2.8943 2.6740 2.5154 2.3948 2.2994 2.2217 2.1570 2.0548 1.9450 1.8249 1.7597 1.690 1.614 1.530 1.433 1.310

inf 5.0239 3.6889 3.1161 2.7858 2.5665 2.4082 2.2875 2.1918 2.1136 2.0483 1.9447 1.8326 1.7085 1.6402 1.566 1.484 1.388 1.268 1.000

F Table for alpha=.01

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 INF

1 4052.181 4999.500 5403.352 5624.583 5763.650 5858.986 5928.356 5981.070 6022.473 6055.847 6106.321 6157.285 6208.730 6234.631 6260.649 6286.782 6313.030 6339.391 6365.864

2 98.503 99.000 99.166 99.249 99.299 99.333 99.356 99.374 99.388 99.399 99.416 99.433 99.449 99.458 99.466 99.474 99.482 99.491 99.499

3 34.116 30.817 29.457 28.710 28.237 27.911 27.672 27.489 27.345 27.229 27.052 26.872 26.690 26.598 26.505 26.411 26.316 26.221 26.125

4 21.198 18.000 16.694 15.977 15.522 15.207 14.976 14.799 14.659 14.546 14.374 14.198 14.020 13.929 13.838 13.745 13.652 13.558 13.463

5 16.258 13.274 12.060 11.392 10.967 10.672 10.456 10.289 10.158 10.051 9.888 9.722 9.553 9.466 9.379 9.291 9.202 9.112 9.020

6 13.745 10.925 9.780 9.148 8.746 8.466 8.260 8.102 7.976 7.874 7.718 7.559 7.396 7.313 7.229 7.143 7.057 6.969 6.880

7 12.246 9.547 8.451 7.847 7.460 7.191 6.993 6.840 6.719 6.620 6.469 6.314 6.155 6.074 5.992 5.908 5.824 5.737 5.650

8 11.259 8.649 7.591 7.006 6.632 6.371 6.178 6.029 5.911 5.814 5.667 5.515 5.359 5.279 5.198 5.116 5.032 4.946 4.859

9 10.561 8.022 6.992 6.422 6.057 5.802 5.613 5.467 5.351 5.257 5.111 4.962 4.808 4.729 4.649 4.567 4.483 4.398 4.311

10 10.044 7.559 6.552 5.994 5.636 5.386 5.200 5.057 4.942 4.849 4.706 4.558 4.405 4.327 4.247 4.165 4.082 3.996 3.909

11 9.646 7.206 6.217 5.668 5.316 5.069 4.886 4.744 4.632 4.539 4.397 4.251 4.099 4.021 3.941 3.860 3.776 3.690 3.602

12 9.330 6.927 5.953 5.412 5.064 4.821 4.640 4.499 4.388 4.296 4.155 4.010 3.858 3.780 3.701 3.619 3.535 3.449 3.361

13 9.074 6.701 5.739 5.205 4.862 4.620 4.441 4.302 4.191 4.100 3.960 3.815 3.665 3.587 3.507 3.425 3.341 3.255 3.165

14 8.862 6.515 5.564 5.035 4.695 4.456 4.278 4.140 4.030 3.939 3.800 3.656 3.505 3.427 3.348 3.266 3.181 3.094 3.004

15 8.683 6.359 5.417 4.893 4.556 4.318 4.142 4.004 3.895 3.805 3.666 3.522 3.372 3.294 3.214 3.132 3.047 2.959 2.868

16 8.531 6.226 5.292 4.773 4.437 4.202 4.026 3.890 3.780 3.691 3.553 3.409 3.259 3.181 3.101 3.018 2.933 2.845 2.753

17 8.400 6.112 5.185 4.669 4.336 4.102 3.927 3.791 3.682 3.593 3.455 3.312 3.162 3.084 3.003 2.920 2.835 2.746 2.653

18 8.285 6.013 5.092 4.579 4.248 4.015 3.841 3.705 3.597 3.508 3.371 3.227 3.077 2.999 2.919 2.835 2.749 2.660 2.566

19 8.185 5.926 5.010 4.500 4.171 3.939 3.765 3.631 3.523 3.434 3.297 3.153 3.003 2.925 2.844 2.761 2.674 2.584 2.489

20 8.096 5.849 4.938 4.431 4.103 3.871 3.699 3.564 3.457 3.368 3.231 3.088 2.938 2.859 2.778 2.695 2.608 2.517 2.421

21 8.017 5.780 4.874 4.369 4.042 3.812 3.640 3.506 3.398 3.310 3.173 3.030 2.880 2.801 2.720 2.636 2.548 2.457 2.360

22 7.945 5.719 4.817 4.313 3.988 3.758 3.587 3.453 3.346 3.258 3.121 2.978 2.827 2.749 2.667 2.583 2.495 2.403 2.305

23 7.881 5.664 4.765 4.264 3.939 3.710 3.539 3.406 3.299 3.211 3.074 2.931 2.781 2.702 2.620 2.535 2.447 2.354 2.256

24 7.823 5.614 4.718 4.218 3.895 3.667 3.496 3.363 3.256 3.168 3.032 2.889 2.738 2.659 2.577 2.492 2.403 2.310 2.211

25 7.770 5.568 4.675 4.177 3.855 3.627 3.457 3.324 3.217 3.129 2.993 2.850 2.699 2.620 2.538 2.453 2.364 2.270 2.169

26 7.721 5.526 4.637 4.140 3.818 3.591 3.421 3.288 3.182 3.094 2.958 2.815 2.664 2.585 2.503 2.417 2.327 2.233 2.131

27 7.677 5.488 4.601 4.106 3.785 3.558 3.388 3.256 3.149 3.062 2.926 2.783 2.632 2.552 2.470 2.384 2.294 2.198 2.097

28 7.636 5.453 4.568 4.074 3.754 3.528 3.358 3.226 3.120 3.032 2.896 2.753 2.602 2.522 2.440 2.354 2.263 2.167 2.064

29 7.598 5.420 4.538 4.045 3.725 3.499 3.330 3.198 3.092 3.005 2.868 2.726 2.574 2.495 2.412 2.325 2.234 2.138 2.034

30 7.562 5.390 4.510 4.018 3.699 3.473 3.304 3.173 3.067 2.979 2.843 2.700 2.549 2.469 2.386 2.299 2.208 2.111 2.006

40 7.314 5.179 4.313 3.828 3.514 3.291 3.124 2.993 2.888 2.801 2.665 2.522 2.369 2.288 2.203 2.114 2.019 1.917 1.805

60 7.077 4.977 4.126 3.649 3.339 3.119 2.953 2.823 2.718 2.632 2.496 2.352 2.198 2.115 2.028 1.936 1.836 1.726 1.601

120 6.851 4.787 3.949 3.480 3.174 2.956 2.792 2.663 2.559 2.472 2.336 2.192 2.035 1.950 1.860 1.763 1.656 1.533 1.381

inf 6.635 4.605 3.782 3.319 3.017 2.802 2.639 2.511 2.407 2.321 2.185 2.039 1.878 1.791 1.696 1.592 1.473 1.325 1.000

© Copyright StatSoft, Inc., 1984-2003

STATISTICA is a trademark of StatSoft, Inc.

Experimental Design (Industrial DOE)

• DOE Overview

o Experiments in Science and Industry

o Differences in techniques

o Overview

o General Ideas

o Computational Problems

o Components of Variance, Denominator Synthesis

o Summary

• 2**(k-p) Fractional Factorial Designs

o Basic Idea

o Generating the Design

o The Concept of Design Resolution

o Plackett-Burman (Hadamard Matrix) Designs for Screening

o Enhancing Design Resolution via Foldover

o Aliases of Interactions: Design Generators

o Blocking

o Replicating the Design

o Adding Center Points

o Analyzing the Results of a 2**(k-p) Experiment

o Graph Options

o Summary

• 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs

o Basic Idea

o Design Criteria

o Summary

• 3**(k-p) , Box-Behnken, and Mixed 2 and 3 Level Factorial Designs

o Overview

o Designing 3**(k-p) Experiments

o An Example 3**(4-1) Design in 9 Blocks

o Box-Behnken Designs

o Analyzing the 3**(k-p) Design

o ANOVA Parameter Estimates

o Graphical Presentation of Results

o Designs for Factors at 2 and 3 Levels

• Central Composite and Non-Factorial Response Surface Designs

o Overview

o Design Considerations

o Alpha for Rotatability and Orthogonality

o Available Standard Designs

o Analyzing Central Composite Designs

o The Fitted Response Surface

o Categorized Response Surfaces

• Latin Square Designs

o Overview

o Latin Square Designs

o Analyzing the Design

o Very Large Designs, Random Effects, Unbalanced Nesting

• Taguchi Methods: Robust Design Experiments

o Overview

o Quality and Loss Functions

o Signal-to-Noise (S/N) Ratios

o Orthogonal Arrays

o Analyzing Designs

o Accumulation Analysis

o Summary

• Mixture designs and triangular surfaces

o Overview

o Triangular Coordinates

o Triangular Surfaces and Contours

o The Canonical Form of Mixture Polynomials

o Common Models for Mixture Data

o Standard Designs for Mixture Experiments

o Lower Constraints

o Upper and Lower Constraints

o Analyzing Mixture Experiments

o Analysis of Variance

o Parameter Estimates

o Pseudo-Components

o Graph Options

• Designs for constrained surfaces and mixtures

o Overview

o Designs for Constrained Experimental Regions

o Linear Constraints

o The Piepel & Snee Algorithm

o Choosing Points for the Experiment

o Analyzing Designs for Constrained Surfaces and Mixtures

• Constructing D- and A-optimal designs

o Overview

o Basic Ideas

o Measuring Design Efficiency

o Constructing Optimal Designs

o General Recommendations

o Avoiding Matrix Singularity

o "Repairing" Designs

o Constrained Experimental Regions and Optimal Design

• Special Topics

o Profiling Predicted Responses and Response Desirability

o Residuals Analysis

o Box-Cox Transformations of Dependent Variables

DOE Overview

Experiments in Science and Industry

Experimental methods are widely used in research as well as in industrial settings, though sometimes for very different purposes. The primary goal in scientific research is usually to show the statistical significance of an effect that a particular factor exerts on the dependent variable of interest (for details concerning the concept of statistical significance see Elementary Concepts).

In industrial settings, the primary goal is usually to extract the maximum amount of unbiased information regarding the factors affecting a production process from as few (costly) observations as possible. While in the former application (in science) analysis of variance (ANOVA) techniques are used to uncover the interactive nature of reality, as manifested in higher-order interactions of factors, in industrial settings interaction effects are often regarded as a "nuisance" (they are often of no interest; they only complicate the process of identifying important factors).

Differences in techniques

These differences in purpose have a profound effect on the techniques that are used in the two settings. If you review a standard ANOVA text for the sciences, for example the classic texts by Winer (1962) or Keppel (1982), you will find that they will primarily discuss designs with up to, perhaps, five factors (designs with more than six factors are usually impractical; see the ANOVA/MANOVA chapter). The focus of these discussions is how to derive valid and robust statistical significance tests. However, if you review standard texts on experimentation in industry (Box, Hunter, and Hunter, 1978; Box and Draper, 1987; Mason, Gunst, and Hess, 1989; Taguchi, 1987) you will find that they will primarily discuss designs with many factors (e.g., 16 or 32) in which interaction effects cannot be evaluated, and the primary focus of the discussion is how to derive unbiased main effect (and, perhaps, two-way interaction) estimates with a minimum number of observations.

This comparison can be expanded further; however, the other differences will become clear as experimental design in industry is described in more detail below. Note that the General Linear Models and ANOVA/MANOVA chapters contain detailed discussions of typical design issues in scientific research; the General Linear Model procedure is a very comprehensive implementation of the general linear model approach to ANOVA/MANOVA (univariate and multivariate ANOVA). There are of course applications in industry where general ANOVA designs, as used in scientific research, can be immensely useful. You may want to read the General Linear Models and ANOVA/MANOVA chapters to gain a more general appreciation of the range of methods encompassed by the term Experimental Design.

Overview

The general ideas and principles on which experimentation in industry is based, and the types of designs used, will be discussed in the following paragraphs, which are meant to be introductory in nature. However, it is assumed that you are familiar with the basic ideas of analysis of variance and the interpretation of main effects and interactions in ANOVA. Otherwise, it is strongly recommended that you read the Introductory Overview section for ANOVA/MANOVA and the General Linear Models chapter.

General Ideas

In general, every machine used in a production process allows its operators to adjust various settings, affecting the resultant quality of the product manufactured by the machine. Experimentation allows the production engineer to adjust the settings of the machine in a systematic manner and to learn which factors have the greatest impact on the resultant quality. Using this information, the settings can be constantly improved until optimum quality is obtained. To illustrate this reasoning, here are a few examples:

Example 1: Dyestuff manufacture. Box and Draper (1987, page 115) report an experiment concerned with the manufacture of certain dyestuff. Quality in this context can be described in terms of a desired (specified) hue and brightness and maximum fabric strength. Moreover, it is important to know what to change in order to produce a different hue and brightness should the consumers' taste change. Put another way, the experimenter would like to identify the factors that affect the brightness, hue, and strength of the final product. In the example described by Box and Draper, there are 6 different factors that are evaluated in a 2**(6-0) design (the 2**(k-p) notation is explained below). The results of the experiment show that the three most important factors determining fabric strength are the Polysulfide index, Time, and Temperature (see Box and Draper, 1987, page 116). One can summarize the expected effect (predicted means) for the variable of interest (i.e., fabric strength in this case) in a so-called cube-plot. This plot shows the expected (predicted) mean fabric strength for the respective low and high settings for each of the three variables (factors).

Example 1.1: Screening designs. In the previous example, 6 different factors were simultaneously evaluated. It is not uncommon that there are very many (e.g., 100) different factors that may potentially be important. Special designs (e.g., Plackett-Burman designs, see Plackett and Burman, 1946) have been developed to screen such large numbers of factors in an efficient manner, that is, with the least number of observations necessary. For example, you can design and analyze an experiment with 127 factors and only 128 runs (observations); still, you will be able to estimate the main effects for each factor, and thus, you can quickly identify which ones are important and most likely to yield improvements in the process under study.

Example 2: 3**3 design. Montgomery (1976, page 204) describes an experiment conducted in order to identify the factors that contribute to the loss of soft drink syrup due to frothing during the filling of five-gallon metal containers. Three factors were considered: (a) the nozzle configuration, (b) the operator of the machine, and (c) the operating pressure. Each factor was set at three different levels, resulting in a complete 3**(3-0) experimental design (the 3**(k-p) notation is explained below).

Moreover, two measurements were taken for each combination of factor settings, that is, the 3**(3-0) design was completely replicated once.

Example 3: Maximizing yield of a chemical reaction. The yield of many chemical reactions is a function of time and temperature. Unfortunately, these two variables often do not affect the resultant yield in a linear fashion. In other words, it is not so that "the longer the time, the greater the yield" and "the higher the temperature, the greater the yield." Rather, both of these variables are usually related in a curvilinear fashion to the resultant yield.

Thus, in this example your goal as experimenter would be to optimize the yield surface that is created by the two variables: time and temperature.

Example 4: Testing the effectiveness of four fuel additives. Latin square designs are useful when the factors of interest are measured at more than two levels, and the nature of the problem suggests some blocking. For example, imagine a study of 4 fuel additives on the reduction in oxides of nitrogen (see Box, Hunter, and Hunter, 1978, page 263). You may have 4 drivers and 4 cars at your disposal. You are not particularly interested in any effects of particular cars or drivers on the resultant oxide reduction; however, you do not want the results for the fuel additives to be biased by the particular driver or car. Latin square designs allow you to estimate the main effects of all factors in the design in an unbiased manner. With regard to the example, the arrangement of treatment levels in a Latin square design assures that the variability among drivers or cars does not affect the estimation of the effect due to different fuel additives.
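The arrangement described in Example 4 can be sketched in a few lines of Python. This is only a minimal illustration, assuming that rows stand for drivers, columns for cars, and the letters A through D for the four additives; the labels and the cyclic construction are illustrative assumptions, not the layout used by Box, Hunter, and Hunter. In practice, the rows, columns, and treatment labels would also be randomized.

```python
def cyclic_latin_square(treatments):
    """Return an n x n Latin square built by cyclically shifting the treatment list."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

additives = ["A", "B", "C", "D"]      # hypothetical labels for the 4 fuel additives
square = cyclic_latin_square(additives)

for driver, row in enumerate(square, start=1):
    print(f"Driver {driver}: {' '.join(row)}")   # columns are cars 1..4
```

Because every additive appears exactly once in each row and once in each column, differences between drivers or between cars are balanced out across the additives.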

Example 5: Improving surface uniformity in the manufacture of polysilicon wafers. The manufacture of reliable microprocessors requires very high consistency in the manufacturing process. Note that in this instance, it is equally, if not more, important to control the variability of certain product characteristics than it is to control the average for a characteristic. For example, with regard to the average surface thickness of the polysilicon layer, the manufacturing process may be perfectly under control; yet, if the variability of the surface thickness on a wafer fluctuates widely, the resultant microchips will not be reliable. Phadke (1989) describes how different characteristics of the manufacturing process (such as deposition temperature, deposition pressure, nitrogen flow, etc.) affect the variability of the polysilicon surface thickness on wafers. However, no theoretical model exists that would allow the engineer to predict how these factors affect the uniformity of wafers. Therefore, systematic experimentation with the factors is required to optimize the process. This is a typical example where Taguchi robust design methods would be applied.

Example 6: Mixture designs. Cornell (1990, page 9) reports an example of a typical (simple) mixture problem. Specifically, a study was conducted to determine the optimum texture of fish patties as a result of the relative proportions of different types of fish (Mullet, Sheepshead, and Croaker) that made up the patties. Unlike in non-mixture experiments, the total sum of the proportions must be equal to a constant, for example, to 100%. The results of such experiments are usually graphically represented in so-called triangular (or ternary) graphs.

In general, the overall constraint -- that the three components must sum to a constant -- is reflected in the triangular shape of the graph (see above).

Example 6.1: Constrained mixture designs. It is particularly common in mixture designs that the relative amounts of components are further constrained (in addition to the constraint that they must sum to, for example, 100%). For example, suppose we wanted to design the best-tasting fruit punch consisting of a mixture of juices from five fruits. Since the resulting mixture is supposed to be a fruit punch, pure blends consisting of the pure juice of only one fruit are necessarily excluded. Additional constraints may be placed on the "universe" of mixtures due to cost constraints or other considerations, so that one particular fruit cannot, for example, account for more than 30% of the mixtures (otherwise the fruit punch would be too expensive, the shelf-life would be compromised, the punch could not be produced in large enough quantities, etc.). Such so-called constrained experimental regions present numerous problems, which, however, can be addressed.

In general, under those conditions, one seeks to design an experiment that can potentially extract the maximum amount of information about the respective response function (e.g., taste of the fruit punch) in the experimental region of interest.

Computational Problems

There are basically two general issues to which Experimental Design is addressed:

1. How to design an optimal experiment, and

2. How to analyze the results of an experiment.

With regard to the first question, there are different considerations that enter into the different types of designs, and they will be discussed shortly. In the most general terms, the goal is always to allow the experimenter to evaluate in an unbiased (or least biased) way, the consequences of changing the settings of a particular factor, that is, regardless of how other factors were set. In more technical terms, you attempt to generate designs where main effects are unconfounded among themselves, and in some cases, even unconfounded with the interaction of factors.

Components of Variance, Denominator Synthesis

There are several statistical methods for analyzing designs with random effects (see Methods for Analysis of Variance). The Variance Components and Mixed Model ANOVA/ANCOVA chapter discusses numerous options for estimating variance components for random effects, and for performing approximate F tests based on synthesized error terms.

Summary

Experimental methods are finding increasing use in manufacturing to optimize the production process. Specifically, the goal of these methods is to identify the optimum settings for the different factors that affect the production process. In the discussion so far, the major classes of designs that are typically used in industrial experimentation have been introduced: 2**(k-p) (two-level, multi-factor) designs, screening designs for large numbers of factors, 3**(k-p) (three-level, multi-factor) designs (mixed designs with 2 and 3 level factors are also supported), central composite (or response surface) designs, Latin square designs, Taguchi robust design analysis, mixture designs, and special procedures for constructing experiments in constrained experimental regions. Interestingly, many of these experimental techniques have "made their way" from the production plant into management, and successful implementations have been reported in profit planning in business, cash-flow optimization in banking, etc. (e.g., see Yokyama and Taguchi, 1975).

These techniques will now be described in greater detail in the following sections:

1. 2**(k-p) Fractional Factorial Designs

2. 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs

3. 3**(k-p) , Box-Behnken, and Mixed 2 and 3 Level Factorial Designs

4. Central Composite and Non-Factorial Response Surface Designs

5. Latin Square Designs

6. Taguchi Methods: Robust Design Experiments

7. Mixture designs and triangular surfaces

8. Designs for constrained surfaces and mixtures

9. Constructing D- and A-optimal designs for surfaces and mixtures

2**(k-p) Fractional Factorial Designs at 2 Levels

Basic Idea

In many cases, it is sufficient to consider the factors affecting the production process at two levels. For example, the temperature for a chemical process may either be set a little higher or a little lower, the amount of solvent in a dyestuff manufacturing process can either be slightly increased or decreased, etc. The experimenter would like to determine whether any of these changes affect the results of the production process. The most intuitive approach to study those factors would be to vary the factors of interest in a full factorial design, that is, to try all possible combinations of settings. This would work fine, except that the number of necessary runs in the experiment (observations) increases geometrically with the number of factors. For example, if you want to study 7 factors, the necessary number of runs in the experiment would be 2**7 = 128. To study 10 factors you would need 2**10 = 1,024 runs in the experiment. Because each run may require time-consuming and costly setting and resetting of machinery, it is often not feasible to require that many different production runs for the experiment. In these conditions, fractional factorials are used that "sacrifice" interaction effects so that main effects may still be computed correctly.
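The growth in the number of runs is easy to verify. The short Python sketch below reproduces the counts quoted above and compares them with a 2**(k-p) fraction; the function names are illustrative only.

```python
def full_factorial_runs(k):
    """Runs needed to try every combination of k two-level factors."""
    return 2 ** k

def fractional_factorial_runs(k, p):
    """Runs in a 2**(k-p) fractional factorial design."""
    return 2 ** (k - p)

for k in (7, 10):
    print(k, "factors, full factorial:", full_factorial_runs(k), "runs")
print("11 factors, 2**(11-7) fraction:", fractional_factorial_runs(11, 7), "runs")
```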

Generating the Design

A technical description of how fractional factorial designs are constructed is beyond the scope of this introduction. Detailed accounts of how to design 2**(k-p) experiments can be found, for example, in Bayne and Rubin (1986), Box and Draper (1987), Box, Hunter, and Hunter (1978), Montgomery (1991), Daniel (1976), Deming and Morgan (1993), Mason, Gunst, and Hess (1989), or Ryan (1989), to name only a few of the many textbooks on this subject. In general, the construction will successively "use" the highest-order interactions to generate new factors. For example, consider the following design that includes 11 factors but requires only 16 runs (observations).

Design: 2**(11-7), Resolution III

Run A B C D E F G H I J K

1 1 1 1 1 1 1 1 1 1 1 1

2 1 1 1 -1 1 -1 -1 -1 -1 1 1

3 1 1 -1 1 -1 -1 -1 1 -1 1 -1

4 1 1 -1 -1 -1 1 1 -1 1 1 -1

5 1 -1 1 1 -1 -1 1 -1 -1 -1 1

6 1 -1 1 -1 -1 1 -1 1 1 -1 1

7 1 -1 -1 1 1 1 -1 -1 1 -1 -1

8 1 -1 -1 -1 1 -1 1 1 -1 -1 -1

9 -1 1 1 1 -1 1 -1 -1 -1 -1 -1

10 -1 1 1 -1 -1 -1 1 1 1 -1 -1

11 -1 1 -1 1 1 -1 1 -1 1 -1 1

12 -1 1 -1 -1 1 1 -1 1 -1 -1 1

13 -1 -1 1 1 1 -1 -1 1 1 1 -1

14 -1 -1 1 -1 1 1 1 -1 -1 1 -1

15 -1 -1 -1 1 -1 1 1 1 -1 1 1

16 -1 -1 -1 -1 -1 -1 -1 -1 1 1 1

Reading the design. The design displayed above should be interpreted as follows. Each column contains +1's or -1's to indicate the setting of the respective factor (high or low, respectively). So for example, in the first run of the experiment, set all factors A through K to the plus setting (e.g., a little higher than before); in the second run, set factors A, B, and C to the positive setting, factor D to the negative setting, and so on. Note that there are numerous options provided to display (and save) the design using notation other than ±1 to denote factor settings. For example, you may use actual values of factors (e.g., 90 degrees Celsius and 100 degrees Celsius) or text labels (Low temperature, High temperature).

Randomizing the runs. Because many other things may change from production run to production run, it is always a good practice to randomize the order in which the systematic runs of the designs are performed.
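A minimal Python sketch of such a randomization is shown below. It assumes the runs are identified by their standard run numbers (1 through 16 for the design above); the function name and the seed value are arbitrary illustrative choices.

```python
import random

def randomized_run_order(n_runs, seed=None):
    """Return the standard run numbers 1..n_runs in random order."""
    rng = random.Random(seed)     # a fixed seed makes the order reproducible
    order = list(range(1, n_runs + 1))
    rng.shuffle(order)
    return order

run_order = randomized_run_order(16, seed=2024)
print(run_order)
```

Recording the seed (or the resulting order) alongside the results makes it possible to reconstruct later which standard run was performed at which point in time.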

The Concept of Design Resolution

The design above is described as a 2**(11-7) design of resolution III (three). This means that you study overall k = 11 factors (the first number in parentheses); however, p = 7 of those factors (the second number in parentheses) were generated from the interactions of a full 2**[(11-7) = 4] factorial design. As a result, the design does not give full resolution; that is, there are certain interaction effects that are confounded with (identical to) other effects. In general, a design of resolution R is one where no l-way interactions are confounded with any other interaction of order less than R-l. In the current example, R is equal to 3. Here, no l = 1-way interactions (i.e., main effects) are confounded with any other interaction of order less than R-l = 3-1 = 2. Thus, main effects in this design are confounded with two-way interactions; and consequently, all higher-order interactions are equally confounded. If you had included 64 runs, and generated a 2**(11-5) design, the resultant resolution would have been R = IV (four). You would have concluded that no l = 1-way interaction (main effect) is confounded with any other interaction of order less than R-l = 4-1 = 3. In this design then, main effects are not confounded with two-way interactions, but only with three-way interactions. What about the two-way interactions? No l = 2-way interaction is confounded with any other interaction of order less than R-l = 4-2 = 2. Thus, the two-way interactions in that design are confounded with each other.
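An equivalent and commonly used way to state the resolution is as the length of the shortest "word" in the design's defining relation. The Python sketch below computes it under that assumption, using the generators that are listed for the 2**(11-7) design in the section on Aliases of Interactions (factor 5 = 123, factor 6 = 234, and so on). The function names are hypothetical, and the approach enumerates all products of generator words, so it is only practical for small p.

```python
from itertools import combinations

def defining_words(generators):
    """All words of the defining relation.

    `generators` maps each generated factor to the factors whose
    interaction it is equated with, e.g. {5: (1, 2, 3)} for 5 = 123.
    """
    base = [frozenset(parents) | {new} for new, parents in generators.items()]
    words = set()
    for size in range(1, len(base) + 1):
        for subset in combinations(base, size):
            word = frozenset()
            for w in subset:
                word = word ^ w          # squared factors cancel (x*x = +1)
            words.add(word)
    return words

def resolution(generators):
    """Resolution = length of the shortest word in the defining relation."""
    return min(len(w) for w in defining_words(generators))

gens_11_7 = {5: (1, 2, 3), 6: (2, 3, 4), 7: (1, 3, 4), 8: (1, 2, 4),
             9: (1, 2, 3, 4), 10: (1, 2), 11: (1, 3)}
print(resolution(gens_11_7))
```

For these generators the shortest words have three letters (for example, 1 2 10), which corresponds to the resolution III stated above.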

Plackett-Burman (Hadamard Matrix) Designs for Screening

When one needs to screen a large number of factors to identify those that may be important (i.e., those that are related to the dependent variable of interest), one would like to employ a design that allows one to test the largest number of factor main effects with the least number of observations, that is, to construct a resolution III design with as few runs as possible. One way to design such experiments is to confound all interactions with "new" main effects. Such designs are also sometimes called saturated designs, because all information in those designs is used to estimate the parameters, leaving no degrees of freedom to estimate the error term for the ANOVA. Because the added factors are created by equating (aliasing, see below) the "new" factors with the interactions of a full factorial design, these designs always will have 2**k runs (e.g., 4, 8, 16, 32, and so on). Plackett and Burman (1946) showed how full factorial designs can be fractionalized in a different manner, to yield saturated designs where the number of runs is a multiple of 4, rather than a power of 2. These designs are also sometimes called Hadamard matrix designs. Of course, you do not have to use all available factors in those designs, and, in fact, sometimes you want to generate a saturated design for one more factor than you are expecting to test. This will allow you to estimate the random error variability, and test for the statistical significance of the parameter estimates.
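As an illustration of a design whose number of runs is a multiple of 4 but not a power of 2, the following Python sketch builds a 12-run design for 11 factors by cyclically shifting a generator row and appending a row of -1's. The generator row used here is the one commonly tabulated for the 12-run Plackett-Burman design; the function names are hypothetical, and the sketch is not meant as a general Plackett-Burman generator.

```python
# Generator row commonly tabulated for the 12-run Plackett-Burman design.
PB12_GENERATOR = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]

def plackett_burman_12():
    """Build the 12-run, 11-factor design: 11 cyclic shifts plus a row of -1's."""
    n = len(PB12_GENERATOR)
    rows = [[PB12_GENERATOR[(col - shift) % n] for col in range(n)]
            for shift in range(n)]
    rows.append([-1] * n)
    return rows

def column_dot(design, i, j):
    """Dot product of two design columns (0 means the columns are orthogonal)."""
    return sum(row[i] * row[j] for row in design)

design = plackett_burman_12()
for row in design:
    print(" ".join(f"{v:+d}" for v in row))
```

All pairs of columns in this matrix are orthogonal, so the 11 main effects can be estimated independently of each other from only 12 runs.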

Enhancing Design Resolution via Foldover

One way in which a resolution III design can be enhanced and turned into a resolution IV design is via foldover (e.g., see Box and Draper, 1987; Deming and Morgan, 1993). Suppose you have a 7-factor design in 8 runs:

Design: 2**(7-4) design

Run A B C D E F G

1 1 1 1 1 1 1 1

2 1 1 -1 1 -1 -1 -1

3 1 -1 1 -1 1 -1 -1

4 1 -1 -1 -1 -1 1 1

5 -1 1 1 -1 -1 1 -1

6 -1 1 -1 -1 1 -1 1

7 -1 -1 1 1 -1 -1 1

8 -1 -1 -1 1 1 1 -1

This is a resolution III design, that is, the two-way interactions will be confounded with the main effects. You can turn this design into a resolution IV design via the Foldover (enhance resolution) option. The foldover method copies the entire design and appends it to the end, reversing all signs:

Design: 2**(7-4) design (+Foldover)

Run A B C D E F G H (new)

1 1 1 1 1 1 1 1 1

2 1 1 -1 1 -1 -1 -1 1

3 1 -1 1 -1 1 -1 -1 1

4 1 -1 -1 -1 -1 1 1 1

5 -1 1 1 -1 -1 1 -1 1

6 -1 1 -1 -1 1 -1 1 1

7 -1 -1 1 1 -1 -1 1 1

8 -1 -1 -1 1 1 1 -1 1

9 -1 -1 -1 -1 -1 -1 -1 -1

10 -1 -1 1 -1 1 1 1 -1

11 -1 1 -1 1 -1 1 1 -1

12 -1 1 1 1 1 -1 -1 -1

13 1 -1 -1 1 1 -1 1 -1

14 1 -1 1 1 -1 1 -1 -1

15 1 1 -1 -1 1 1 -1 -1

16 1 1 1 -1 -1 -1 1 -1

Thus, run number 1 was 1, 1, 1, 1, 1, 1, 1; the new run number 9 (the first run of the "folded-over" portion) has all signs reversed: -1, -1, -1, -1, -1, -1, -1. Likewise, run number 8 (-1, -1, -1, 1, 1, 1, -1) reappears as the new run number 16 with all signs reversed: 1, 1, 1, -1, -1, -1, 1. In addition to enhancing the resolution of the design, we also have gained an eighth factor (factor H), which contains all +1's for the first eight runs, and -1's for the folded-over portion of the new design. Note that the resultant design is actually a 2**(8-4) design of resolution IV (see also Box and Draper, 1987, page 160).
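The effect of the foldover can be checked numerically. The Python sketch below rebuilds the 8-run design, assuming (as the columns of the table indicate) that D, E, F, and G equal the AB, AC, BC, and ABC interactions, folds it over, and tests whether any main-effect column coincides with a two-way interaction column. The function names are hypothetical.

```python
from itertools import combinations

# The 8-run 2**(7-4) design: A, B, C form a full factorial; D=AB, E=AC, F=BC, G=ABC.
base = []
for a in (1, -1):
    for b in (1, -1):
        for c in (1, -1):
            base.append([a, b, c, a * b, a * c, b * c, a * b * c])

def foldover(design):
    """Append a sign-reversed copy of every run and add an indicator factor."""
    original = [row + [1] for row in design]                 # new factor = +1
    mirrored = [[-v for v in row] + [-1] for row in design]  # new factor = -1
    return original + mirrored

def main_effects_clear_of_two_way(design):
    """True if every main-effect column is orthogonal to every two-way interaction column."""
    k = len(design[0])
    for m in range(k):
        for i, j in combinations(range(k), 2):
            if m in (i, j):
                continue
            if sum(r[m] * r[i] * r[j] for r in design) != 0:
                return False
    return True

folded = foldover(base)
print(main_effects_clear_of_two_way(base))    # resolution III: expect False
print(main_effects_clear_of_two_way(folded))  # resolution IV: expect True
```

The check succeeds after the foldover because reversing all signs flips the sign of every product of three columns, so each such product sums to zero over the combined 16 runs.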

Aliases of Interactions: Design Generators

To return to the example of the resolution R = III design, now that you know that main effects are confounded with two-way interactions, you may ask the question, "Which interaction is confounded with which main effect?"

Fractional Design Generators, 2**(11-7) design (Factors are denoted by numbers)

Factor Alias

5 123

6 234

7 134

8 124

9 1234

10 12

11 13

Design generators. The design generators shown above are the "key" to how factors 5 through 11 were generated by assigning them to particular interactions of the first 4 factors of the full factorial 2**4 design. Specifically, factor 5 is identical to the 123 (factor 1 by factor 2 by factor 3) interaction. Factor 6 is identical to the 234 interaction, and so on. Remember that the design is of resolution III (three), and you expect some main effects to be confounded with some two-way interactions; indeed, factor 10 (ten) is identical to the 12 (factor 1 by factor 2) interaction, and factor 11 (eleven) is identical to the 13 (factor 1 by factor 3) interaction. Another way in which these equivalencies are often expressed is by saying that the main effect for factor 10 (ten) is an alias for the interaction of 1 by 2. (The term alias was first used by Finney, 1945).
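To make the role of the generators concrete, the following Python sketch rebuilds the 16-run design from the full 2**4 factorial and the generators listed above. It is a minimal illustration with hypothetical function names; the run order (last base factor changing fastest) is chosen to match the design table shown earlier.

```python
from itertools import product
from math import prod

# Generators as listed above: new factor -> interaction of the first four factors.
GENERATORS = {5: (1, 2, 3), 6: (2, 3, 4), 7: (1, 3, 4), 8: (1, 2, 4),
              9: (1, 2, 3, 4), 10: (1, 2), 11: (1, 3)}

def fractional_design(n_base, generators):
    """Full 2**n_base factorial in coded +1/-1 units, extended by generated factors."""
    runs = []
    for levels in product((1, -1), repeat=n_base):
        row = list(levels)
        for new_factor in sorted(generators):
            parents = generators[new_factor]
            row.append(prod(levels[f - 1] for f in parents))
        runs.append(row)
    return runs

design = fractional_design(4, GENERATORS)
for number, row in enumerate(design, start=1):
    print(number, *row)
```

Each generated column is simply the element-wise product of the columns named in its generator, which is why the new factor and that interaction cannot be told apart in the analysis.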

To summarize, whenever you want to include fewer observations (runs) in your experiment than would be required by the full factorial 2**k design, you "sacrifice" interaction effects and assign them to the levels of factors. The resulting design is no longer a full factorial but a fractional factorial.

The fundamental identity. Another way to summarize the design generators is in a simple equation. Namely, if, for example, factor 5 in a fractional factorial design is identical to the 123 (factor 1 by factor 2 by factor 3) interaction, then it follows that multiplying the coded values for the 123 interaction by the coded values for factor 5 will always result in +1 (if all factor levels are coded ±1); or:

I = 1235

where I stands for +1 (using the standard notation as, for example, found in Box and Draper, 1987). Thus, we also know that factor 1 is confounded with the 235 interaction, factor 2 with the 135 interaction, and factor 3 with the 125 interaction, because, in each instance their product must be equal to 1. The confounding of two-way interactions is also defined by this equation, because the 12 interaction multiplied by the 35 interaction must yield 1, and hence, they are identical or confounded. Therefore, one can summarize all confounding in a design with such a fundamental identity equation.
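The multiplication rule behind these aliases can be sketched in Python. The snippet assumes effects are written as strings of single-digit factor labels, as in the identity I = 1235; the function name is hypothetical.

```python
def multiply(effect_a, effect_b):
    """Product of two effects written as digit strings, e.g. '12' * '35'.

    Any factor appearing twice cancels, because a +1/-1 column times itself is +1.
    Assumes single-digit factor labels (1 through 9).
    """
    remaining = set(effect_a) ^ set(effect_b)
    return "".join(sorted(remaining)) or "I"

IDENTITY_WORD = "1235"   # the identity I = 1235 from the text

for effect in ("1", "2", "3", "12"):
    print(effect, "is aliased with", multiply(effect, IDENTITY_WORD))
```

Multiplying any effect by the identity word yields the effect with which it is confounded: 1 with 235, 2 with 135, 3 with 125, and 12 with 35.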

Blocking

In some production processes, units are produced in natural "chunks" or blocks. You want to make sure that these blocks do not bias your estimates of main effects. For example, you may have a kiln to produce special ceramics, but the size of the kiln is limited so that you cannot produce all runs of your experiment at once. In that case you need to break up the experiment into blocks. However, you do not want to run positive settings of all factors in one block, and all negative settings in the other. Otherwise, any incidental differences between blocks would systematically affect all estimates of the main effects of the factors of interest. Rather, you want to distribute the runs over the blocks so that any differences between blocks (i.e., the blocking factor) do not bias your results for the factor effects of interest. This is accomplished by treating the blocking factor as another factor in the design. Consequently, you "lose" another interaction effect to the blocking factor, and the resultant design will be of lower resolution. However, these designs often have the advantage of being statistically more powerful, because they allow you to estimate and control the variability in the production process that is due to differences between blocks.
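As a small illustration, the Python sketch below splits a full 2**3 design into two blocks of four runs by assigning runs according to the sign of the three-way interaction, so that it is this interaction, rather than any main effect, that is confounded with the block difference. The choice of a 2**3 design, of two blocks, and the function name are assumptions made for the example.

```python
from itertools import product

def blocked_2k_design(k):
    """Split a full 2**k factorial into two blocks using the highest-order interaction."""
    blocks = {1: [], 2: []}
    for levels in product((1, -1), repeat=k):
        sign = 1
        for v in levels:
            sign *= v                      # sign of the k-way interaction for this run
        blocks[1 if sign == 1 else 2].append(levels)
    return blocks

blocks = blocked_2k_design(3)
for block, runs in blocks.items():
    print("Block", block, runs)
```

Within each block every factor appears equally often at its high and low setting, so a constant shift between blocks cancels out of the main-effect estimates.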

Replicating the Design

It is sometimes desirable to replicate the design, that is, to run each combination of factor levels in the design more than once. This will allow you to later estimate the so-called pure error in the experiment. The analysis of experiments is further discussed below; however, it should be clear that, when replicating the design, one can compute the variability of measurements within each unique combination of factor levels. This variability will give an indication of the random error in the measurements (e.g., due to uncontrolled factors, unreliability of the measurement instrument, etc.), because the replicated observations are taken under identical conditions (settings of factor levels). Such an estimate of the pure error can be used to evaluate the size and statistical significance of the variability that can be attributed to the manipulated factors.
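The computation of pure error can be sketched as follows. The Python snippet pools the within-combination sums of squares over all replicated factor-level combinations; the measurements are hypothetical values for a replicated 2**2 design and are not taken from any of the cited experiments.

```python
from statistics import mean

def pure_error(replicates):
    """Pooled within-cell sum of squares, degrees of freedom, and mean square.

    `replicates` maps each factor-level combination to its repeated measurements.
    """
    ss = 0.0
    df = 0
    for values in replicates.values():
        cell_mean = mean(values)
        ss += sum((y - cell_mean) ** 2 for y in values)
        df += len(values) - 1
    return ss, df, ss / df

# Hypothetical measurements for a replicated 2**2 design (not data from the text).
data = {(-1, -1): [10.0, 12.0], (1, -1): [15.0, 17.0],
        (-1, 1): [11.0, 11.0], (1, 1): [20.0, 24.0]}

ss, df, ms = pure_error(data)
print(f"SS pure error = {ss}, df = {df}, MS pure error = {ms}")
```

The resulting mean square serves as the error term against which the variability attributed to the manipulated factors is compared.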

Partial replications. When it is not possible or feasible to replicate each unique combination of factor levels (i.e., the full design), one can still gain an estimate of pure error by replicating only some of the runs in the experiment. However, one must be careful to consider the possible bias that may be introduced by selectively replicating only some runs. If one only replicates those runs that are most easily repeated (e.g., gathers information at the points where it is "cheapest"), one may inadvertently choose only those combinations of factor levels that happen to produce very little (or very much) random variability -- causing one to underestimate (or overestimate) the true amount of pure error. Thus, one should carefully consider, typically based on one's knowledge about the process that is being studied, which runs should be replicated, that is, which runs will yield a good (unbiased) estimate of pure error.
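The computation of pure error from replicated runs can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: the function name pure_error and the small data set are hypothetical, and cells without replicates are simply skipped, which is how a partially replicated design would be handled under this sketch.

```python
from statistics import mean

def pure_error(cells):
    """Pool within-cell sums of squares over all replicated runs.

    cells maps a combination of factor levels to the list of measurements
    taken at that combination. Returns (SS, df, MS) for pure error.
    """
    ss = 0.0
    df = 0
    for ys in cells.values():
        if len(ys) < 2:
            continue  # an unreplicated cell carries no pure-error information
        m = mean(ys)
        ss += sum((y - m) ** 2 for y in ys)
        df += len(ys) - 1
    return ss, df, ss / df

# Hypothetical data: a 2**2 design with every run replicated twice
cells = {
    (-1, -1): [10.0, 12.0],
    (+1, -1): [14.0, 15.0],
    (-1, +1): [11.0, 11.0],
    (+1, +1): [18.0, 16.0],
}
ss_pe, df_pe, ms_pe = pure_error(cells)
print(ss_pe, df_pe, ms_pe)
```

The resulting mean square is the quantity against which the variability attributed to the manipulated factors is evaluated.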

Adding Center Points

Designs with factors that are set at two levels implicitly assume that the effect of the factors on the dependent variable of interest (e.g., fabric Strength) is linear. It is impossible to test whether or not there is a non-linear (e.g., quadratic) component in the relationship between a factor A and a dependent variable, if A is only evaluated at two points (i.e., at the low and high settings). If one suspects that the relationship between the factors in the design and the dependent variable is curvilinear, then one should include one or more runs where all (continuous) factors are set at their midpoint. Such runs are called center-point runs (or center points), since they are, in a sense, in the center of the design (see graph).

Later in the analysis (see below), one can compare the measurements for the dependent variable at the center point with the average for the rest of the design. This provides a check for curvature (see Box and Draper, 1987): If the mean for the dependent variable at the center of the design is significantly different from the overall mean at all other points of the design, then one has good reason to believe that the simple assumption that the factors are linearly related to the dependent variable does not hold.
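The curvature check described above can be sketched as a simple comparison of two means. The sketch below is only one way to set this up: the function name curvature_check and the data are hypothetical, and the error variance is assumed to be estimated from the replicated center points.

```python
from math import sqrt
from statistics import mean, variance

def curvature_check(factorial_y, center_y, ms_error):
    """Compare the center-point mean with the mean of the factorial runs.

    Returns (difference, t), where t is the difference divided by its
    standard error, using ms_error as the estimate of error variance.
    """
    nf, nc = len(factorial_y), len(center_y)
    diff = mean(center_y) - mean(factorial_y)
    se = sqrt(ms_error * (1.0 / nf + 1.0 / nc))
    return diff, diff / se

# Hypothetical measurements: four factorial runs and four center-point runs
factorial_y = [10.0, 14.0, 11.0, 17.0]
center_y = [15.0, 16.0, 15.5, 15.5]

# Assumption: error variance is taken from the replicated center points
ms_error = variance(center_y)
diff, t = curvature_check(factorial_y, center_y, ms_error)
print(diff, t)
```

A large absolute t value (relative to the t distribution with the degrees of freedom of the error estimate) would suggest that the linearity assumption is questionable.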

Analyzing the Results of a 2**(k-p) Experiment

Analysis of variance. Next, one needs to determine exactly which of the factors significantly affected the dependent variable of interest. For example, in the study reported by Box and Draper (1987, page 115), it is desired to learn which of the factors involved in the manufacture of dyestuffs affected the strength of the fabric. In this example, factors 1 (Polysulfide), 4 (Time), and 6 (Temperature) significantly affected the strength of the fabric. Note that to simplify matters, only main effects are shown below.

ANOVA; Var.:STRENGTH; R-sqr = .60614; Adj:.56469 (fabrico.sta)
2**(6-0) design; MS Residual = 3.62509
DV: STRENGTH

                      SS   df         MS          F         p
(1)POLYSUFD      48.8252    1    48.8252   13.46867   .000536
(2)REFLUX         7.9102    1     7.9102    2.18206   .145132
(3)MOLES           .1702    1      .1702     .04694   .829252
(4)TIME         142.5039    1   142.5039   39.31044   .000000
(5)SOLVENT        2.7639    1     2.7639     .76244   .386230
(6)TEMPERTR     115.8314    1   115.8314   31.95269   .000001
Error           206.6302   57     3.6251
Total SS        524.6348   63

Pure error and lack of fit. If the experimental design is at least partially replicated, then one can estimate the error variability for the experiment from the variability of the replicated runs. Since those measurements were taken under identical conditions, that is, at identical settings of the factor levels, the estimate of the error variability from those runs is independent of whether or not the "true" model is linear or non-linear in nature, or includes higher-order interactions. The error variability so estimated represents pure error, that is, it is entirely due to unreliabilities in the measurement of the dependent variable. If available, one can use the estimate of pure error to test the significance of the residual variance, that is, all remaining variability that cannot be accounted for by the factors and their interactions that are currently in the model. If, in fact, the residual variability is significantly larger than the pure error variability, then one can conclude that there is still some statistically significant variability left that is attributable to differences between the groups, and hence, that there is an overall lack of fit of the current model.

ANOVA; Var.:STRENGTH; R-sqr = .58547; Adj:.56475 (fabrico.sta)
2**(3-0) design; MS Pure Error = 3.594844
DV: STRENGTH

                      SS   df         MS          F         p
(1)POLYSUFD      48.8252    1    48.8252   13.58200   .000517
(2)TIME         142.5039    1   142.5039   39.64120   .000000
(3)TEMPERTR     115.8314    1   115.8314   32.22154   .000001
Lack of Fit      16.1631    4     4.0408    1.12405   .354464
Pure Error      201.3113   56     3.5948
Total SS        524.6348   63

For example, the table above shows the results for the three factors that were previously identified as most important in their effect on fabric strength; all other factors were ignored in the analysis. As you can see in the row with the label Lack of Fit, when the residual variability for this model (i.e., after removing the three main effects) is compared against the pure error estimated from the within-group variability, the resulting F test is not statistically significant. Therefore, this result additionally supports the conclusion that, indeed, factors Polysulfide, Time, and Temperature significantly affected resultant fabric strength in an additive manner (i.e., there are no interactions). Or, put another way, all differences between the means obtained in the different experimental conditions can be sufficiently explained by the simple additive model for those three variables.
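The lack-of-fit F ratio in the table can be reproduced by hand from the sums of squares and degrees of freedom reported there. The short sketch below does only this arithmetic; the variable names are chosen for illustration.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table
ss_lack_of_fit, df_lack_of_fit = 16.1631, 4
ss_pure_error, df_pure_error = 201.3113, 56

# Mean squares are sums of squares divided by their degrees of freedom
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error

# The lack-of-fit test compares residual (lack-of-fit) variability
# against the pure error variability
f_lack_of_fit = ms_lack_of_fit / ms_pure_error
print(round(ms_lack_of_fit, 4), round(ms_pure_error, 4), round(f_lack_of_fit, 5))
```

The ratio agrees with the F value of 1.12405 shown in the Lack of Fit row, up to rounding of the reported sums of squares.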

Parameter or effect estimates. Now, look at how these factors affected the strength of the fabrics.

                  Effect   Std.Err.     t (57)         p
Mean/Interc.    11.12344    .237996   46.73794   .000000
(1)POLYSUFD      1.74688    .475992    3.66997   .000536
(2)REFLUX         .70313    .475992    1.47718   .145132
(3)MOLES          .10313    .475992     .21665   .829252
(4)TIME          2.98438    .475992    6.26980   .000000
(5)SOLVENT       -.41562    .475992    -.87318   .386230
(6)TEMPERTR      2.69062    .475992    5.65267   .000001

The numbers above are the effect or parameter estimates. With the exception of the overall Mean/Intercept, these estimates are the deviations of the mean of the negative settings from the mean of the positive settings for the respective factor. For example, if you change the setting of factor Time from low to high, then you can expect an improvement in Strength by 2.98; if you set the value for factor Polysulfd to its high setting, you can expect a further improvement by 1.75, and so on.

As you can see, the same three factors that were statistically significant show the largest parameter estimates; thus the settings of these three factors were most important for the resultant strength of the fabric.
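The definition of a main effect (mean at the positive settings minus mean at the negative settings) can be illustrated with a few lines of code. In the sketch below, the function main_effect and the four-run data set are hypothetical. The expression 2*sqrt(MS Residual / N) for the standard error of an effect is the usual one for a balanced two-level design; it is not quoted in the text above and is included here as an assumption, although it does reproduce the tabled standard error.

```python
from math import sqrt
from statistics import mean

def main_effect(levels, y):
    """Mean response at the +1 setting minus mean response at the -1 setting."""
    high = [yi for li, yi in zip(levels, y) if li == +1]
    low = [yi for li, yi in zip(levels, y) if li == -1]
    return mean(high) - mean(low)

# Hypothetical 2**2 experiment with coded factor settings
a = [-1, +1, -1, +1]
b = [-1, -1, +1, +1]
y = [10.0, 14.0, 11.0, 17.0]
effect_a = main_effect(a, y)  # (14 + 17)/2 - (10 + 11)/2
effect_b = main_effect(b, y)  # (11 + 17)/2 - (10 + 14)/2

# Check against the fabric example: MS Residual = 3.62509, N = 64 runs
se_effect = 2 * sqrt(3.62509 / 64)
t_time = 2.98438 / se_effect  # effect of Time divided by its standard error
print(effect_a, effect_b, round(se_effect, 6), round(t_time, 4))
```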

For analyses including interactions, the interpretation of the effect parameters is a bit more complicated. Specifically, the two-way interaction parameters are defined as half the difference between the main effects of one factor at the two levels of a second factor (see Mason, Gunst, and Hess, 1989, page 127); likewise, the three-way interaction parameters are defined as half the difference between the two-factor interaction effects at the two levels of a third factor, and so on.

Regression coefficients. One can also look at the parameters in the multiple regression model (see Multiple Regression). To continue this example, consider the following prediction equation:

Strength = const + b1 *x1 +... + b6 *x6

Here x1 through x6 stand for the 6 factors in the analysis. The effect estimates output shown earlier also contains these parameter estimates:

                  Coeff.   Std.Err.      -95.%      +95.%
                             Coeff.   Cnf.Limt   Cnf.Limt
Mean/Interc.    11.12344    .237996   10.64686   11.60002
(1)POLYSUFD       .87344    .237996     .39686    1.35002
(2)REFLUX         .35156    .237996    -.12502     .82814
(3)MOLES          .05156    .237996    -.42502     .52814
(4)TIME          1.49219    .237996    1.01561    1.96877
(5)SOLVENT       -.20781    .237996    -.68439     .26877
(6)TEMPERTR      1.34531    .237996     .86873    1.82189

Actually, these parameters contain little "new" information, as they simply are one-half of the parameter values (except for the Mean/Intercept) shown earlier. This makes sense since now, the coefficient can be interpreted as the deviation of the high-setting for the respective factors from the center. However, note that this is only the case if the factor values (i.e., their levels) are coded as -1 and +1, respectively. Otherwise, the scaling of the factor values will affect the magnitude of the parameter estimates. In the example data reported by Box and Draper (1987, page 115), the settings or values for the different factors were recorded on very different scales:

data file: FABRICO.STA [ 64 cases with 9 variables ]
2**(6-0) Design, Box & Draper, p. 117

       POLYSUFD  REFLUX  MOLES  TIME  SOLVENT  TEMPERTR  STRENGTH   HUE  BRIGTHNS
  1       6        150    1.8    24     30       120        3.4    15.0    36.0
  2       7        150    1.8    24     30       120        9.7     5.0    35.0
  3       6        170    1.8    24     30       120        7.4    23.0    37.0
  4       7        170    1.8    24     30       120       10.6     8.0    34.0
  5       6        150    2.4    24     30       120        6.5    20.0    30.0
  6       7        150    2.4    24     30       120        7.9     9.0    32.0
  7       6        170    2.4    24     30       120       10.3    13.0    28.0
  8       7        170    2.4    24     30       120        9.5     5.0    38.0
  9       6        150    1.8    36     30       120       14.3    23.0    40.0
 10       7        150    1.8    36     30       120       10.5     1.0    32.0
 11       6        170    1.8    36     30       120        7.8    11.0    32.0
 12       7        170    1.8    36     30       120       17.2     5.0    28.0
 13       6        150    2.4    36     30       120        9.4    15.0    34.0
 14       7        150    2.4    36     30       120       12.1     8.0    26.0
 15       6        170    2.4    36     30       120        9.5    15.0    30.0
. . .   . . .     . . .  . . .  . . .  . . .    . . .      . . .   . . .   . . .

Shown below are the regression coefficient estimates based on the uncoded original factor values:

                Regressn   Std.Err.     t (57)         p
                  Coeff.
Mean/Interc.    -46.0641   8.109341   -5.68037   .000000
(1)POLYSUFD       1.7469    .475992    3.66997   .000536
(2)REFLUX          .0352    .023800    1.47718   .145132
(3)MOLES           .1719    .793320     .21665   .829252
(4)TIME            .2487    .039666    6.26980   .000000
(5)SOLVENT        -.0346    .039666    -.87318   .386230
(6)TEMPERTR        .2691    .047599    5.65267   .000001

Because the metric for the different factors is no longer comparable, the magnitudes of the regression coefficients are not comparable either. This is why it is usually more informative to look at the ANOVA parameter estimates (for the coded values of the factor levels), as shown before. However, the regression coefficients can be useful when one wants to make predictions for the dependent variable, based on the original metric of the factors.
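The relation between the three sets of estimates (effects, coefficients for coded factors, and coefficients for the original metric) can be verified numerically. The sketch below uses the effect estimates reported earlier and the low and high settings visible in the data listing; only the four factors whose two settings both appear in the listed cases are included. The dictionary names are chosen for illustration.

```python
# Effect estimates (coded -1/+1 factors) as reported in the effects table
effects = {"POLYSUFD": 1.74688, "REFLUX": 0.70313, "MOLES": 0.10313, "TIME": 2.98438}

# Low and high settings in the original metric, read from the data listing
settings = {"POLYSUFD": (6, 7), "REFLUX": (150, 170), "MOLES": (1.8, 2.4), "TIME": (24, 36)}

# Coefficient for a coded factor: one-half of the effect
coded = {name: effect / 2 for name, effect in effects.items()}

# Coefficient in the original metric: the coded coefficient divided by the
# half-range, i.e., by the number of original units per coded unit
original = {}
for name, (low, high) in settings.items():
    half_range = (high - low) / 2
    original[name] = coded[name] / half_range

for name in effects:
    print(name, round(coded[name], 5), round(original[name], 4))
```

Up to rounding, the results agree with the tabled coefficients for the coded factors (.87344, .35156, .05156, 1.49219) and for the original metric (1.7469, .0352, .1719, .2487).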

Graph Options

Diagnostic plots of residuals. To start with, before accepting a particular "model" that includes a particular number of effects (e.g., main effects for Polysulfide, Time, and Temperature in the current example), one should always examine the distribution of the residual values. These are computed as the difference between the predicted values (as predicted by the current model) and the observed values. You can compute the histogram for these residual values, as well as probability plots (as shown below).

The parameter estimates and ANOVA table are based on the assumption that the residuals are normally distributed (see also Elementary Concepts). The histogram provides one way to check (visually) whether this assumption holds. The so-called normal probability plot is another common tool to assess how closely a set of observed values (residuals in this case) follows a theoretical distribution. In this plot the actual residual values are plotted along the horizontal X-axis; the vertical Y-axis shows the expected normal values for the respective values, after they were rank-ordered. If all values fall onto a straight line, then one can be satisfied that the residuals follow the normal distribution.
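The coordinates of a normal probability plot can be computed as sketched below. This is an illustration under assumptions: the function name and the residual values are hypothetical, and the plotting position (i - 0.5)/n used to obtain the expected normal values is only one of several conventions in use; a particular program may use a different one.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (residual, expected normal value) pairs for a probability plot.

    Residuals are rank-ordered; the expected normal value for rank i of n is
    the standard normal quantile at (i - 0.5) / n (an assumed convention).
    """
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(residuals)
    return [(r, std_normal.inv_cdf((i - 0.5) / n))
            for i, r in enumerate(ordered, start=1)]

# Hypothetical residuals
points = normal_plot_points([0.3, -1.2, 0.1, 0.9, -0.4])
for residual, expected in points:
    print(residual, round(expected, 3))
```

Plotting the first element of each pair on the X-axis and the second on the Y-axis gives the plot described above; points close to a straight line are consistent with normally distributed residuals.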

Pareto chart of effects. The Pareto chart of effects is often an effective tool for communicating the results of an experiment, in particular to laymen.

In this graph, the ANOVA effect estimates are sorted from the largest absolute value to the smallest absolute value. The magnitude of each effect is represented by a column, and often, a line going across the columns indicates how large an effect has to be (i.e., how long a column must be) to be statistically significant.
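The ordering used in the Pareto chart is simply a sort by absolute value. A minimal sketch, using the six effect estimates from the fabric example reported earlier (the variable names are chosen for illustration):

```python
# Effect estimates from the 2**(6-0) fabric example
effects = {
    "POLYSUFD": 1.74688,
    "REFLUX": 0.70313,
    "MOLES": 0.10313,
    "TIME": 2.98438,
    "SOLVENT": -0.41562,
    "TEMPERTR": 2.69062,
}

# Sort from the largest to the smallest absolute value
ranked = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)

# Crude text rendering: one '#' per 0.1 units of absolute effect
for name, value in ranked:
    print(f"{name:<10}{value:>9.5f}  " + "#" * round(abs(value) * 10))
```

The three statistically significant factors (Time, Temperature, Polysulfide) appear at the top of the ranking.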

Normal probability plot of effects. Another useful, albeit more technical summary graph, is the normal probability plot of the estimates. As in the normal probability plot of the residuals, first the effect estimates are rank ordered, and then a normal z score is computed based on the assumption that the estimates are normally distributed. This z score is plotted on the Y-axis; the observed estimates are plotted on the X-axis (as shown below).

Square and cube plots. These plots are often used to summarize predicted values for the dependent variable, given the respective high and low setting of the factors. The square plot (see below) will show the predicted values (and, optionally, their confidence intervals) for two factors at a time. The cube plot will show the predicted values (and, optionally, confidence intervals) for three factors at a time.

Interaction plots. A general graph for showing the means is the standard interaction plot, where the means are indicated by points connected by lines. This plot (see below) is particularly useful when there are significant interaction effects in the model.

Surface and contour plots. When the factors in the design are continuous in nature, it is often also useful to look at surface and contour plots of the dependent variable as a function of the factors.

These types of plots will further be discussed later in this section, in the context of 3**(k-p), and central composite and response surface designs.

Summary

2**(k-p) designs are the "workhorse" of industrial experiments. The impact of a large number of factors on the production process can simultaneously be assessed with relative efficiency (i.e., with few experimental runs). The logic of these types of experiments is straightforward (each factor has only two settings).

Disadvantages. The simplicity of these designs is also their major flaw. As mentioned before, underlying the use of two-level factors is the belief that the resultant changes in the dependent variable (e.g., fabric strength) are basically linear in nature. This is often not the case, and many variables are related to quality characteristics in a non-linear fashion. In the example above, if you were to continuously increase the temperature factor (which was significantly related to fabric strength), you would of course eventually hit a "peak," and from there on the fabric strength would decrease as the temperature increases. While this type of curvature in the relationship between the factors in the design and the dependent variable can be detected if the design includes center point runs, one cannot fit explicit nonlinear (e.g., quadratic) models with 2**(k-p) designs (however, central composite designs will do exactly that).

Another problem of fractional designs is the implicit assumption that higher-order interactions do not matter. Sometimes they do; for example, when some other factors are set to a particular level, temperature may be negatively related to fabric strength. In fractional factorial designs, higher-order interactions (greater than two-way) in particular will escape detection.


2**(k-p) Maximally Unconfounded and Minimum Aberration Designs

Basic Idea

2**(k-p) fractional factorial designs are often used in industrial experimentation because of the economy of data collection that they provide. For example, suppose an engineer needed to investigate the effects of varying 11 factors, each with 2 levels, on a manufacturing process. Let us call the number of factors k, which would be 11 for this example. An experiment using a full factorial design, where the effects of every combination of levels of each factor are studied, would require 2**(k) experimental runs, or 2048 runs for this example. To minimize the data collection effort, the engineer might decide to forgo investigation of higher-order interaction effects of the 11 factors, and focus instead on identifying the main effects of the 11 factors and any low-order interaction effects that could be estimated from an experiment using a smaller, more reasonable number of experimental runs. There is another, more theoretical reason for not conducting huge, full factorial 2 level experiments. In general, it is not logical to be concerned with identifying higher-order interaction effects of the experimental factors, while ignoring lower-order nonlinear effects, such as quadratic or cubic effects, which cannot be estimated if only 2 levels of each factor are employed. So although practical considerations often lead to the need to design experiments with a reasonably small number of experimental runs, there is a logical justification for such experiments.

The alternative to the 2**(k) full factorial design is the 2**(k-p) fractional factorial design, which requires only a "fraction" of the data collection effort required for full factorial designs. For our example with k=11 factors, if only 64 experimental runs can be conducted, a 2**(11-5) fractional factorial experiment would be designed with 2**6 = 64 experimental runs. In essence, a k-p = 6 way full factorial experiment is designed, with the levels of the p factors being "generated" by the levels of selected higher order interactions of the other 6 factors. Fractional factorials "sacrifice" higher order interaction effects so that lower order effects may still be computed correctly. However, different criteria can be used in choosing the higher order interactions to be used as generators, with different criteria sometimes leading to different "best" designs.
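The run counts quoted in this example follow directly from the design notation. A small sketch (the function name runs_required is hypothetical):

```python
def runs_required(k, p=0):
    """Number of runs in a 2**(k-p) design; p = 0 gives the full factorial."""
    return 2 ** (k - p)

full_runs = runs_required(11)           # full 2**11 factorial
fraction_runs = runs_required(11, 5)    # 2**(11-5) fractional factorial
fraction_size = fraction_runs / full_runs

print(full_runs, fraction_runs, fraction_size)
```

With k = 11 and p = 5, the fractional design uses 1/32 of the runs of the full factorial.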

2**(k-p) fractional factorial designs can also include blocking factors. In some production processes, units are produced in natural "chunks" or blocks. To make sure that these blocks do not bias your estimates of the effects for the k factors, blocking factors can be added as additional factors in the design. Consequently, you may "sacrifice" additional interaction effects to generate the blocking factors, but these designs often have the advantage of being statistically more powerful, because they allow you to estimate and control the variability in the production process that is due to differences between blocks.

Design Criteria

Many of the concepts discussed in this overview are also addressed in the Overview of 2**(k-p) Fractional factorial designs. However, a technical description of how fractional factorial designs are constructed is beyond the scope of either introductory overview. Detailed accounts of how to design 2**(k-p) experiments can be found, for example, in Bayne and Rubin (1986), Box and Draper (1987), Box, Hunter, and Hunter (1978), Montgomery (1991), Daniel (1976), Deming and Morgan (1993), Mason, Gunst, and Hess (1989), or Ryan (1989), to name only a few of the many text books on this subject.

In general, the 2**(k-p) maximally unconfounded and minimum aberration designs techniques will successively select which higher-order interactions to use as generators for the p factors. For example, consider the following design that includes 11 factors but requires only 16 runs (observations).

Design: 2**(11-7), Resolution III

Run    A    B    C    D    E    F    G    H    I    J    K
  1    1    1    1    1    1    1    1    1    1    1    1
  2    1    1    1   -1    1   -1   -1   -1   -1    1    1
  3    1    1   -1    1   -1   -1   -1    1   -1    1   -1
  4    1    1   -1   -1   -1    1    1   -1    1    1   -1
  5    1   -1    1    1   -1   -1    1   -1   -1   -1    1
  6    1   -1    1   -1   -1    1   -1    1    1   -1    1
  7    1   -1   -1    1    1    1   -1   -1    1   -1   -1
  8    1   -1   -1   -1    1   -1    1    1   -1   -1   -1
  9   -1    1    1    1   -1    1   -1   -1   -1   -1   -1
 10   -1    1    1   -1   -1   -1    1    1    1   -1   -1
 11   -1    1   -1    1    1   -1    1   -1    1   -1    1
 12   -1    1   -1   -1    1    1   -1    1   -1   -1    1
 13   -1   -1    1    1    1   -1   -1    1    1    1   -1
 14   -1   -1    1   -1    1    1    1   -1   -1    1   -1
 15   -1   -1   -1    1   -1    1    1    1   -1    1    1
 16   -1   -1   -1   -1   -1   -1   -1   -1    1    1    1

Interpreting the design. The design displayed in the Scrollsheet above should be interpreted as follows. Each column contains +1's or -1's to indicate the setting of the respective factor (high or low, respectively). So for example, in the first run of the experiment, all factors A through K are set to the higher level, and in the second run, factors A, B, and C are set to the higher level, but factor D is set to the lower level, and so on. Notice that the settings for each experimental run for factor E can be produced by multiplying the respective settings for factors A, B, and C. The A x B x C interaction effect therefore cannot be estimated independently of the factor E effect in this design because these two effects are confounded. Likewise, the settings for factor F can be produced by multiplying the respective settings for factors B, C, and D. We say that ABC and BCD are the generators for factors E and F, respectively.
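The way generators produce the additional columns can be sketched in a few lines. Only the generators for E (ABC) and F (BCD) are stated in the text; the generators assumed below for G through K (ACD, ABD, ABCD, AB, AC) were inferred by comparing products of columns A through D with the tabled columns, and should be read as an assumption of this sketch. The names BASE, GENERATORS, and build_design are hypothetical.

```python
from itertools import product
from math import prod

BASE = "ABCD"  # factors of the underlying full 2**4 factorial
GENERATORS = {
    "E": "ABC",   # stated in the text
    "F": "BCD",   # stated in the text
    "G": "ACD",   # inferred from the table (assumption)
    "H": "ABD",   # inferred from the table (assumption)
    "I": "ABCD",  # inferred from the table (assumption)
    "J": "AB",    # inferred from the table (assumption)
    "K": "AC",    # inferred from the table (assumption)
}

def build_design():
    """Return the 16 runs as dicts mapping factor name to +1 or -1."""
    runs = []
    # A varies slowest and D fastest, starting from the high (+1) setting
    for settings in product((1, -1), repeat=len(BASE)):
        run = dict(zip(BASE, settings))
        for factor, word in GENERATORS.items():
            # The generated factor is the product of its generator's columns
            run[factor] = prod(run[letter] for letter in word)
        runs.append(run)
    return runs

design = build_design()
for number, run in enumerate(design, start=1):
    print(number, [run[f] for f in "ABCDEFGHIJK"])
```

Under these assumptions the printed rows coincide with the 16 runs of the design shown above.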

The maximum resolution design criterion. In the Scrollsheet shown above, the design is described as a 2**(11-7) design of resolution III (three). This means that you study overall k = 11 factors, but p = 7 of those factors were generated from the interactions of a full 2**[(11-7) = 4] factorial design. As a result, the design does not give full resolution; that is, there are certain interaction effects that are confounded with (identical to) other effects. In general, a design of resolution R is one where no l-way interactions are confounded with any other interaction of order less than R - l. In the current example, R is equal to 3. Here, no l = 1-way interactions (i.e., main effects) are confounded with any other interaction of order less than R - l = 3 -1 = 2. Thus, main effects in this design are unconfounded with each other, but are confounded with two-factor interactions; and consequently, with other higher-order interactions. One obvious, but nevertheless very important overall design criterion is that the higher-order interactions to be used as generators should be chosen such that the resolution of the design is as high as possible.
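A common equivalent way to state the resolution is as the length of the shortest "word" in the defining relation of the design. The sketch below computes this for the 2**(11-7) example. It rests on assumptions: the generators for factors G through K are not given in the text and were inferred from the design table, and the function names are hypothetical. Each defining word is a generator multiplied by the factor it generates (e.g., E = ABC gives the word ABCE); multiplying two words cancels every letter that appears in both.

```python
from itertools import combinations

# Defining words: generator times generated factor.
# ABCE and BCDF follow from the text; the remaining words are inferred
# from the design table and are an assumption of this sketch.
WORDS = ["ABCE", "BCDF", "ACDG", "ABDH", "ABCDI", "ABJ", "ACK"]

def multiply(words):
    """Multiply words; a letter occurring an even number of times cancels."""
    result = frozenset()
    for word in words:
        result = result ^ frozenset(word)  # symmetric difference
    return result

def resolution(words):
    """Length of the shortest word among all products of the given words."""
    lengths = []
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            lengths.append(len(multiply(subset)))
    return min(lengths)

design_resolution = resolution(WORDS)
print(design_resolution)
```

Here the shortest words (such as ABJ) have three letters, which corresponds to resolution III: main effects are confounded with two-factor interactions.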

The maximum unconfounding design criterion. Maximizing the resolution of a design, however, does not by itself ensure that the selected generators produce the "best" design. Consider, for example, two different resolution IV designs. In both designs, main effects would be unconfounded with each other and 2-factor interactions would be unconfounded with main effects, i.e., no l = 2-way interactions are confounded with any other interaction of order less than R - l = 4 - 2 = 2. The two designs might be different, however, with regard to the degree of confounding for the 2-factor interactions. For resolution IV designs, the "crucial order," in which confounding of effects first appears, is for 2-factor interactions. In one design, none of the "crucial order," 2-factor interactions might be unconfounded with all other 2-factor interactions, while in the other design, virtually all of the 2-factor interactions might be unconfounded with all of the other 2-factor interactions. The second "almost resolution V" design would be preferable to the first "just barely resolution IV" design. This suggests that even though the maximum resolution design criterion should be the primary criterion, a subsidiary criterion might be that generators should be chosen such that the maximum number of interactions of less than or equal to the crucial order, given the resolution, are unconfounded with all other interactions of the crucial order. This is called the maximum unconfounding design criterion, and is one of the optional, subsidiary design criteria to use in a search for a 2**(k-p) design.

The minimum aberration design criterion. The minimum aberration design criterion is another optional, subsidiary criterion to use in a search for a 2**(k-p) design. In some respects, this criterion is similar to the maximum unconfounding design criterion. Technically, the minimum aberration design is defined as the design of maximum resolution "which minimizes the number of words in the defining relation that are of minimum length" (Fries & Hunter, 1980). Less technically, the criterion apparently operates by choosing generators that produce the smallest number of pairs of confounded interactions of the crucial order. For example, the minimum aberration resolution IV design would have the minimum number of pairs of confounded 2-factor interactions.

To illustrate the difference between the maximum unconfounding and minimum aberration criteria, consider the maximally unconfounded 2**(9-4) design and the minimum aberration 2**(9-4) design, as for example, listed in Box, Hunter, and Hunter (1978). If you compare these two designs, you will find that in the maximally unconfounded design, 15 of the 36 2-factor interactions are unconfounded with any other 2-factor interactions, while in the minimum aberration design, only 8 of the 36 2-factor interactions are unconfounded with any other 2-factor interactions. The minimum aberration design, however, produces 18 pairs of confounded interactions, while the maximally unconfounded design produces 21 pairs of confounded interactions. So, the two criteria lead to the selection of generators producing different "best" designs.

Fortunately, the choice of whether to use the maximum unconfounding criterion or the minimum aberration criterion makes no difference in the design which is selected (except for, perhaps, relabeling of the factors) when there are 11 or fewer factors, with the single exception of the 2**(9-4) design described above (see Chen, Sun, & Wu, 1993). For designs with more than 11 factors, the two criteria can lead to the selection of very different designs, and for lack of better advice, we suggest using both criteria, comparing the designs that are produced, and choosing the design that best suits your needs. We will add, editorially, that maximizing the number of totally unconfounded effects often makes more sense than minimizing the number of pairs of confounded effects.

Summary

2**(k-p) fractional factorial designs are probably the most frequently used type of design in industrial experimentation. Things to consider in designing any 2**(k-p) fractional factorial experiment include the number of factors to be investigated, the number of experimental runs, and whether there will be blocks of experimental runs. Beyond these basic considerations, one should also take into account whether the number of runs will allow a design of the required resolution and degree of confounding for the crucial order of interactions, given the resolution.


3**(k-p), Box-Behnken, and Mixed 2 and 3 Level Factorial Designs

Overview

In some cases, factors that have more than 2 levels have to be examined. For example, if one suspects that the effect of the factors on the dependent variable of interest is not simply linear, then, as discussed earlier (see 2**(k-p) designs), one needs at least 3 levels in order to test for the linear and quadratic effects (and interactions) for those factors. Also, sometimes some factors may be categorical in nature, with more than 2 categories. For example, you may have three different machines that produce a particular part.

Designing 3**(k-p) Experiments

The general mechanism of generating fractional factorial designs at 3 levels (3**(k-p) designs) is very similar to that described in the context of 2**(k-p) designs. Specifically, one starts with a full factorial design, and then uses the interactions of the full design to construct "new" factors (or blocks) by making their factor levels identical to those for the respective interaction terms (i.e., by making the new factors aliases of the respective interactions).

For example, consider the following simple 3**(3-1) factorial design:

3**(3-1) fractional factorial design, 1 block, 9 runs

Standard
Run         A    B    C
  1         0    0    0
  2         0    1    2
  3         0    2    1
  4         1    0    2
  5         1    1    1
  6         1    2    0
  7         2    0    1
  8         2    1    0
  9         2    2    2

As in the case of 2**(k-p) designs, the design is constructed by starting with the full 3-1=2 factorial design; those factors are listed in the first two columns (factors A and B). Factor C is constructed from the interaction AB of the first two factors. Specifically, the values for factor C are computed as

C = mod3 (3 - mod3 (A+B))

Here, mod3(x) stands for the so-called modulo-3 operator, which will first find the largest number y that is less than or equal to x and that is evenly divisible by 3, and then compute the difference (remainder) between x and y. For example, mod3(0) is equal to 0, mod3(1) is equal to 1, mod3(3) is equal to 0, mod3(5) is equal to 2 (3 is the largest number that is less than or equal to 5, and that is evenly divisible by 3; finally, 5-3=2), and so on.

Fundamental identity. If you apply this function to the sum of columns A and B shown above, you will obtain the third column C. Similar to the case of 2**(k-p) designs (see 2**(k-p) designs for a discussion of the fundamental identity in the context of 2**(k-p) designs), this confounding of interactions with "new" main effects can be summarized in an expression:

0 = mod3 (A+B+C)

If you look back at the 3**(3-1) design shown earlier, you will see that, indeed, if you add the numbers in the three columns they will all sum to either 0, 3, or 6, that is, values that are evenly divisible by 3 (and hence: mod3(A+B+C)=0). Thus, one could write as a shortcut notation ABC=0, in order to summarize the confounding of factors in the fractional 3**(k-p) design.
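The construction of the 3**(3-1) design and its fundamental identity can be checked with a short sketch. The function name design_3_3_1 is hypothetical. Note that the result of 3 - mod3(A+B) is itself reduced modulo 3 in the code, so that a value of 3 is mapped to factor level 0, as the C column of the design table requires.

```python
from itertools import product

def design_3_3_1():
    """Return the 9 runs (A, B, C) of the 3**(3-1) design."""
    runs = []
    # Full 3**2 factorial in A and B; A varies slowest
    for a, b in product(range(3), repeat=2):
        # C is generated from the AB interaction; the outer % 3 maps 3 to 0
        c = (3 - (a + b) % 3) % 3
        runs.append((a, b, c))
    return runs

runs = design_3_3_1()
for number, (a, b, c) in enumerate(runs, start=1):
    # Fundamental identity: mod3(A+B+C) is 0 for every run
    print(number, a, b, c, (a + b + c) % 3)
```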

Some of the designs will have fundamental identities that contain the number 2 as a multiplier; e.g.,

0 = mod3 (B+C*2+D+E*2+F)

This notation can be interpreted exactly as before, that is, the modulo3 of the sum B+2*C+D+2*E+F must be equal to 0. The next example shows such an identity.

An Example 3**(4-1) Design in 9 Blocks

Here is the summary for a 4-factor 3-level fractional factorial design in 9 blocks, that requires only 27 runs.

SUMMARY: 3**(4-1) fractional factorial

Design generators: ABCD

Block generators: AB,AC2

Number of factors (independent variables): 4

Number of runs (cases, experiments): 27

Number of blocks: 9

This design will allow you to test for linear and quadratic main effects for 4 factors in 27 observations, which can be gathered in 9 blocks of 3 observations each. The fundamental identity or design generator for the design is ABCD, thus the modulo3 of the sum of the factor levels across the four factors is equal to 0. The fundamental identity also allows you to determine the confounding of factors and interactions in the design (see McLean and Anderson, 1984, for details).
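A design with these properties can be generated as sketched below. The sketch assumes that the design generator ABCD means mod3(A+B+C+D) = 0 and that the block generators AB and AC2 mean that a run's block is identified by the pair of values mod3(A+B) and mod3(A+2*C); how a particular program numbers or orders the blocks is not specified here, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product

def design_3_4_1_blocked():
    """Return a dict mapping block label to its runs (A, B, C, D)."""
    blocks = defaultdict(list)
    # Full 3**3 factorial in A, B, and C
    for a, b, c in product(range(3), repeat=3):
        # Design generator ABCD: choose D so that mod3(A+B+C+D) = 0
        d = (-(a + b + c)) % 3
        # Block generators AB and AC2 (assumed interpretation)
        label = ((a + b) % 3, (a + 2 * c) % 3)
        blocks[label].append((a, b, c, d))
    return blocks

blocks = design_3_4_1_blocked()
for label in sorted(blocks):
    print(label, blocks[label])
```

Under these assumptions the 27 runs fall into 9 blocks of 3 runs each, in agreement with the design summary.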

Unconfounded Effects (experi3.sta)
EXPERIM. DESIGN: List of uncorrelated factors and interactions
3**(4-1) fractional factorial design, 9 blocks, 27 runs

      Unconf. Effects    Unconfounded if
      (excl. blocks)     blocks included?
  1   (1)A (L)           Yes
  2      A (Q)           Yes
  3   (2)B (L)           Yes
  4      B (Q)           Yes
  5   (3)C (L)           Yes
  6      C (Q)           Yes
  7   (4)D (L)           Yes
  8      D (Q)           Yes

As you can see, in this 3**(4-1) design the main effects are not confounded with each other, even when the experiment is run in 9 blocks.

Box-Behnken Designs

In the case of 2**(k-p) designs, Plackett and Burman (1946) developed highly fractionalized designs to screen the maximum number of (main) effects in the fewest experimental runs. The equivalent in the case of 3**(k-p) designs are the so-called Box-Behnken designs (Box and Behnken, 1960; see also Box and Draper, 1984). These designs do not have simple design generators (they are constructed by combining two-level factorial designs with incomplete block designs), and have complex confounding of interactions. However, the designs are economical and therefore particularly useful when it is expensive to perform the necessary experimental runs.

Analyzing the 3**(k-p) Design

The analysis of these types of designs proceeds basically in the same way as was described in the context of 2**(k-p) designs. However, for each effect, one can now test for the linear effect and the quadratic (non-linear) effect. For example, when studying the yield of a chemical process, temperature may be related to yield in a non-linear fashion; that is, the maximum yield may be attained when the temperature is set at the medium level. Thus, non-linearity often occurs when a process performs near its optimum.

ANOVA Parameter Estimates

To estimate the ANOVA parameters, the factor levels for the factors in the analysis are internally recoded so that one can test the linear and quadratic components in the relationship between the factors and the dependent variable. Thus, regardless of the original metric of factor settings (e.g., 100 degrees C, 110 degrees C, 120 degrees C), you can always recode those values to -1, 0, and +1 to perform the computations. The resultant ANOVA parameter estimates can be interpreted analogously to the parameter estimates for 2**(k-p) designs.
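As a rough illustration of this recoding, the following sketch maps three equally spaced settings to -1, 0, and +1 and then builds linear and quadratic contrast columns. The contrast weights (-1, 0, 1) and (1, -2, 1) are one common choice of orthogonal polynomial coding and are an assumption here; the text does not specify the internal coding, and the sign of the quadratic column is a matter of convention.

```python
# Hypothetical factor settings in their original metric (degrees C).
settings = [100, 110, 120, 100, 110, 120]

low, high = min(settings), max(settings)
center = (low + high) / 2
half_range = (high - low) / 2

# Recode the original values to -1, 0, +1.
coded = [(s - center) / half_range for s in settings]

# Assumed orthogonal polynomial contrasts for 3 equally spaced levels.
linear_weight = {-1.0: -1, 0.0: 0, 1.0: 1}
quadratic_weight = {-1.0: 1, 0.0: -2, 1.0: 1}

linear_column = [linear_weight[x] for x in coded]
quadratic_column = [quadratic_weight[x] for x in coded]

# The two columns are orthogonal: their products sum to zero.
cross_product = sum(l * q for l, q in zip(linear_column, quadratic_column))

print(coded)
print(linear_column)
print(quadratic_column)
print(cross_product)
```

Because the two columns are orthogonal, the linear and quadratic components can be tested independently of each other in a balanced design.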

For example, consider the following ANOVA results:

Factor            Effect    Std.Err.     t(69)         p
Mean/Interc.    103.6942     .390591  265.4805  0.000000
BLOCKS(1)          .8028    1.360542     .5901   .557055
BLOCKS(2)        -1.2307    1.291511    -.9529   .343952
(1)TEMPERAT (L)   -.3245     .977778    -.3319   .740991
TEMPERAT (Q)      -.5111     .809946    -.6311   .530091
(2)TIME (L)        .0017     .977778     .0018   .998589
TIME (Q)           .0045     .809946     .0056   .995541
(3)SPEED (L)    -10.3073     .977778  -10.5415   .000000
SPEED (Q)        -3.7915     .809946   -4.6812   .000014
1L by 2L          3.9256    1.540235    2.5487   .013041
1L by 2Q           .4384    1.371941     .3195   .750297
1Q by 2L           .4747    1.371941     .3460   .730403
1Q by 2Q         -2.7499     .995575   -2.7621   .007353

Main-effect estimates. By default, the Effect estimate for the linear effects (marked by the L next to the factor name) can be interpreted as the difference between the average response at the low and high settings for the respective factors. The estimate for the quadratic (non-linear) effect (marked by the Q next to the factor name) can be interpreted as the difference between the average response at the center (medium) settings and the combined high and low settings for the respective factors.
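The two interpretations can be made concrete with a small sketch. The response values below are made up for illustration, and the sign convention for the linear effect (high minus low) is an assumption, since the text only speaks of "the difference".

```python
from statistics import mean

# Hypothetical responses grouped by the coded level of one factor.
response = {
    -1: [90.0, 92.0, 91.0],    # low setting
    0: [99.0, 101.0, 100.0],   # center (medium) setting
    1: [95.0, 97.0, 96.0],     # high setting
}

low, mid, high = (mean(response[level]) for level in (-1, 0, 1))

# Linear effect: difference between the high and low averages
# (assumed sign convention: high minus low).
linear_effect = high - low

# Quadratic effect: center average minus the average of the
# high and low observations combined.
quadratic_effect = mid - mean(response[-1] + response[1])

print(linear_effect)
print(quadratic_effect)
```

Here the center setting gives a higher average than the two extremes combined, which is the pattern one would expect when the process performs near its optimum at the medium level.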

Interaction effect estimates. As in the case of 2**(k-p) designs, the linear-by-linear interaction effect can be interpreted as half the difference between the linear main effect of one factor at the high and low settings of another. Analogously, the interactions by the quadratic components can be interpreted as half the difference between the quadratic main effect of one factor at the respective settings of another; that is, either the high or low setting (quadratic by linear interaction), or the medium or high and low settings combined (quadratic by quadratic interaction).

In practice, and from the standpoint of "interpretability of results," one would usually try to avoid quadratic interactions. For example, a quadratic-by-quadratic A-by-B interaction indicates that the non-linear effect of factor A is modified in a non-linear fashion by the setting of B. This means that there is a fairly complex interaction between factors present in the data that will make it difficult to understand and optimize the respective process. Sometimes, performing non-linear transformations (e.g., performing a log transformation) of the dependent variable values can remedy the problem.

Centered and non-centered polynomials. As mentioned above, the interpretation of the effect estimates applies only when you use the default parameterization of the model. In that case, you would code the quadratic factor interactions so that they become maximally "untangled" from the linear main effects.

Graphical Presentation of Results

The same diagnostic plots (e.g., of residuals) are available for 3**(k-p) designs as were described in the context of 2**(k-p) designs. Thus, before interpreting the final results, one should always first look at the distribution of the residuals for the final fitted model. The ANOVA assumes that the residuals (errors) are normally distributed.

Plot of means. When an interaction involves categorical factors (e.g., type of machine, specific operator of machine, and some distinct setting of the machine), then the best way to understand interactions is to look at the respective interaction plot of means.

Surface plot. When the factors in an interaction are continuous in nature, you may want to look at the surface plot that shows the response surface implied by the fitted model. Note that this graph also contains the prediction equation (in terms of the original metric of factors) that produces the respective response surface.

Designs for Factors at 2 and 3 Levels

You can also generate standard designs with 2 and 3 level factors. Specifically, you can generate the standard designs as enumerated by Connor and Young for the US National Bureau of Standards (see McLean and Anderson, 1984). The technical details of the method used to generate these designs are beyond the scope of this introduction. However, in general the technique is, in a sense, a combination of the procedures described in the context of 2**(k-p) and 3**(k-p) designs. It should be noted however, that, while all of these designs are very efficient, they are not necessarily orthogonal with respect to all main effects. This is, however, not a problem, if one uses a general algorithm for estimating the ANOVA parameters and sums of squares, that does not require orthogonality of the design.

The design and analysis of these experiments proceeds along the same lines as discussed in the context of 2**(k-p) and 3**(k-p) experiments.


Central Composite and Non-Factorial Response Surface Designs

Overview

The 2**(k-p) and 3**(k-p) designs all require that the levels of the factors are set at, for example, 2 or 3 levels. In many instances, such designs are not feasible, because, for example, some factor combinations are constrained in some way (e.g., factors A and B cannot be set at their high levels simultaneously). Also, for reasons related to efficiency, which will be discussed shortly, it is often desirable to explore the experimental region of interest at particular points that cannot be represented by a factorial design.

The designs (and how to analyze them) discussed in this section all pertain to the estimation (fitting) of response surfaces, following the general model equation:

$$y = b_0 + b_1 x_1 + \dots + b_k x_k + b_{12} x_1 x_2 + b_{13} x_1 x_3 + \dots + b_{k-1,k} x_{k-1} x_k + b_{11} x_1^2 + \dots + b_{kk} x_k^2$$

Put into words, one is fitting a model to the observed values of the dependent variable y, that include (1) main effects for factors x1 , ..., xk, (2) their interactions (x1*x2, x1*x3, ... ,xk-1*xk), and (3) their quadratic components (x1**2, ..., xk**2). No assumptions are made concerning the "levels" of the factors, and you can analyze any set of continuous values for the factors.

There are some considerations concerning design efficiency and biases, which have led to standard designs that are ordinarily used when attempting to fit these response surfaces, and those standard designs will be discussed shortly (e.g., see Box, Hunter, and Hunter, 1978; Box and Draper, 1987; Khuri and Cornell, 1987; Mason, Gunst, and Hess, 1989; Montgomery, 1991). But, as will be discussed later, in the context of constrained surface designs and D- and A-optimal designs, these standard designs can sometimes not be used for practical reasons. However, the central composite design analysis options do not make any assumptions about the structure of your data file, that is, the number of distinct factor values, or their combinations across the runs of the experiment, and, hence, these options can be used to analyze any type of design, to fit to the data the general model described above.

Design Considerations

Orthogonal designs. One desirable characteristic of any design is that the main effect and interaction estimates of interest are independent of each other. For example, suppose you had a two-factor experiment, with both factors at two levels. Your design consists of four runs:

          A     B
Run 1     1     1
Run 2     1     1
Run 3    -1    -1
Run 4    -1    -1

For the first two runs, both factors A and B are set at their high levels (+1). In the last two runs, both are set at their low levels (-1). Suppose you wanted to estimate the independent contributions of factors A and B to the prediction of the dependent variable of interest. Clearly this is a silly design, because there is no way to estimate the A main effect and the B main effect. One can only estimate one effect -- the difference between Runs 1+2 vs. Runs 3+4 -- which represents the combined effect of A and B.

The point here is that, in order to assess the independent contributions of the two factors, the factor levels in the four runs must be set so that the "columns" in the design (under A and B in the illustration above) are independent of each other. Another way to express this requirement is to say that the columns of the design matrix (with as many columns as there are main effect and interaction parameters that one wants to estimate) should be orthogonal (this term was first used by Yates, 1933). For example, if the four runs in the design are arranged as follows:

          A     B
Run 1     1     1
Run 2     1    -1
Run 3    -1     1
Run 4    -1    -1

then the A and B columns are orthogonal. Now you can estimate the A main effect by comparing the high level for A within each level of B, with the low level for A within each level of B; the B main effect can be estimated in the same way.

Technically, two columns in a design matrix are orthogonal if the sum of the products of their elements within each row is equal to zero. In practice, one often encounters situations, for example due to loss of some data in some runs or other constraints, where the columns of the design matrix are not completely orthogonal. In general, the rule here is that the more orthogonal the columns are, the better the design, that is, the more independent information can be extracted from the design regarding the respective effects of interest. Therefore, one consideration for choosing standard central composite designs is to find designs that are orthogonal or near-orthogonal.
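The orthogonality check described here is easy to carry out by hand or in code. The following minimal sketch applies it to the two four-run designs discussed above; the helper name dot is only illustrative.

```python
def dot(u, v):
    """Sum of the row-wise products of two design columns."""
    return sum(a * b for a, b in zip(u, v))

# First design from the text: A and B always change together.
A1, B1 = [1, 1, -1, -1], [1, 1, -1, -1]

# Second design: every combination of levels appears exactly once.
A2, B2 = [1, 1, -1, -1], [1, -1, 1, -1]

confounded = dot(A1, B1)
orthogonal = dot(A2, B2)

print(confounded)  # non-zero: the columns are not orthogonal
print(orthogonal)  # zero: the columns are orthogonal
```

For the first design the sum of products is 4 (the columns are identical, so the two effects are completely confounded); for the second it is 0.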

Rotatable designs. The second consideration is related to the first requirement, in that it also has to do with how best to extract the maximum amount of (unbiased) information from the design, or specifically, from the experimental region of interest. Without going into details (see Box, Hunter, and Hunter, 1978; Box and Draper, 1987, Chapter 14; see also Deming and Morgan, 1993, Chapter 13), it can be shown that the standard error for the prediction of dependent variable values is proportional to:

$$\left(1 + f(x)' \, (X'X)^{-1} \, f(x)\right)^{1/2}$$

where f(x) stands for the (coded) factor effects for the respective model (f(x) is a vector, f(x)' is the transpose of that vector), and X is the design matrix for the experiment, that is, the matrix of coded factor effects for all runs; (X'X)**-1 is the inverse of the crossproduct matrix. Deming and Morgan (1993) refer to this expression as the normalized uncertainty; this function is also related to the variance function as defined by Box and Draper (1987). The amount of uncertainty in the prediction of dependent variable values depends on the variability of the design points, and their covariance over the runs. (Note that it is inversely proportional to the determinant of X'X; this issue is further discussed in the section on D- and A-optimal designs).

The point here is that, again, one would like to choose a design that extracts the most information regarding the dependent variable, and leaves the least amount of uncertainty for the prediction of future values. It follows, that the amount of information (or normalized information according to Deming and Morgan, 1993) is the inverse of the normalized uncertainty.

For the simple 4-run orthogonal experiment shown earlier, the information function is equal to

Ix = 4/(1 + x1² + x2²)

where x1 and x2 stand for the factor settings for factors A and B, respectively (see Box and Draper, 1987).

Inspection of this function in a plot (see above) shows that it is constant on circles centered at the origin. Thus any kind of rotation of the original design points will generate the same amount of information, that is, generate the same information function. Therefore, the 2-by-2 orthogonal design in 4 runs shown earlier is said to be rotatable.
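The following sketch evaluates the information function numerically for the 4-run orthogonal design. It assumes the first-order model with terms (1, x1, x2) and takes the information to be the reciprocal of f(x)' (X'X)**-1 f(x), which reproduces the formula Ix = 4/(1 + x1² + x2²) given above. The helper names solve, terms, and information are illustrative only.

```python
import math

def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [[float(x) for x in row] + [float(val)] for row, val in zip(M, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def terms(x1, x2):
    """Model terms of the assumed first-order model: intercept, A, B."""
    return [1.0, x1, x2]

design = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
X = [terms(a, b) for a, b in design]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]

def information(x1, x2):
    """Reciprocal of f(x)' (X'X)^-1 f(x) at the point (x1, x2)."""
    fx = terms(x1, x2)
    return 1.0 / sum(a * b for a, b in zip(fx, solve(XtX, fx)))

# Points on a circle of radius 1, taken at different angles.
on_circle = [information(math.cos(t), math.sin(t)) for t in (0.0, 0.7, 2.0)]

print(information(0.0, 0.0))  # information at the center of the design
print(on_circle)              # the same value at every angle
```

The values on the circle agree with each other, which is what rotatability means: the information depends only on the distance from the origin, not on the direction.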

As pointed out before, in order to estimate the second order, quadratic, or non-linear component of the relationship between a factor and the dependent variable, one needs at least 3 levels for the respective factors. What does the information function look like for a simple 3-by-3 factorial design, for the second-order quadratic model as shown at the beginning of this section?

As it turns out (see Box and Draper, 1987 and Montgomery, 1991; refer also to the manual), this function looks more complex, contains "pockets" of high-density information at the edges (which are probably of little particular interest to the experimenter), and clearly it is not constant on circles around the origin. Therefore, it is not rotatable, meaning different rotations of the design points will extract different amounts of information from the experimental region.

Star-points and rotatable second-order designs. It can be shown that by adding so-called star-points to the simple (square or cube) 2-level factorial design points, one can achieve rotatable, and often orthogonal or nearly orthogonal designs. For example, adding the following points to the simple 2-by-2 orthogonal design shown earlier will produce a rotatable design.

           A         B
Run 1      1         1
Run 2      1        -1
Run 3     -1         1
Run 4     -1        -1
Run 5     -1.414     0
Run 6      1.414     0
Run 7      0        -1.414
Run 8      0         1.414
Run 9      0         0
Run 10     0         0

The first four runs in this design are the previous 2-by-2 factorial design points (or square points or cube points); runs 5 through 8 are the so-called star points or axial points, and runs 9 and 10 are center points.

The information function for this design for the second-order (quadratic) model is rotatable, that is, it is constant on the circles around the origin.
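This claim can be checked numerically. The sketch below evaluates f(x)' (X'X)**-1 f(x) for the second-order model at several points on one circle, once for the 10-run design above (with the axial distance taken as the square root of 2, shown rounded to 1.414 in the table) and once for a plain 3-by-3 factorial. The helper names and the choice of test angles are illustrative assumptions.

```python
import math

def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [[float(x) for x in row] + [float(val)] for row, val in zip(M, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def terms(x1, x2):
    """Terms of the second-order model for two factors."""
    return [1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2]

def circle_spread(design, radius=1.0, angles=(0.0, 0.5, 1.3, 2.9)):
    """Range of f(x)' (X'X)^-1 f(x) over several points on one circle."""
    X = [terms(a, b) for a, b in design]
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    values = []
    for t in angles:
        fx = terms(radius * math.cos(t), radius * math.sin(t))
        values.append(sum(a * b for a, b in zip(fx, solve(XtX, fx))))
    return max(values) - min(values)

alpha = math.sqrt(2.0)
cube = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
star = [(-alpha, 0), (alpha, 0), (0, -alpha), (0, alpha)]
center = [(0, 0), (0, 0)]

composite_spread = circle_spread(cube + star + center)
factorial_spread = circle_spread([(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)])

print(composite_spread)  # essentially zero: rotatable
print(factorial_spread)  # clearly non-zero: not rotatable
```

For the composite design the spread is zero up to rounding error, whereas for the 3-by-3 factorial the prediction variance changes with direction.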

Alpha for Rotatability and Orthogonality

The two design characteristics discussed so far -- orthogonality and rotatability -- depend on the number of center points in the design and on the so-called axial distance (alpha), which is the distance of the star points from the center of the design (i.e., 1.414 in the design shown above). It can be shown (e.g., see Box, Hunter, and Hunter, 1978; Box and Draper, 1987, Khuri and Cornell, 1987; Montgomery, 1991) that a design is rotatable if:

$$\alpha = n_c^{1/4}$$

where nc stands for the number of cube points in the design (i.e., points in the factorial portion of the design).

A central composite design is orthogonal, if one chooses the axial distance so that:

$$\alpha = \left\{ \left[ (n_c + n_s + n_0)^{1/2} - n_c^{1/2} \right]^2 \cdot n_c / 4 \right\}^{1/4}$$

where

nc is the number of cube points in the design

ns is the number of star points in the design

n0 is the number of center points in the design

To make a design both (approximately) orthogonal and rotatable, one would first choose the axial distance for rotatability, and then add center points (see Khuri and Cornell, 1987), so that:

$$n_0 \approx 4 n_c^{1/2} + 4 - 2k$$

where k stands for the number of factors in the design.
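The three rules can be tried out on the two-factor design shown earlier. The sketch below is a direct transcription of the formulas as read here (axial distance equal to the fourth root of nc for rotatability, the orthogonality expression in nc, ns, and n0, and the approximate rule for the number of center points); the function names are illustrative.

```python
def alpha_rotatable(nc):
    """Axial distance for rotatability: fourth root of the cube points."""
    return nc ** 0.25

def alpha_orthogonal(nc, ns, n0):
    """Axial distance for orthogonality, given cube, star, center points."""
    return (((nc + ns + n0) ** 0.5 - nc ** 0.5) ** 2 * nc / 4) ** 0.25

def center_points_for_both(nc, k):
    """Approximate number of center points for orthogonality and rotatability."""
    return 4 * nc ** 0.5 + 4 - 2 * k

# Two factors: 4 cube points and 4 star points.
nc, ns, k = 4, 4, 2
n0 = center_points_for_both(nc, k)

rotatable = alpha_rotatable(nc)
orthogonal = alpha_orthogonal(nc, ns, n0)

print(n0)
print(rotatable)
print(orthogonal)
```

For this case the rule suggests 8 center points, and with that many center points the axial distance for orthogonality coincides with the axial distance for rotatability (about 1.414).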

Finally, if blocking is involved, Box and Draper (1987) give the following formula for computing the axial distance to achieve orthogonal blocking, and in most cases also reasonable information function contours, that is, contours that are close to spherical:

$$\alpha = \left[ k \, (1 + n_{s0}/n_s) / (1 + n_{c0}/n_c) \right]^{1/2}$$

where

ns0 is the number of center points in the star portion of the design

ns is the number of non-center star points in the design

nc0 is the number of center points in the cube portion of the design

nc is the number of non-center cube points in the design
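As a small worked case, the sketch below evaluates the blocking formula, reading it as the square root of k*(1 + ns0/ns)/(1 + nc0/nc). The numbers of center points per block are hypothetical and chosen only for illustration.

```python
def alpha_orthogonal_blocking(k, nc, nc0, ns, ns0):
    """Axial distance for orthogonal blocking (formula as read from the text)."""
    return (k * (1 + ns0 / ns) / (1 + nc0 / nc)) ** 0.5

# Hypothetical two-factor design run in two blocks:
# a cube block with 4 cube points and 3 center points, and
# a star block with 4 star points and 3 center points.
alpha = alpha_orthogonal_blocking(k=2, nc=4, nc0=3, ns=4, ns0=3)

print(alpha)
```

With equal proportions of center points in the two blocks, the ratio in the formula cancels and the axial distance reduces to the square root of k, here about 1.414.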

Available Standard Designs

The standard central composite designs are usually constructed from a 2**(k-p) design for the cube portion of the design, which is augmented with center points and star points. Box and Draper (1987) list a number of such designs.

Small composite designs. In the standard designs, the cube portion of the design is typically of resolution V (or higher). This is, however, not necessary, and in cases when the experimental runs are expensive, or when it is not necessary to perform a statistically powerful test of model adequacy, then one could choose for the cube portion designs of resolution III. For example, it could be constructed from highly fractionalized Plackett-Burman designs. Hartley (1959) described such designs.

Analyzing Central Composite Designs

The analysis of central composite designs proceeds in much the same way as for the analysis of 3**(k-p) designs. You fit to the data the general model described above; for example, for two variables you would fit the model:

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2 + b_{11} x_1^2 + b_{22} x_2^2$$
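A minimal sketch of such a fit, using ordinary least squares via the normal equations, is shown below. The design is the 10-run central composite design from the earlier example; the "true" coefficients and the noise-free response are invented purely so that the result can be checked, and the helper names are illustrative.

```python
def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [[float(x) for x in row] + [float(val)] for row, val in zip(M, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def terms(x1, x2):
    """Terms for b0, b1, b2, b12, b11, b22 (in that order)."""
    return [1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2]

alpha = 2 ** 0.5
design = [(1, 1), (1, -1), (-1, 1), (-1, -1),
          (-alpha, 0), (alpha, 0), (0, -alpha), (0, alpha),
          (0, 0), (0, 0)]

# Hypothetical coefficients used to generate a noise-free response.
true_b = [50.0, 2.0, -1.0, 0.5, -3.0, -2.0]
X = [terms(x1, x2) for x1, x2 in design]
y = [sum(b * t for b, t in zip(true_b, row)) for row in X]

# Normal equations: (X'X) b = X'y.
p = len(true_b)
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
b_hat = solve(XtX, Xty)

print([round(b, 6) for b in b_hat])
```

Because the response was generated without error, the estimates reproduce the coefficients that were used to create it; with real data, the same computation yields the least squares estimates.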

The Fitted Response Surface

The shape of the fitted overall response can best be summarized in graphs and you can generate both contour plots and response surface plots (see examples below) for the fitted model.

Categorized Response Surfaces

You can fit 3D surfaces to your data, categorized by some other variable. For example, if you replicated a standard central composite design 4 times, it may be very informative to see how similar the surfaces are when fitted to each replication.

This would give you a graphical indication of the reliability of the results and where (e.g., in which region of the surface) deviations occur.

Clearly, the third replication produced a different surface. In replications 1, 2, and 4, the fitted surfaces are very similar to each other. Thus, one should investigate what could have caused this noticeable difference in the third replication of the design.


Latin Square Designs

Overview

Latin square designs (the term Latin square was first used by Euler, 1782) are used when the factors of interest have more than two levels and you know ahead of time that there are no (or only negligible) interactions between factors. For example, if you wanted to examine the effect of 4 fuel additives on reduction in oxides of nitrogen and had 4 cars and 4 drivers at your disposal, then you could of course run a full 4 x 4 x 4 factorial design, resulting in 64 experimental runs. However, you are not really interested in any (minor) interactions between the fuel additives and drivers, fuel additives and cars, or cars and drivers. You are mostly interested in estimating main effects, in particular the one for the fuel additives factor. At the same time, you want to make sure that the main effects for drivers and cars do not affect (bias) your estimate of the main effect for the fuel additive.

If you labeled the additives with the letters A, B, C, and D, the Latin square design that would allow you to derive unconfounded main effects estimates could be summarized as follows (see also Box, Hunter, and Hunter, 1978, page 263):

               Car
Driver    1    2    3    4
  1       A    B    D    C
  2       D    C    A    B
  3       B    D    C    A
  4       C    A    B    D

Latin Square Designs

The example shown above is actually only one of the three possible arrangements that yield unconfounded main effect estimates. These "arrangements" are also called Latin squares. The example above constitutes a 4 x 4 Latin square; rather than requiring the 64 runs of the complete factorial, you can complete the study in only 16 runs.
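The defining property of the square (each additive appears exactly once for every driver and exactly once for every car) can be verified with a short sketch. The rows below are the driver-by-car arrangement from the example; the variable names are illustrative.

```python
# Rows are drivers 1-4, columns are cars 1-4, letters are fuel additives.
square = [
    "ABDC",
    "DCAB",
    "BDCA",
    "CABD",
]

additives = set("ABCD")

# Each additive occurs exactly once in every row and every column.
rows_ok = all(set(row) == additives for row in square)
cols_ok = all(set(col) == additives for col in zip(*square))

# Expand the square into the list of experimental runs.
runs = [(driver + 1, car + 1, square[driver][car])
        for driver in range(4) for car in range(4)]

print(rows_ok, cols_ok)
print(len(runs))  # number of runs, compared with 4 * 4 * 4 for the full factorial
```

Since every additive meets every driver and every car exactly once, differences between drivers and between cars cancel out of the additive averages.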

Greco-Latin square. A nice feature of Latin Squares is that they can be superimposed to form what are called Greco-Latin squares (this term was first used by Fisher and Yates, 1934). For example, the following two 3 x 3 Latin squares can be superimposed to form a Greco-Latin square:

In the resultant Greco-Latin square design, you can evaluate the main effects of four 3-level factors (row factor, column factor, Roman letters, Greek letters) in only 9 runs.
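One standard way to obtain such a pair of squares is sketched below. It uses the construction (i + j) mod 3 and (i + 2j) mod 3, which is an assumption made for illustration and need not match the particular squares shown in the original illustration.

```python
n = 3
roman = "ABC"
greek = ["alpha", "beta", "gamma"]

# Two 3 x 3 Latin squares built with modular arithmetic.
first = [[(i + j) % n for j in range(n)] for i in range(n)]
second = [[(i + 2 * j) % n for j in range(n)] for i in range(n)]

# Superimpose them: each cell holds a (Roman, Greek) pair.
greco_latin = [[(roman[first[i][j]], greek[second[i][j]]) for j in range(n)]
               for i in range(n)]

# The squares are orthogonal if every pair occurs exactly once.
pairs = {cell for row in greco_latin for cell in row}

for row in greco_latin:
    print(row)
print(len(pairs))
```

All 9 possible pairs of Roman and Greek letters occur exactly once, so the row, column, Roman-letter, and Greek-letter factors are mutually balanced in the 9 runs.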

Hyper-Greco Latin square. For some numbers of levels, there are more than two possible Latin square arrangements. For example, there are three possible arrangements for 4-level Latin squares. If all three of them are superimposed, you get a Hyper-Greco Latin square design. In that design you can estimate the main effects of all five 4-level factors with only 16 runs in the experiment.

Analyzing the Design

Analyzing Latin square designs is straightforward. Also, plots of means can be produced to aid in the interpretation of results.

Very Large Designs, Random Effects, Unbalanced Nesting

Note that there are several other statistical methods that can also analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular the Variance Components and Mixed Model ANOVA/ANCOVA chapter discusses very efficient methods for analyzing designs with unbalanced nesting (when the nested factors have different numbers of levels within the levels of the factors in which they are nested), very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).


Taguchi Methods: Robust Design Experiments

Overview

Applications. Taguchi methods have become increasingly popular in recent years. The documented examples of sizable quality improvements that resulted from implementations of these methods (see, for example, Phadke, 1989; Noori, 1989) have added to the curiosity among American manufacturers. In fact, some of the leading manufacturers in this country have begun to use these methods, usually with great success. For example, AT&T is using these methods in the manufacture of very large scale integrated (VLSI) circuits; also, Ford Motor Company has gained significant quality improvements due to these methods (American Supplier Institute, 1984 to 1988). However, as the details of these methods are becoming more widely known, critical appraisals are also beginning to appear (for example, Bhote, 1988; Tribus and Szonyi, 1989).

Overview. Taguchi robust design methods are set apart from traditional quality control procedures (see Quality Control and Process Analysis) and industrial experimentation in various respects. Of particular importance are:

1. The concept of quality loss functions,

2. The use of signal-to-noise (S/N) ratios, and

3. The use of orthogonal arrays.

These basic aspects of robust design methods will be discussed in the following sections. Several books have recently been published on these methods, for example, Peace (1993), Phadke (1989), Ross (1988), and Roy (1990), to name a few, and it is recommended that you refer to those books for further specialized discussions. Introductory overviews of Taguchi's ideas about quality and quality improvement can also be found in Barker (1986), Garvin (1987), Kackar (1986), and Noori (1989).

Quality and Loss Functions

What is quality. Taguchi's analysis begins with the question of how to define quality. It is not easy to formulate a simple definition of what constitutes quality; however, when your new car stalls in the middle of a busy intersection -- putting yourself and other motorists at risk -- you know that your car is not of high quality. Put another way, the definition of the inverse of quality is rather straightforward: it is the total loss to you and society due to functional variations and harmful side effects associated with the respective product. Thus, as an operational definition, you can measure quality in terms of this loss, and the greater the quality loss the lower the quality.

Discontinuous (step-shaped) loss function. You can formulate hypotheses about the general nature and shape of the loss function. Assume a specific ideal point of highest quality; for example, a perfect car with no quality problems. It is customary in statistical process control (SPC; see also Process Analysis) to define tolerances around the nominal ideal point of the production process. According to the traditional view implied by common SPC methods, as long as you are within the manufacturing tolerances you do not have a problem. Put another way, within the tolerance limits the quality loss is zero; once you move outside the tolerances, the quality loss is declared to be unacceptable. Thus, according to traditional views, the quality loss function is a discontinuous step function: as long as you are within the tolerance limits, quality loss is negligible; when you step outside those tolerances, quality loss becomes unacceptable.

Quadratic loss function. Is the step function implied by common SPC methods a good model of quality loss? Return to the "perfect automobile" example. Is there a difference between a car that, within one year after purchase, has nothing wrong with it, and a car where minor rattles develop, a few fixtures fall off, and the clock in the dashboard breaks (all in-warranty repairs, mind you...)? If you ever bought a new car of the latter kind, you know very well how annoying those admittedly minor quality problems can be. The point here is that it is not realistic to assume that, as you move away from the nominal specification in your production process, the quality loss is zero as long as you stay within the set tolerance limits. Rather, if you are not exactly "on target," then loss will result, for example in terms of customer satisfaction. Moreover, this loss is probably not a linear function of the deviation from nominal specifications, but rather a quadratic function (a U-shaped curve with its minimum at the nominal value). A rattle in one place in your new car is annoying, but you would probably not get too upset about it; add two more rattles, and you might declare the car "junk." Gradual deviations from the nominal specifications do not produce proportional increments in loss, but rather squared increments.

Conclusion: Controlling variability. If, in fact, quality loss is a quadratic function of the deviation from a nominal value, then the goal of your quality improvement efforts should be to minimize the squared deviations or variance of the product around nominal (ideal) specifications, rather than the number of units within specification limits (as is done in traditional SPC procedures).
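The contrast between the two views of loss can be illustrated with a short sketch. The target, tolerance, measurements, and the loss constant k = 1 are all hypothetical; the quadratic loss is written as k times the squared deviation from target, which is the usual way to express this idea.

```python
from statistics import mean, pvariance

def step_loss(y, target, tol, cost=1.0):
    """Traditional view: no loss inside the tolerance, full loss outside."""
    return 0.0 if abs(y - target) <= tol else cost

def quadratic_loss(y, target, k=1.0):
    """Quadratic view: loss grows with the squared deviation from target."""
    return k * (y - target) ** 2

target, tol = 10.0, 0.5
centered = [9.9, 10.0, 10.1, 10.0]      # on target, little variation
off_target = [10.4, 10.4, 10.5, 10.3]   # within tolerance, but shifted

step_centered = mean(step_loss(y, target, tol) for y in centered)
step_off = mean(step_loss(y, target, tol) for y in off_target)
quad_centered = mean(quadratic_loss(y, target) for y in centered)
quad_off = mean(quadratic_loss(y, target) for y in off_target)

# Average quadratic loss = variance + squared distance of the mean from target.
decomposed = pvariance(off_target) + (mean(off_target) - target) ** 2

print(step_centered, step_off)   # both zero under the step function
print(quad_centered, quad_off)   # the shifted process shows a larger loss
print(decomposed)                # matches the average quadratic loss
```

Both samples lie entirely within the tolerance limits, so the step function cannot tell them apart, whereas the average quadratic loss is much larger for the off-target sample. The last line shows why minimizing quadratic loss amounts to controlling both the variance and the distance of the mean from the nominal value.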

Signal-to-Noise (S/N) Ratios

Measuring quality loss. Even though you have concluded that the quality loss function is probably quadratic in nature, you still do not know precisely how to measure quality loss. However, you know that whatever measure you decide upon should reflect the quadratic nature of the function.

Signal, noise, and control factors. The product of ideal quality should always respond in exactly the same manner to the signals provided by the user. When you turn the key in the ignition of your car you expect that the starter motor turns and the engine starts. In the ideal-quality car, the starting process would always proceed in exactly the same manner -- for example, after three turns of the starter motor the engine comes to life. If, in response to the same signal (turning the ignition key) there is random variability in this process, then you have less than ideal quality. For example, due to such uncontrollable factors as extreme cold, humidity, engine wear, etc. the engine may sometimes start only after turning over 20 times and finally not start at all. This example illustrates the key principle in measuring quality according to Taguchi: You want to minimize the variability in the product's performance in response to noise factors while maximizing the variability in response to signal factors.

Noise factors are those that are not under the control of the operator of a product. In the car example, those factors include temperature changes, different qualities of gasoline, engine wear, etc. Signal factors are those factors that are set or controlled by the operator of the product to make use of its intended functions (turning the ignition key to start the car).

Finally, the goal of your quality improvement effort is to find the best settings of factors under your control that are involved in the production process, in order to maximize the S/N ratio; thus, the factors in the experiment represent control factors.

S/N ratios. The conclusion of the previous paragraph is that quality can be quantified in terms of the respective product's response to noise factors and signal factors. The ideal product will only respond to the operator's signals and will be unaffected by random noise factors (weather, temperature, humidity, etc.). Therefore, the goal of your quality improvement effort can be stated as attempting to maximize the signal-to-noise (S/N) ratio for the respective product. The S/N ratios described in the following paragraphs have been proposed by Taguchi (1987).

Smaller-the-better. In cases where you want to minimize the occurrences of some undesirable product characteristics, you would compute the following S/N ratio:

$$\mathrm{Eta} = -10 \log_{10}\left[\frac{1}{n} \sum_{i} y_i^2\right] \quad \text{for } i = 1 \text{ to no. vars (see outer arrays)}$$

Here, Eta is the resultant S/N ratio; n is the number of observations on the particular product, and y is the respective characteristic. For example, the number of flaws in the paint on an automobile could be measured as the y variable and analyzed via this S/N ratio. The effect of the signal factors is zero, since zero flaws is the only intended or desired state of the paint on the car. Note how this S/N ratio is an expression of the assumed quadratic nature of the loss function. The factor -10 ensures that this ratio measures the inverse of "bad quality;" the more flaws in the paint, the greater is the sum of the squared number of flaws, and the smaller (i.e., more negative) the S/N ratio. Thus, maximizing this ratio will increase quality.

Nominal-the-best. Here, you have a fixed signal value (nominal value), and the variance around this value can be considered the result of noise factors:

$$\mathrm{Eta} = 10 \log_{10}\left(\mathrm{Mean}^2 / \mathrm{Variance}\right)$$

This signal-to-noise ratio could be used whenever ideal quality is equated with a particular nominal value. For example, the size of piston rings for an automobile engine must be as close to specification as possible to ensure high quality.

Larger-the-better. Examples of this type of engineering problem are fuel economy (miles per gallon) of an automobile, strength of concrete, resistance of shielding materials, etc. The following S/N ratio should be used:

$$\mathrm{Eta} = -10 \log_{10}\left[\frac{1}{n} \sum_{i} \frac{1}{y_i^2}\right] \quad \text{for } i = 1 \text{ to no. vars (see outer arrays)}$$

Signed target. This type of S/N ratio is appropriate when the quality characteristic of interest has an ideal value of 0 (zero), and both positive and negative values of the quality characteristic may occur. For example, the dc offset voltage of a differential operational amplifier may be positive or negative (see Phadke, 1989). The following S/N ratio should be used for these types of problems:

$$\mathrm{Eta} = -10 \log_{10}\left(s^2\right) \quad \text{for } i = 1 \text{ to no. vars (see outer arrays)}$$

where s² stands for the variance of the quality characteristic across the measurements (variables).

Fraction defective. This S/N ratio is useful for minimizing scrap, minimizing the percent of patients who develop side-effects to a drug, etc. Taguchi also refers to the resultant Eta values as Omegas; note that this S/N ratio is identical to the familiar logit transformation (see also Nonlinear Estimation):

Eta = -10 * log10[p/(1-p)]

where

p is the proportion defective
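The S/N ratios described above can be collected in a few small functions. This is a sketch rather than a reference implementation: the function names are illustrative, and the use of the sample variance (divisor n - 1) in the nominal-the-best and signed-target ratios is an assumption, since the text does not say which variance estimate is meant.

```python
import math
from statistics import mean, variance

def sn_smaller_the_better(y):
    """-10 * log10 of the mean squared value."""
    return -10 * math.log10(sum(v * v for v in y) / len(y))

def sn_nominal_the_best(y):
    """10 * log10 of squared mean over variance (sample variance assumed)."""
    return 10 * math.log10(mean(y) ** 2 / variance(y))

def sn_larger_the_better(y):
    """-10 * log10 of the mean squared reciprocal."""
    return -10 * math.log10(sum(1 / (v * v) for v in y) / len(y))

def sn_signed_target(y):
    """-10 * log10 of the variance (sample variance assumed)."""
    return -10 * math.log10(variance(y))

def sn_fraction_defective(p):
    """-10 * log10 of the odds p / (1 - p)."""
    return -10 * math.log10(p / (1 - p))

# Hypothetical paint-flaw counts for two products.
few_flaws = [1, 2, 1]
many_flaws = [3, 4, 5]

print(sn_smaller_the_better(few_flaws))
print(sn_smaller_the_better(many_flaws))
```

The product with more flaws receives the smaller (more negative) ratio, so in every case the setting with the largest Eta is the preferred one.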

Ordered categories (the accumulation analysis). In some cases, measurements on a quality characteristic can only be obtained in terms of categorical judgments. For example, consumers may rate a product as excellent, good, average, or below average. In that case, you would attempt to maximize the number of excellent or good ratings. Typically, the results of an accumulation analysis are summarized graphically in a stacked bar plot.

Orthogonal Arrays

The third aspect of Taguchi robust design methods is the one most similar to traditional techniques. Taguchi has developed a system of tabulated designs (arrays) that allow for the maximum number of main effects to be estimated in an unbiased (orthogonal) manner, with a minimum number of runs in the experiment. Latin square designs, 2**(k-p) designs (Plackett-Burman designs, in particular), and Box-Behnken designs are also aimed at accomplishing this goal. In fact, many of the standard orthogonal arrays tabulated by Taguchi are identical to fractional two-level factorials, Plackett-Burman designs, Box-Behnken designs, Latin squares, Greco-Latin squares, etc.

Analyzing Designs

Most analyses of robust design experiments amount to a standard ANOVA of the respective S/N ratios, ignoring two-way or higher-order interactions. However, when estimating error variances, one customarily pools together main effects of negligible size.

Analyzing S/N ratios in standard designs. It should be noted at this point that, of course, all of the designs discussed up to this point (e.g., 2**(k-p), 3**(k-p), mixed 2 and 3 level factorials, Latin squares, central composite designs) can be used to analyze S/N ratios that you computed. In fact, the many additional diagnostic plots and other options available for those designs (e.g., estimation of quadratic components, etc.) may prove very useful when analyzing the variability (S/N ratios) in the production process.

Plot of means. A visual summary of the experiment is the plot of the average Eta (S/N ratio) by factor levels. In this plot, the optimum setting (i.e., largest S/N ratio) for each factor can easily be identified.

Verification experiments. For prediction purposes, you can compute the expected S/N ratio given a user-defined combination of settings of factors (ignoring factors that were pooled into the error term). These predicted S/N ratios can then be used in a verification experiment, where the engineer actually sets the machine accordingly and compares the resultant observed S/N ratio with the predicted S/N ratio from the experiment. If major deviations occur, one must conclude that the simple main effect model is not appropriate.

In those cases, Taguchi (1987) recommends transforming the dependent variable to accomplish additivity of factors, that is, to "make" the main effects model fit. Phadke (1989, Chapter 6) also discusses in detail methods for achieving additivity of factors.

Accumulation Analysis

When analyzing ordered categorical data, ANOVA is not appropriate. Rather, you produce a cumulative plot of the number of observations in a particular category. For each level of each factor, you plot the cumulative proportion of the number of defectives. Thus, this graph provides valuable information concerning the distribution of the categorical counts across the different factor settings.

Summary

To briefly summarize, when using Taguchi methods you first need to determine the design or control factors that can be set by the designer or engineer. Those are the factors in the experiment for which you will try different levels. Next, you select an appropriate orthogonal array for the experiment. Then, you decide how to measure the quality characteristic of interest. Remember that most S/N ratios require that multiple measurements be taken in each run of the experiment; otherwise, for example, the variability around the nominal value cannot be assessed. Finally, you conduct the experiment and identify the factors that most strongly affect the chosen S/N ratio, and you reset your machine or production process accordingly.


Mixture Designs and Triangular Surfaces

Overview

Special issues arise when analyzing mixtures of components that must sum to a constant. For example, if you wanted to optimize the taste of a fruit-punch, consisting of the juices of 5 fruits, then the sum of the proportions of all juices in each mixture must be 100%. Thus, the task of optimizing mixtures commonly occurs in food-processing, refining, or the manufacturing of chemicals. A number of designs have been developed to address specifically the analysis and modeling of mixtures (see, for example, Cornell, 1990a, 1990b; Cornell and Khuri, 1987; Deming and Morgan, 1993; Montgomery, 1991).

Triangular Coordinates

The common manner in which mixture proportions can be summarized is via triangular (ternary) graphs. For example, suppose you have a mixture that consists of 3 components A, B, and C. Any mixture of the three components can be summarized by a point in the triangular coordinate system defined by the three variables.

For example, take the following 6 different mixtures of the 3 components.

A      B      C
1      0      0
0      1      0
0      0      1
0.5    0.5    0
0.5    0      0.5
0      0.5    0.5

The sum for each mixture is 1.0, so the values for the components in each mixture can be interpreted as proportions. If you graph these data in a regular 3D scatterplot, it becomes apparent that the points form a triangle in the 3D space. Only the points inside the triangle where the sum of the component values is equal to 1 are valid mixtures. Therefore, one can simply plot only the triangle to summarize the component values (proportions) for each mixture.

To read off the coordinates of a point in the triangular graph, you would simply "drop" a line from each respective vertex to the side of the triangle below.

At the vertex for the particular factor, there is a pure blend, that is, one that contains only the respective component. Thus, the coordinate at the vertex point is 1 (or 100%, or however else the mixtures are scaled) for the respective component, and 0 (zero) for all other components. Along the side opposite to the respective vertex, the value for the respective component is 0 (zero); at the midpoint of that side, the value is .5 (or 50%, etc.) for each of the other two components.
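To draw such a graph with ordinary plotting tools, each blend has to be mapped to a point in the plane. The sketch below shows one common mapping. The placement of the vertices (A at the top, B at the lower left, C at the lower right) and the function name are assumptions made for illustration; other orientations work equally well.

```python
import math


def ternary_to_xy(a, b, c):
    """Map mixture proportions (a, b, c) to 2D plotting coordinates.

    Assumed layout: vertex A at (0.5, sqrt(3)/2), B at (0, 0), C at (1, 0).
    The proportions are rescaled first, so percentages work as well.
    """
    total = a + b + c
    a, b, c = a / total, b / total, c / total
    x = 0.5 * a + c                # weighted average of the vertex x-values
    y = (math.sqrt(3) / 2) * a     # only vertex A lies above the base
    return x, y


# The six example mixtures: three pure blends and three 50/50 blends.
mixtures = [(1, 0, 0), (0, 1, 0), (0, 0, 1),
            (0.5, 0.5, 0), (0.5, 0, 0.5), (0, 0.5, 0.5)]
for blend in mixtures:
    print(blend, ternary_to_xy(*blend))
```

Pure blends land on the corners of the triangle and binary 50/50 blends land on the midpoints of its sides, which is the pattern described above.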

Triangular Surfaces and Contours

One can now add to the triangle a fourth dimension, that is perpendicular to the first three. Using that dimension, one could plot the values for a dependent variable, or function (surface) that was fit to the dependent variable. Note that the response surface can either be shown in 3D, where the predicted response (Taste rating) is indicated by the distance of the surface from the triangular plane, or it can be indicated in a contour plot where the contours of constant height are plotted on the 2D triangle.

It should be mentioned at this point that you can produce categorized ternary graphs. These are very useful, because they allow you to fit to a dependent variable (e.g., Taste) a response surface, for different levels of a fourth component.

The Canonical Form of Mixture Polynomials

Fitting a response surface to mixture data is, in principle, done in the same manner as fitting surfaces to, for example, data from central composite designs. However, there is the issue that mixture data are constrained, that is, the sum of all component values must be constant.

Consider the simple case of two factors A and B. One may want to fit the simple linear model:

y = b0 + bA*xA + bB*xB

Here y stands for the dependent variable values, bA and bB stand for the regression coefficients, and xA and xB stand for the values of the factors. Suppose that xA and xB must sum to 1; you can then multiply b0 by 1 = (xA + xB):

y = (b0*xA + b0*xB) + bA*xA + bB*xB

or:

y = b'A*xA + b'B*xB

where b'A = b0 + bA and b'B = b0 + bB. Thus, the estimation of this model comes down to fitting a no-intercept multiple regression model. (See also Multiple Regression, for details concerning multiple regression.)
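The equivalence of the two parameterizations can be checked numerically. The following sketch uses made-up coefficient values (b0, bA, and bB are purely hypothetical) and confirms that the intercept model and the canonical no-intercept model give the same prediction for any blend with xA + xB = 1.

```python
def with_intercept(x_a, x_b, b0, b_a, b_b):
    """y = b0 + bA*xA + bB*xB"""
    return b0 + b_a * x_a + b_b * x_b


def canonical(x_a, x_b, b0, b_a, b_b):
    """y = b'A*xA + b'B*xB, with b'A = b0 + bA and b'B = b0 + bB."""
    bp_a, bp_b = b0 + b_a, b0 + b_b
    return bp_a * x_a + bp_b * x_b


# Hypothetical coefficients, chosen only for illustration.
b0, b_a, b_b = 2.0, 1.5, -0.5

for x_a in (0.0, 0.25, 0.5, 0.75, 1.0):
    x_b = 1.0 - x_a  # the mixture constraint
    y1 = with_intercept(x_a, x_b, b0, b_a, b_b)
    y2 = canonical(x_a, x_b, b0, b_a, b_b)
    print(x_a, x_b, y1, y2)
```

Note that the agreement holds only on the constraint xA + xB = 1; for points off that line the two expressions differ, which is why the intercept is redundant for mixture data but not in general.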

Common Models for Mixture Data

The quadratic and cubic model can be similarly simplified (as illustrated for the simple linear model above), yielding four standard models that are customarily fit to the mixture data. Here are the formulas for the 3-variable case for those models (see Cornell, 1990, for additional details).

Linear model:

y = b1*x1 + b2*x2 + b3*x3

Quadratic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 + b23*x2*x3

Special cubic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b123*x1*x2*x3

Full cubic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + d12*x1*x2*(x1 - x2) + d13*x1*x3*(x1 - x3) + d23*x2*x3*(x2 - x3) + b123*x1*x2*x3

(Note that the dij's are also parameters of the model.)
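To fit any of these models by no-intercept regression, each blend must be expanded into the corresponding regressor columns. The sketch below builds those columns for an arbitrary number of components. The function name and the model labels are assumptions for illustration; the term order follows the formulas above.

```python
from itertools import combinations

MODELS = ("linear", "quadratic", "special cubic", "full cubic")


def mixture_terms(x, model="quadratic"):
    """Return the regressor values for one blend x = (x1, ..., xq)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    q = len(x)
    pairs = list(combinations(range(q), 2))
    triples = list(combinations(range(q), 3))

    terms = list(x)                                   # b_i * x_i
    if model != "linear":
        terms += [x[i] * x[j] for i, j in pairs]      # b_ij * x_i * x_j
    if model == "full cubic":
        # d_ij * x_i * x_j * (x_i - x_j)
        terms += [x[i] * x[j] * (x[i] - x[j]) for i, j in pairs]
    if model in ("special cubic", "full cubic"):
        # b_ijk * x_i * x_j * x_k
        terms += [x[i] * x[j] * x[k] for i, j, k in triples]
    return terms


blend = (0.5, 0.5, 0)
for name in MODELS:
    print(name, len(mixture_terms(blend, name)))
```

The length of the returned list is the number of parameters in the respective model, so the same function can be used to check whether a design has enough distinct blends to estimate it.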

Standard Designs for Mixture Experiments

Two different types of standard designs are commonly used for experiments with mixtures. Both of them will evaluate the triangular response surface at the vertices (i.e., the corners of the triangle) and the centroids (sides of the triangle). Sometimes, those designs are enhanced with additional interior points.

Simplex-lattice designs. In this arrangement of design points, m+1 equally spaced proportions are tested for each factor or component in the model:

xi = 0, 1/m, 2/m, ..., 1    for i = 1, 2, ..., q

and all combinations of factor levels are tested. The resulting design is called a {q,m} simplex lattice design. For example, a {q=3, m=2} simplex lattice design will include the following mixtures:

A     B     C
1     0     0
0     1     0
0     0     1
.5    .5    0
.5    0     .5
0     .5    .5

A {q=3,m=3} simplex lattice design will include the points:

A      B      C
1      0      0
0      1      0
0      0      1
1/3    2/3    0
1/3    0      2/3
0      1/3    2/3
2/3    1/3    0
2/3    0      1/3
0      2/3    1/3
1/3    1/3    1/3
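The points of a {q,m} simplex-lattice design can be enumerated mechanically. The sketch below (the function name is an assumption) forms all combinations of the m+1 levels and keeps those that satisfy the mixture constraint; exact fractions are used so that the test "sum equals 1" is not affected by round-off.

```python
from fractions import Fraction
from itertools import product


def simplex_lattice(q, m):
    """All blends of q components with levels 0, 1/m, ..., 1 that sum to 1."""
    levels = [Fraction(k, m) for k in range(m + 1)]
    return [pt for pt in product(levels, repeat=q) if sum(pt) == 1]


for point in simplex_lattice(3, 3):
    print("  ".join(str(v) for v in point))
```

For q=3 this yields 6 points when m=2 and 10 points when m=3, the same sets of blends listed above (in a different order).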

Simplex-centroid designs. An alternative arrangement of settings introduced by Scheffé (1963) is the so-called simplex-centroid design. Here the design points correspond to all permutations of the pure blends (e.g., 1 0 0; 0 1 0; 0 0 1), the permutations of the binary blends (½ ½ 0; ½ 0 ½; 0 ½ ½), the permutations of the blends involving three components, and so on. For example, for 3 factors the simplex centroid design consists of the points:

A      B      C
1      0      0
0      1      0
0      0      1
1/2    1/2    0
1/2    0      1/2
0      1/2    1/2
1/3    1/3    1/3
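The simplex-centroid points can be generated in the same spirit: for every non-empty subset of components, the blend splits the total equally among the members of that subset. The following is a minimal sketch; the function name is an assumption.

```python
from fractions import Fraction
from itertools import combinations


def simplex_centroid(q):
    """Pure blends, 1/2-1/2 binary blends, 1/3-1/3-1/3 ternary blends, etc."""
    points = []
    for k in range(1, q + 1):                      # number of components used
        for subset in combinations(range(q), k):
            points.append(tuple(Fraction(1, k) if i in subset else Fraction(0)
                                for i in range(q)))
    return points


for point in simplex_centroid(3):
    print("  ".join(str(v) for v in point))
```

A design for q components therefore has 2**q - 1 points: 7 for three components, 15 for four, and so on.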

Adding interior points. These designs are sometimes augmented with interior points (see Khuri and Cornell, 1987, page 343; Mason, Gunst, and Hess, 1989, page 230). For example, for 3 factors one could add the interior points:

A      B      C
2/3    1/6    1/6
1/6    2/3    1/6
1/6    1/6    2/3

If you plot these points in a scatterplot with triangular coordinates, you can see how these designs evenly cover the experimental region defined by the triangle.

Lower Constraints

The designs described above all require vertex points, that is, pure blends consisting of only one ingredient. In practice, those points may often not be valid, that is, pure blends cannot be produced because of cost or other constraints. For example, suppose you wanted to study the effect of a food additive on the taste of the fruit-punch. The additional ingredient may only be varied within small limits, for example, it may not exceed a certain percentage of the total. Clearly, a fruit punch that is a pure blend, consisting only of the additive, would not be a fruit punch at all, or worse, may be toxic. These types of constraints are very common in many applications of mixture experiments.

Let us consider a 3-component example, where component A is constrained so that xA >= .3. The total of the 3-component mixture must be equal to 1. This constraint can be visualized in a triangular graph by a line at the triangular coordinate for xA=.3, that is, a line that is parallel to the triangle's edge opposite to the A vertex point.

One can now construct the design as before, except that one side of the triangle is defined by the constraint. Later, in the analysis, one can review the parameter estimates for the so-called pseudo-components, treating the constrained triangle as if it were a full triangle.

Multiple constraints. Multiple lower constraints can be treated analogously, that is, you can construct the sub-triangle within the full triangle, and then place the design points in that sub-triangle according to the chosen design.

Upper and Lower Constraints

When there are both upper and lower constraints (as is often the case in experiments involving mixtures), then the standard simplex-lattice and simplex-centroid designs can no longer be constructed, because the subregion defined by the constraints is no longer a triangle. There is a general algorithm for finding the vertex and centroid points for such constrained designs.

Note that you can still analyze such designs by fitting the standard models to the data.

Analyzing Mixture Experiments

The analysis of mixture experiments amounts to a multiple regression with the intercept set to zero. As explained earlier, the mixture constraint -- that the sum of all components must be constant -- can be accommodated by fitting multiple regression models that do not include an intercept term. If you are not familiar with multiple regression, you may want to review at this point Multiple Regression.

The specific models that are usually considered were described earlier. To summarize, one fits to the dependent variable response surfaces of increasing complexity, that is, starting with the linear model, then the quadratic model, special cubic model, and full cubic model. Shown below is a table with the number of terms or parameters in each model, for a selected number of components (see also Table 4, Cornell, 1990):

                 Model (Degree of Polynomial)
No. of Comp.   Linear   Quadr.   Special Cubic   Full Cubic
2              2        3        --              --
3              3        6        7               10
4              4        10       14              20
5              5        15       25              35
6              6        21       41              56
7              7        28       63              84
8              8        36       92              120

Analysis of Variance

To decide which of the models of increasing complexity provides a sufficiently good fit to the observed data, one usually compares the models in a hierarchical, stepwise fashion. For example, consider a 3-component mixture to which the full cubic model was fitted.

ANOVA; Var.:DV (mixt4.sta)

3 Factor mixture design; Mixture total=1., 14 Runs

Sequential fit of models of increasing complexity

Model            SS Effect   df Effect   MS Effect   SS Error   df Error   MS Error   F        p       R-sqr   R-sqr Adj.
Linear           44.755      2           22.378      46.872     11         4.2611     5.2516   .0251   .4884   .3954
Quadratic        30.558      3           10.186      16.314     8          2.0393     4.9949   .0307   .8220   .7107
Special Cubic    .719        1           .719        15.596     7          2.2279     .3225    .5878   .8298   .6839
Cubic            8.229       3           2.743       7.367      4          1.8417     1.4893   .3452   .9196   .7387
Total Adjusted   91.627      13          7.048

First, the linear model was fit to the data. Even though this model has 3 parameters, one for each component, this model has only 2 degrees of freedom. This is because of the overall mixture constraint, that the sum of all component values is constant. The simultaneous test for all parameters of this model is statistically significant (F(2,11)=5.25; p<.05). The addition of the 3 quadratic model parameters (b12*x1*x2, b13*x1*x3, b23*x2*x3) further significantly improves the fit of the model (F(3,8)=4.99; p<.05). However, adding the parameters for the special cubic and cubic models does not significantly improve the fit of the surface. Thus one could conclude that the quadratic model provides an adequate fit to the data (of course, pending further examination of the residuals for outliers, etc.).
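The F and R-square columns of the sequential table follow directly from the sums of squares and degrees of freedom. The sketch below recomputes them from the rounded values printed in the table, so small discrepancies in the last digit are to be expected. The function name is an assumption, and the p-values are omitted because they require the F distribution, which is not part of the Python standard library.

```python
def sequential_fit(ss_effect, df_effect, ss_error, df_error, ss_total):
    """Return (F, cumulative R-square) for each step of the sequential fit."""
    rows = []
    explained = 0.0
    for ss_e, df_e, ss_r, df_r in zip(ss_effect, df_effect, ss_error, df_error):
        f_ratio = (ss_e / df_e) / (ss_r / df_r)   # MS Effect / MS Error
        explained += ss_e                          # SS explained so far
        rows.append((f_ratio, explained / ss_total))
    return rows


# Values copied from the table: Linear, Quadratic, Special Cubic, Cubic.
ss_effect = [44.755, 30.558, 0.719, 8.229]
df_effect = [2, 3, 1, 3]
ss_error = [46.872, 16.314, 15.596, 7.367]
df_error = [11, 8, 7, 4]

results = sequential_fit(ss_effect, df_effect, ss_error, df_error, ss_total=91.627)
for name, (f_ratio, r_sqr) in zip(["Linear", "Quadratic", "Special Cubic", "Cubic"], results):
    print(f"{name:14s} F = {f_ratio:.4f}   R-sqr = {r_sqr:.4f}")
```

Each F tests only the terms added at that step against the residual of the model that includes them, which is why the degrees of freedom for the error shrink from 11 to 4 as terms are added.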

R-square. The R-square value can be interpreted as the proportion of variability around the mean for the dependent variable that can be accounted for by the respective model. (Note that for non-intercept models, some multiple regression programs will only compute the R-square value pertaining to the proportion of variance around 0 (zero) accounted for by the independent variables; for more information, see Kvalseth, 1985; Okunade, Chang, and Evans, 1993.)

Pure error and lack of fit. The usefulness of the estimate of pure error for assessing the overall lack of fit was discussed in the context of central composite designs. If some runs in the design were replicated, then one can compute an estimate of error variability based only on the variability between replicated runs. This variability provides a good indication of the unreliability in the measurements, independent of the model that was fit to the data, since it is based on identical factor settings (or blends in this case). One can test the residual variability after fitting the current model against this estimate of pure error. If this test is statistically significant, that is, if the residual variability is significantly larger than the pure error variability, then one can conclude that, most likely, there are additional significant differences between blends that cannot be accounted for by the current model. Thus, there may be an overall lack of fit of the current model. In that case, try a more complex model, perhaps by only adding individual terms of the next higher-order model (e.g., only the b13*x1*x3 to the linear model).

Parameter Estimates

Usually, after fitting a particular model, one would next review the parameter estimates. Remember that the linear terms in mixture models are constrained, that is, the sum of the components must be constant. Hence, independent statistical significance tests for the linear components cannot be performed.

Pseudo-Components

To allow for scale-independent comparisons of the parameter estimates, during the analysis, the component settings are customarily recoded to so-called pseudo-components so that (see also Cornell, 1993, Chapter 3):

x'i = (xi-Li)/(Total-L)

Here, x'i stands for the i'th pseudo-component, xi stands for the original component value, Li stands for the lower constraint (limit) for the i'th component, L stands for the sum of all lower constraints (limits) for all components in the design, and Total is the mixture total.
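The recoding is a one-line computation per component. The following sketch applies it and also inverts it; the function names and the example limits are illustrative assumptions, and the inverse is simply the formula above solved for xi.

```python
def to_pseudo(x, lower, total=1.0):
    """x'i = (xi - Li) / (Total - L), where L is the sum of all lower limits."""
    L = sum(lower)
    return [(xi - li) / (total - L) for xi, li in zip(x, lower)]


def from_pseudo(x_pseudo, lower, total=1.0):
    """Invert the recoding: xi = Li + x'i * (Total - L)."""
    L = sum(lower)
    return [li + xp * (total - L) for xp, li in zip(x_pseudo, lower)]


# Hypothetical lower limits: component A must be at least .3 of the blend.
lower = [0.3, 0.0, 0.0]
blend = [0.3, 0.7, 0.0]          # a corner of the constrained sub-triangle
pseudo = to_pseudo(blend, lower)
print(pseudo)
print(from_pseudo(pseudo, lower))
```

The pseudo-components again sum to 1, so the constrained sub-triangle can be treated exactly like a full triangle when constructing or analyzing the design.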

The issue of lower constraints was also discussed earlier in this section. If the design is a standard simplex-lattice or simplex-centroid design (see above), then this transformation amounts to a rescaling of factors so as to form a sub-triangle (sub-simplex) as defined by the lower constraints. However, you can also compute the parameter estimates based on the original (untransformed) metric of the components in the experiment. If you want to use the fitted parameter values for prediction purposes (i.e., to predict dependent variable values), then the parameters for the untransformed components are often more convenient to use. Note that the results dialog for mixture experiments contains options to make predictions for the dependent variable for user-defined values of the components, in their original metric.

Graph Options

Surface and contour plots. The respective fitted model can be visualized in triangular surface plots or contour plots, which, optionally, can also include the respective fitted function.

Note that the fitted function displayed in the surface and contour plots always pertains to the parameter estimates for the pseudo-components.

Categorized surface plots. If your design involves replications (and the replications are coded in your data file), then you can use 3D Ternary Plots to look at the respective fit, replication by replication.

Of course, if you have other categorical variables in your study (e.g., operator or experimenter; machine, etc.) you can also categorize the 3D surface plot by those variables.

Trace plots. One aid for interpreting the triangular response surface is the so-called trace plot. Suppose you looked at the contour plot of the response surface for three components. Then, determine a reference blend for two of the components, for example, hold the values for A and B at 1/3 each. Keeping the relative proportions of A and B constant (i.e., equal proportions in this case), you can then plot the estimated response (values for the dependent variable) for different values of C.

If the reference blend for A and B is 1:1, then the resulting line or response trace is the axis for factor C; that is, the line from the C vertex point connecting with the opposite side of the triangle at a right angle. However, trace plots for other reference blends can also be produced. Typically, the trace plot contains the traces for all components, given the current reference blend.

Residual plots. Finally, it is important, after deciding on a model, to review the prediction residuals, in order to identify outliers or regions of misfit. In addition, one should review the standard normal probability plot of residuals and the scatterplot of observed versus predicted values. Remember that the multiple regression analysis (i.e., the process of fitting the surface) assumes that the residuals are normally distributed, and one should carefully review the residuals for any apparent outliers.


Designs for Constrained Surfaces and Mixtures

Overview

As mentioned in the context of mixture designs, it often happens in real-world studies that the experimental region of interest is constrained, that is, that not all factor settings can be combined with all settings for the other factors in the study. There is an algorithm suggested by Piepel (1988) and Snee (1985) for finding the vertices and centroids for such constrained regions.

Designs for Constrained Experimental Regions

When an experiment with many factors has constraints on the possible values of those factors and their combinations, it is not obvious how to proceed. A reasonable approach is to include in the experiment runs at the extreme vertex points and centroid points of the constrained region, which should usually provide good coverage of the constrained experimental region (e.g., see Piepel, 1988; Snee, 1975). In fact, the mixture designs reviewed in the previous section provide examples of such designs, since they are typically constructed to include the vertex and centroid points of the constrained region that consists of a triangle (simplex).

Linear Constraints

One general way in which one can summarize most constraints that occur in real world experimentation is in terms of a linear equation (see Piepel, 1988):

A1*x1 + A2*x2 + ... + Aq*xq + A0 >= 0

Here, A0, ..., Aq are the parameters for the linear constraint on the q factors, and x1, ..., xq stand for the factor values (levels) for the q factors. This general formula can accommodate even very complex constraints. For example, suppose that in a two-factor experiment the first factor must always be set at least twice as high as the second, that is, x1 >= 2*x2. This simple constraint can be rewritten as x1 - 2*x2 >= 0. The ratio constraint 2*x1/x2 >= 1 can be rewritten as 2*x1 - x2 >= 0 (for positive x2), and so on.

The problem of multiple upper and lower constraints on the component values in mixtures was discussed earlier, in the context of mixture experiments. For example, suppose in a three-component mixture of fruit juices, the upper and lower constraints on the components are (see example 3.2, in Cornell 1993):

40% <= Watermelon (x1) <= 80%

10% <= Pineapple (x2) <= 50%

10% <= Orange (x3) <= 30%

These constraints can be rewritten as linear constraints in the form:

Watermelon:
x1 - 40 >= 0
-x1 + 80 >= 0

Pineapple:
x2 - 10 >= 0
-x2 + 50 >= 0

Orange:
x3 - 10 >= 0
-x3 + 30 >= 0

Thus, the problem of finding design points for mixture experiments with components with multiple upper and lower constraints is only a special case of general linear constraints.
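Once every restriction is written in the general form A1*x1 + ... + Aq*xq + A0 >= 0, checking whether a candidate blend is admissible becomes a uniform calculation. The sketch below encodes the fruit-juice limits in that form. The data structure (a list of coefficient tuples paired with the constant A0), the function name, and the small numerical tolerance are assumptions made for illustration.

```python
def satisfies(point, constraints, tol=1e-9):
    """True if A1*x1 + ... + Aq*xq + A0 >= 0 holds for every constraint."""
    return all(
        sum(a * x for a, x in zip(coeffs, point)) + a0 >= -tol
        for coeffs, a0 in constraints
    )


# Fruit-juice limits (in percent), each written as (A1, A2, A3), A0.
juice_constraints = [
    ((1, 0, 0), -40), ((-1, 0, 0), 80),   # 40 <= watermelon <= 80
    ((0, 1, 0), -10), ((0, -1, 0), 50),   # 10 <= pineapple  <= 50
    ((0, 0, 1), -10), ((0, 0, -1), 30),   # 10 <= orange     <= 30
]

print(satisfies((60, 20, 20), juice_constraints))  # inside all limits
print(satisfies((30, 40, 30), juice_constraints))  # too little watermelon
```

The mixture restriction itself (the components must sum to 100%) is an equality and is not part of this list; a complete validity check would test it separately.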

The Piepel & Snee Algorithm

For the special case of constrained mixtures, algorithms such as the XVERT algorithm (see, for example, Cornell, 1990) are often used to find the vertex and centroid points of the constrained region (inside the triangle of three components, tetrahedron of four components, etc.). The general algorithm proposed by Piepel (1988) and Snee (1979) for finding vertices and centroids can be applied to mixtures as well as non-mixtures. The general approach of this algorithm is described in detail by Snee (1979).

Specifically, the algorithm considers each constraint one by one, written as a linear equation as described above. Each constraint represents a line (or plane) through the experimental region. For each successive constraint, the algorithm evaluates whether or not the current (new) constraint crosses into the current valid region of the design. If so, new vertices are computed that define the new valid experimental region, updated for the most recent constraint. The algorithm then checks whether any of the previously processed constraints have become redundant, that is, define lines or planes in the experimental region that are now entirely outside the valid region. After all constraints have been processed, the centroids for the sides of the constrained region (of the order requested by the user) are computed. For the two-dimensional (two-factor) case, one can easily recreate this process by simply drawing lines through the experimental region, one for each constraint; what is left is the valid experimental region.

For more information, see Piepel (1988) or Snee (1979).

Choosing Points for the Experiment

Once the vertices and centroids have been computed, you may face the problem of having to select a subset of points for the experiment. If each experimental run is costly, then it may not be feasible to simply run all vertex and centroid points. In particular, when there are many factors and constraints, then the number of centroids can quickly get very large.

If you are screening a large number of factors, and are not interested in non-linear effects, then choosing the vertex points only will usually yield good coverage of the experimental region. To increase statistical power (to increase the degrees of freedom for the ANOVA error term), you may also want to include a few runs with the factors set at the overall centroid of the constrained region.

If you are considering a number of different models that you might fit once the data have been collected, then you may want to use the D- and A-optimal design options. Those options will help you select the design points that will extract the maximum amount of information from the constrained experimental region, given your models.

Analyzing Designs for Constrained Surfaces and Mixtures

As mentioned in the section on central composite designs and mixture designs, once the constrained design points have been chosen for the final experiment, and the data for the dependent variables of interest have been collected, the analysis of these designs can proceed in the standard manner.

For example, Cornell (1990, page 68) describes an experiment of three plasticizers, and their effect on resultant vinyl thickness (for automobile seat covers). The constraints for the three plasticizers components x1, x2, and x3 are:

.409 <= x1 <= .849

.000 <= x2 <= .252

.151 <= x3 <= .274

(Note that these values are already rescaled, so that the total for each mixture must be equal to 1.) The vertex and centroid points generated are:

x1       x2       x3
.8490    .0000    .1510
.7260    .0000    .2740
.4740    .2520    .2740
.5970    .2520    .1510
.6615    .1260    .2125
.7875    .0000    .2125
.6000    .1260    .2740
.5355    .2520    .2125
.7230    .1260    .1510
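For a small mixture problem with only upper and lower limits on the components, the vertices can be found by simple enumeration. The sketch below is not the Piepel and Snee algorithm (nor XVERT); it is a brute-force illustration that sets all but one component to a limit, solves for the remaining component from the mixture total, and keeps the blend if that component is within its own limits. The function names are assumptions, and the overall centroid is taken here to be the average of the vertices. Exact fractions avoid round-off in the comparisons.

```python
from fractions import Fraction
from itertools import product


def extreme_vertices(lower, upper, total=1):
    """Enumerate vertices of a mixture region with simple bounds."""
    q = len(lower)
    found = []
    for free in range(q):                       # component solved from the total
        others = [i for i in range(q) if i != free]
        for bounds in product(*[(lower[i], upper[i]) for i in others]):
            point = [None] * q
            for i, value in zip(others, bounds):
                point[i] = value
            point[free] = total - sum(bounds)
            valid = lower[free] <= point[free] <= upper[free]
            if valid and tuple(point) not in found:
                found.append(tuple(point))
    return found


def overall_centroid(vertices):
    """Average of the vertex coordinates, component by component."""
    n = len(vertices)
    return tuple(sum(v[i] for v in vertices) / n for i in range(len(vertices[0])))


lower = [Fraction(s) for s in ("0.409", "0.000", "0.151")]
upper = [Fraction(s) for s in ("0.849", "0.252", "0.274")]
vertices = extreme_vertices(lower, upper)
for v in vertices + [overall_centroid(vertices)]:
    print("  ".join(f"{float(c):.4f}" for c in v))
```

For the plasticizer limits this reproduces the four vertex points and the overall centroid listed in the table; the remaining four rows of the table are the midpoints of the edges that connect adjacent vertices.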


Constructing D- and A-Optimal Designs

Overview

In the sections on standard factorial designs (see 2**(k-p) Fractional Factorial Designs and 3**(k-p), Box Behnken, and Mixed 2 and 3 Level Factorial Designs) and Central Composite Designs, the property of orthogonality of factor effects was discussed. In short, when the factor level settings for two factors in an experiment are uncorrelated, that is, when they are varied independently of each other, then they are said to be orthogonal to each other. (If you are familiar with matrix and vector algebra, two column vectors X1 and X2 in the design matrix are orthogonal if X1'*X2= 0). Intuitively, it should be clear that one can extract the maximum amount of information regarding a dependent variable from the experimental region (the region defined by the settings of the factor levels), if all factor effects are orthogonal to each other. Conversely, suppose one ran a four-run experiment for two factors as follows:

         x1    x2
Run 1     1     1
Run 2     1     1
Run 3    -1    -1
Run 4    -1    -1

Now the columns of factor settings for X1 and X2 are identical to each other (their correlation is 1), and there is no way in the results to distinguish between the main effect for X1 and X2.

The D- and A-optimal design procedures provide various options to select from a list of valid (candidate) points (i.e., combinations of factor settings) those points that will extract the maximum amount of information from the experimental region, given the respective model that you expect to fit to the data. You need to supply the list of candidate points (for example, the vertex and centroid points computed by the Designs for constrained surface and mixtures option), specify the type of model you expect to fit to the data, and specify the number of runs for the experiment. The procedure will then construct a design with the desired number of cases that will provide as much orthogonality between the columns of the design matrix as possible.

The reasoning behind D- and A-optimality is discussed, for example, in Box and Draper (1987, Chapter 14). The different algorithms used for searching for optimal designs are described in Dykstra (1971), Galil and Kiefer (1980), and Mitchell (1974a, 1974b). A detailed comparison study of the different algorithms is discussed in Cook and Nachtsheim (1980).

Basic Ideas

A technical discussion of the reasoning (and limitations) of D- and A-optimal designs is beyond the scope of this introduction. However, the general ideas are fairly straightforward. Consider again the simple two-factor experiment in four runs.

         x1    x2
Run 1     1     1
Run 2     1     1
Run 3    -1    -1
Run 4    -1    -1

As mentioned above, this design, of course, does not allow one to test, independently, the statistical significance of the two variables' contribution to the prediction of the dependent variable. If you computed the correlation matrix for the two variables, they would correlate at 1:

      x1     x2
x1    1.0    1.0
x2    1.0    1.0

Normally, one would run this experiment so that the two factors are varied independently of each other:

         x1    x2
Run 1     1     1
Run 2     1    -1
Run 3    -1     1
Run 4    -1    -1

Now the two variables are uncorrelated, that is, the correlation matrix for the two factors is:

      x1     x2
x1    1.0    0.0
x2    0.0    1.0

Another term that is customarily used in this context is that the two factors are orthogonal. Technically, if the sum of the products of the elements of two columns (vectors) in the design (design matrix) is equal to 0 (zero), then the two columns are orthogonal.

The determinant of the design matrix. The determinant D of a square matrix (like the 2-by-2 correlation matrices shown above) is a specific numerical value that reflects the amount of independence or redundancy between the columns and rows of the matrix. For the 2-by-2 case, it is simply computed as the product of the diagonal elements minus the product of the off-diagonal elements of the matrix (for larger matrices the computations are more complex). For example, for the two matrices shown above, the determinant D is:

D1 = |1.0  1.0|
     |1.0  1.0|  = 1*1 - 1*1 = 0

D2 = |1.0  0.0|
     |0.0  1.0|  = 1*1 - 0*0 = 1

Thus, the determinant for the first matrix computed from completely redundant factor settings is equal to 0. The determinant for the second matrix, when the factors are orthogonal, is equal to 1.
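The two determinants can be reproduced from the run tables themselves. The sketch below computes the correlation between the two columns of factor settings and then the determinant of the resulting 2-by-2 correlation matrix; the function names are assumptions for illustration.

```python
from statistics import mean


def correlation(u, v):
    """Pearson correlation between two columns of factor settings."""
    mu, mv = mean(u), mean(v)
    cross = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ss_u = sum((a - mu) ** 2 for a in u)
    ss_v = sum((b - mv) ** 2 for b in v)
    return cross / (ss_u * ss_v) ** 0.5


def corr_determinant(x1, x2):
    """Determinant of [[1, r], [r, 1]]: diagonal product minus off-diagonal product."""
    r = correlation(x1, x2)
    return 1.0 * 1.0 - r * r


redundant = ([1, 1, -1, -1], [1, 1, -1, -1])    # identical columns
orthogonal = ([1, 1, -1, -1], [1, -1, 1, -1])   # independently varied columns

print(corr_determinant(*redundant))
print(corr_determinant(*orthogonal))
```

Any partially confounded design falls between these two extremes, with a determinant strictly between 0 and 1.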

D-optimal designs. This basic relationship extends to larger design matrices, that is, the more redundant the vectors (columns) of the design matrix, the closer to 0 (zero) is the determinant of the correlation matrix for those vectors; the more independent the columns, the larger is the determinant of that matrix. Thus, finding a design matrix that maximizes the determinant D of this matrix means finding a design where the factor effects are maximally independent of each other. This criterion for selecting a design is called the D-optimality criterion.

Matrix notation. Actually, the computations are commonly not performed on the correlation matrix of vectors, but on the simple cross-product matrix. In matrix notation, if the design matrix is denoted by X, then the quantity of interest here is the determinant of X'X (X-transposed times X). Thus, the search for D-optimal designs aims to maximize |X'X|, where the vertical lines (|..|) indicate the determinant.
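For a very small candidate list, the design that maximizes |X'X| can be found by trying every subset of the requested size. The sketch below does this for two factors with candidate levels -1, 0, and +1, assuming a model with an intercept and the two main effects. It is an exhaustive search for illustration only, not one of the search algorithms cited above, and all function names are assumptions.

```python
from itertools import combinations, product


def cross_product_matrix(rows):
    """Compute X'X for a design matrix given as a list of rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total


def d_criterion(points):
    """|X'X| for the assumed model: intercept plus one column per factor."""
    X = [[1, *pt] for pt in points]
    return det(cross_product_matrix(X))


def best_subset(candidates, n_runs):
    """Exhaustively pick the n_runs candidate points with the largest |X'X|."""
    return max(combinations(candidates, n_runs), key=d_criterion)


candidates = list(product((-1, 0, 1), repeat=2))   # 9 candidate points
design = best_subset(candidates, 4)
print(design, d_criterion(design))
```

The search selects the four corner points, that is, the orthogonal design discussed earlier. Exhaustive search becomes impractical very quickly as the candidate list grows, which is why exchange-type search algorithms are used in practice.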

A-optimal designs. Looking back at the computations for the determinant, another way to look at the issue of independence is to maximize the diagonal elements of the X'X matrix, while minimizing the off-diagonal elements. The so-called trace criterion or A-optimality criterion expresses this idea. Technically, the A-criterion is defined as:

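$$A = \operatorname{trace}\left((X'X)^{-1}\right)$$

where trace stands for the sum of the diagonal elements of the $(X'X)^{-1}$ matrix. Because the criterion is computed from the inverse matrix, an A-optimal design is one that minimizes this value.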

The information function. It should be mentioned at this point that D-optimal designs minimize the expected prediction error for the dependent variable, that is, those designs will maximize the precision of prediction, and thus the information (which is defined as the inverse of the error) that is extracted from the experimental region of interest.

Measuring Design Efficiency

A number of standard measures have been proposed to summarize the efficiency of a design.

D-efficiency. This measure is related to the D-optimality criterion:

$$\text{D-efficiency} = 100 \cdot \frac{|X'X|^{1/p}}{N}$$

Here, p is the number of factor effects in the design (columns in X), and N is the number of requested runs. This measure can be interpreted as the relative number of runs (in percent) that would be required by an orthogonal design to achieve the same value of the determinant |X'X|. However, remember that an orthogonal design may not be possible in many cases, that is, it is only a theoretical "yard-stick." Therefore, you should use this measure rather as a relative indicator of efficiency, to compare other designs of the same size, and constructed from the same design points candidate list. Also note that this measure is only meaningful (and will only be reported) if you chose to recode the factor settings in the design (i.e., the factor settings for the design points in the candidate list), so that they have a minimum of -1 and a maximum of +1.
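A minimal sketch of the D-efficiency computation, assuming factor settings recoded to the -1 to +1 range and a design matrix that includes a column of 1's for the intercept. The helper names and the two example designs are illustrative assumptions.

```python
def cross_product(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d


def d_efficiency(design):
    """100 * |X'X|**(1/p) / N, with p columns and N runs."""
    n, p = len(design), len(design[0])
    return 100.0 * det(cross_product(design)) ** (1.0 / p) / n


# Columns: intercept, factor 1, factor 2.
orthogonal = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
unbalanced = [[1, -1, -1], [1, -1, -1], [1, 1, -1], [1, 1, 1]]

print(d_efficiency(orthogonal))  # the orthogonal "yard-stick", about 100
print(d_efficiency(unbalanced))  # noticeably lower
```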

A-efficiency. This measure is related to the A-optimality criterion:

$$\text{A-efficiency} = 100 \cdot \frac{p}{\operatorname{trace}\left(N (X'X)^{-1}\right)}$$

Here, p stands for the number of factor effects in the design, N is the number of requested runs, and trace stands for the sum of the diagonal elements of $N (X'X)^{-1}$. This measure can be interpreted as the relative number of runs (in percent) that would be required by an orthogonal design to achieve the same value of the trace of $(X'X)^{-1}$. However, again you should use this measure as a relative indicator of efficiency, to compare other designs of the same size and constructed from the same design points candidate list; also this measure is only meaningful if you chose to recode the factor settings in the design to the -1 to +1 range.
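The A-efficiency can be sketched in the same way. The Gauss-Jordan `inverse` helper and the example designs below are assumptions made for illustration only; any reliable matrix inversion routine would serve.

```python
def cross_product(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def inverse(m):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            raise ValueError("matrix is singular")
        a[i], a[pivot] = a[pivot], a[i]
        piv = a[i][i]
        a[i] = [v / piv for v in a[i]]
        for r in range(n):
            if r != i:
                f = a[r][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [row[n:] for row in a]


def a_efficiency(design):
    """100 * p / trace(N * (X'X)^-1), with p columns and N runs."""
    n, p = len(design), len(design[0])
    inv = inverse(cross_product(design))
    trace = sum(n * inv[i][i] for i in range(p))
    return 100.0 * p / trace


# Columns: intercept, factor 1, factor 2.
orthogonal = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
unbalanced = [[1, -1, -1], [1, -1, -1], [1, 1, -1], [1, 1, 1]]

print(a_efficiency(orthogonal))  # about 100
print(a_efficiency(unbalanced))  # about 60
```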

G-efficiency. This measure is computed as:

$$\text{G-efficiency} = 100 \cdot \frac{\sqrt{p/N}}{\sigma_M}$$

Again, p stands for the number of factor effects in the design and N is the number of requested runs; $\sigma_M$ stands for the maximum standard error for prediction across the list of candidate points. This measure is related to the so-called G-optimality criterion; G-optimal designs are defined as those that will minimize the maximum value of the standard error of the predicted response.
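The following sketch illustrates the G-efficiency computation. It assumes that the standard error of prediction at a candidate point x is taken as the square root of x'(X'X)^-1 x, that is, with the error variance set to 1; the text does not spell out this scaling, so treat it as an assumption. The helper names are hypothetical.

```python
import math


def cross_product(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def inverse(m):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            raise ValueError("matrix is singular")
        a[i], a[pivot] = a[pivot], a[i]
        piv = a[i][i]
        a[i] = [v / piv for v in a[i]]
        for r in range(n):
            if r != i:
                f = a[r][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [row[n:] for row in a]


def prediction_std_error(point, xtx_inv):
    """sqrt(x' (X'X)^-1 x): prediction standard error for unit error variance."""
    p = len(point)
    q = sum(point[i] * xtx_inv[i][j] * point[j] for i in range(p) for j in range(p))
    return math.sqrt(q)


def g_efficiency(design, candidates):
    """100 * sqrt(p / N) / sigma_M, sigma_M taken over the candidate points."""
    n, p = len(design), len(design[0])
    xtx_inv = inverse(cross_product(design))
    sigma_m = max(prediction_std_error(c, xtx_inv) for c in candidates)
    return 100.0 * math.sqrt(p / n) / sigma_m


# Candidate list: all corners of a two-factor region, with an intercept column.
candidates = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
unbalanced = [[1, -1, -1], [1, -1, -1], [1, 1, -1], [1, 1, 1]]

print(g_efficiency(candidates, candidates))  # orthogonal design, about 100
print(g_efficiency(unbalanced, candidates))  # lower: one corner is predicted poorly
```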

Constructing Optimal Designs

The optimal design facilities will "search for" optimal designs, given a list of "candidate points." Put another way, given a list of points that specifies which regions of the design are valid or feasible, and given a user-specified number of runs for the final experiment, it will select points to optimize the respective criterion. This "searching for" the best design is not an exact method, but rather an algorithmic procedure that employs certain search strategies to find the best design (according to the respective optimality criterion).

The search procedures or algorithms that have been proposed are described below (for a review and detailed comparison, see Cook and Nachtsheim, 1980). They are reviewed here in order of speed, from fastest to slowest: the Sequential or Dykstra method is the fastest, but it is also the most likely to fail, that is, to yield a design that is not optimal (e.g., only locally optimal; this issue will be discussed shortly).

Sequential or Dykstra method. This algorithm is due to Dykstra (1971). Starting with an empty design, it will search through the candidate list of points, and choose in each step the one that maximizes the chosen criterion. There are no iterations involved; it will simply pick the requested number of points sequentially. Thus, this method is the fastest of the ones discussed. Also, by default, this method is used to construct the initial designs for the remaining methods.
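A greedy sequential search of this kind might be sketched as follows for the D criterion. This is not Dykstra's published algorithm in detail: the function names are hypothetical, and a small `ridge` constant is added to the diagonal of X'X (an assumption made here) so that the determinant can be compared before the design contains enough points to be non-singular.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d


def ridged_cross_product(rows, p, ridge):
    """X'X plus a small constant on the diagonal (assumed device, see lead-in)."""
    m = [[ridge if i == j else 0.0 for j in range(p)] for i in range(p)]
    for row in rows:
        for i in range(p):
            for j in range(p):
                m[i][j] += row[i] * row[j]
    return m


def sequential_design(candidates, n_runs, ridge=1e-6):
    """Pick n_runs points one at a time, each maximizing the (ridged) determinant."""
    p = len(candidates[0])
    design = []
    for _ in range(n_runs):
        best = max(candidates,
                   key=lambda c: det(ridged_cross_product(design + [c], p, ridge)))
        design.append(best)
    return design


# Candidate list: intercept plus two factors at -1/+1.
candidates = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
design = sequential_design(candidates, 4)
print(design)
```

In this small example the search ends up choosing each corner point once, which is the orthogonal design; in larger, constrained problems a single greedy pass offers no such guarantee.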

Simple exchange (Wynn-Mitchell) method. This algorithm is usually attributed to Mitchell and Miller (1970) and Wynn (1972). The method starts with an initial design of the requested size (by default constructed via the sequential search algorithm described above). In each iteration, one point (run) in the design will be dropped from the design and another added from the list of candidate points. The choice of points to be dropped or added is sequential, that is, at each step the point that contributes least with respect to the chosen optimality criterion (D or A) is dropped from the design; then the algorithm chooses a point from the candidate list so as to optimize the respective criterion. The algorithm stops when no further improvement is achieved with additional exchanges.

DETMAX algorithm (exchange with excursions). This algorithm, due to Mitchell (1974b), is probably the best known and most widely used optimal design search algorithm. Like the simple exchange method, first an initial design is constructed (by default, via the sequential search algorithm described above). The search begins with a simple exchange as described above. However, if the respective criterion (D or A) does not improve, the algorithm will undertake excursions. Specifically, the algorithm will add or subtract more than one point at a time, so that, during the search, the number of points in the design may vary between $N_D + N_{\text{excursion}}$ and $N_D - N_{\text{excursion}}$, where $N_D$ is the requested design size, and $N_{\text{excursion}}$ refers to the maximum allowable excursion, as specified by the user. The iterations will stop when the chosen criterion (D or A) no longer improves within the maximum excursion.

Modified Fedorov (simultaneous switching). This algorithm represents a modification (Cook and Nachtsheim, 1980) of the basic Fedorov algorithm described below. It also begins with an initial design of the requested size (by default constructed via the sequential search algorithm). In each iteration, the algorithm will exchange each point in the design with one chosen from the candidate list, so as to optimize the design according to the chosen criterion (D or A). Unlike the simple exchange algorithm described above, the exchange is not sequential, but simultaneous. Thus, in each iteration each point in the design is compared with each point in the candidate list, and the exchange is made for the pair that optimizes the design. The algorithm terminates when there are no further improvements in the respective optimality criterion.

Fedorov (simultaneous switching). This is the original simultaneous switching method proposed by Fedorov (see Cook and Nachtsheim, 1980). The difference between this procedure and the one described above (modified Fedorov) is that in each iteration only a single exchange is performed, that is, in each iteration all possible pairs of points in the design and those in the candidate list are evaluated. The algorithm will then exchange the pair that optimizes the design (with regard to the chosen criterion). Thus, it is easy to see that this algorithm can be rather slow, since in each iteration $N_D \times N_C$ comparisons are performed (the number of points in the design times the number of points in the candidate list) in order to exchange a single point.
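The single-exchange logic of the basic Fedorov procedure can be sketched as follows for the D criterion. The function name `fedorov_exchange`, the tolerance, and the iteration cap are illustrative assumptions; practical implementations typically update the determinant incrementally rather than recomputing it for every pair.

```python
def cross_product(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d


def fedorov_exchange(initial_design, candidates, max_iter=100, tol=1e-9):
    """In each iteration, evaluate every (design point, candidate) pair and
    perform only the single exchange that most increases |X'X|."""
    design = [row[:] for row in initial_design]
    best = det(cross_product(design))
    for _ in range(max_iter):
        best_swap = None
        for i in range(len(design)):
            for cand in candidates:
                trial = design[:i] + [cand] + design[i + 1:]
                value = det(cross_product(trial))
                if value > best + tol:
                    best, best_swap = value, (i, cand)
        if best_swap is None:
            break  # no exchange improves the criterion
        i, cand = best_swap
        design[i] = cand[:]
    return design, best


candidates = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
start = [[1, -1, -1], [1, -1, -1], [1, -1, 1], [1, 1, -1]]  # one point repeated
final_design, final_det = fedorov_exchange(start, candidates)
print(final_design, final_det)
```

Starting from a design with one repeated point (determinant 32), a single exchange replaces the duplicate with the missing corner and reaches the orthogonal design (determinant 64).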

General Recommendations

If you think about the basic strategies represented by the different algorithms described above, it should be clear that there are usually no exact solutions to the optimal design problem. Specifically, the determinant of the X'X matrix (and the trace of its inverse) are complex functions of the list of candidate points. In particular, there are usually several "local optima" with regard to the chosen optimality criterion; for example, at any point during the search a design may appear optimal unless you simultaneously discard half of the points in the design and choose certain other points from the candidate list; but, if you only exchange individual points or only a few points (via DETMAX), then no improvement occurs.

Therefore, it is important to try a number of different initial designs and algorithms. If after repeating the optimization several times with random starts the same, or very similar, final optimal design results, then you can be reasonably sure that you are not "caught" in a local minimum or maximum.

Also, the methods described above vary greatly with regard to their tendency to get "trapped" in local minima or maxima. As a general rule, the slower the algorithm (i.e., the further down on the list of algorithms described above), the more likely the algorithm is to yield a truly optimal design. However, note that the modified Fedorov algorithm will in practice perform just as well as the unmodified algorithm (see Cook and Nachtsheim, 1980); therefore, if time is not a consideration, we recommend the modified Fedorov algorithm as the best method to use.

D-optimality and A-optimality. For computational reasons (see Galil and Kiefer, 1980), updating the trace of a matrix (for the A-optimality criterion) is much slower than updating the determinant (for D-optimality). Thus, when you choose the A-optimality criterion, the computations may require significantly more time as compared to the D-optimality criterion. Since in practice, there are many other factors that will affect the quality of an experiment (e.g., the measurement reliability for the dependent variable), we generally recommend that you use the D optimality criterion. However, in difficult design situations, for example, when there appear to be many local maxima for the D criterion, and repeated trials yield very different results, you may want to run several optimization trials using the A criterion to learn more about the different types of designs that are possible.

Avoiding Matrix Singularity

It may happen during the search process that the inverse of the X'X matrix cannot be computed (for A-optimality), or that the determinant of the matrix becomes almost 0 (zero). At that point, the search can usually not continue. To avoid this situation, perform the optimization based on an augmented X'X matrix:

$$X'X_{\text{augmented}} = X'X + \alpha \cdot \frac{X_0'X_0}{N_0}$$

where $X_0$ stands for the design matrix constructed from the list of all $N_0$ candidate points, and $\alpha$ (alpha) is a user-defined small constant. Thus, you can turn off this feature by setting $\alpha$ to 0 (zero).
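The effect of the augmentation can be seen in a short sketch. The function name `augmented_cross_product` and the value 0.005 for alpha are illustrative assumptions; the example uses a two-run design, which cannot support a three-column model and is therefore singular.

```python
def cross_product(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d


def augmented_cross_product(design, candidates, alpha):
    """X'X + alpha * (X0'X0 / N0), where X0 holds all N0 candidate points."""
    xtx = cross_product(design)
    x0tx0 = cross_product(candidates)
    n0 = len(candidates)
    p = len(xtx)
    return [[xtx[i][j] + alpha * x0tx0[i][j] / n0 for j in range(p)]
            for i in range(p)]


candidates = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
two_runs = [[1, -1, -1], [1, 1, 1]]  # too few runs for three effects

plain_det = det(cross_product(two_runs))
augmented_det = det(augmented_cross_product(two_runs, candidates, alpha=0.005))
print(plain_det, augmented_det)
```

With alpha set to 0 the augmented matrix reduces to X'X itself, which corresponds to turning the feature off.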

"Repairing" Designs

The optimal design features can be used to "repair" designs. For example, suppose you ran an orthogonal design, but some data were lost (e.g., due to equipment malfunction), and now some effects of interest can no longer be estimated. You could of course make up the lost runs, but suppose you do not have the resources to redo them all. In that case, you can set up the list of candidate points from among all valid points for the experimental region, add to that list all the points that you have already run, and instruct the program to always force those points into the final design (and never to drop them out; you can mark points in the candidate list for such forced inclusion). The program will then only consider excluding those points from the design that you did not actually run. In this manner you can, for example, find the best single run to add to an existing experiment so as to optimize the respective criterion.

Constrained Experimental Regions and Optimal Design

A typical application of the optimal design features is to situations when the experimental region of interest is constrained. As described earlier in this section, there are facilities for finding vertex and centroid points for linearly constrained regions and mixtures. Those points can then be submitted as the candidate list for constructing an optimal design of a particular size for a particular model. Thus, these two facilities combined provide a very powerful tool to cope with the difficult design situation when the design region of interest is subject to complex constraints, and one wants to fit particular models with the least number of runs.


Special Topics

The following sections introduce several analysis techniques. The sections describe Response/desirability profiling, conducting Residual analyses, and performing Box-Cox transformations of the dependent variable.

See also ANOVA/MANOVA, Methods for Analysis of Variance, and Variance Components and Mixed Model ANOVA/ANCOVA.

Profiling Predicted Responses and Response Desirability

Basic Idea. A typical problem in product development is to find a set of conditions, or levels of the input variables, that produces the most desirable product in terms of its characteristics, or responses on the output variables. The procedures used to solve this problem generally involve two steps: (1) predicting responses on the dependent, or Y variables, by fitting the observed responses using an equation based on the levels of the independent, or X variables, and (2) finding the levels of the X variables which simultaneously produce the most desirable predicted responses on the Y variables. Derringer and Suich (1980) give, as an example of these procedures, the problem of finding the most desirable tire tread compound. There are a number of Y variables, such as PICO Abrasion Index, 200 percent modulus, elongation at break, and hardness. The characteristics of the product in terms of the response variables depend on the ingredients, the X variables, such as hydrated silica level, silane coupling agent level, and sulfur. The problem is to select the levels for the X's which will maximize the desirability of the responses on the Y's. The solution must take into account the fact that the levels for the X's that maximize one response may not maximize a different response.

When analyzing 2**(k-p) (two-level factorial) designs, 2-level screening designs, 2**(k-p) maximally unconfounded and minimum aberration designs, 3**(k-p) and Box Behnken designs, Mixed 2 and 3 level designs, central composite designs, and mixture designs, Response/desirability profiling allows you to inspect the response surface produced by fitting the observed responses using an equation based on levels of the independent variables.

Prediction Profiles. When you analyze the results of any of the designs listed above, a separate prediction equation for each dependent variable (containing different coefficients but the same terms) is fitted to the observed responses on the respective dependent variable. Once these equations are constructed, predicted values for the dependent variables can be computed at any combination of levels of the predictor variables. A prediction profile for a dependent variable consists of a series of graphs, one for each independent variable, of the predicted values for the dependent variable at different levels of one independent variable, holding the levels of the other independent variables constant at specified values, called current values. If appropriate current values for the independent variables have been selected, inspecting the prediction profile can show which levels of the predictor variables produce the most desirable predicted response on the dependent variable.

One might be interested in inspecting the predicted values for the dependent variables only at the actual levels at which the independent variables were set during the experiment. Alternatively, one also might be interested in inspecting the predicted values for the dependent variables at levels other than the actual levels of the independent variables used during the experiment, to see if there might be intermediate levels of the independent variables that could produce even more desirable responses. Also, returning to the Derringer and Suich (1980) example, for some response variables, the most desirable values may not necessarily be the most extreme values, for example, the most desirable value of elongation may fall within a narrow range of the possible values.

Response Desirability. Different dependent variables might have different kinds of relationships between scores on the variable and the desirability of the scores. Less filling beer may be more desirable, but better tasting beer can also be more desirable--lower "fillingness" scores and higher "taste" scores are both more desirable. The relationship between predicted responses on a dependent variable and the desirability of responses is called the desirability function. Derringer and Suich (1980) developed a procedure for specifying the relationship between predicted responses on a dependent variable and the desirability of the responses, a procedure that provides for up to three "inflection" points in the function. Returning to the tire tread compound example described above, their procedure involved transforming scores on each of the four tire tread compound outcome variables into desirability scores that could range from 0.0 for undesirable to 1.0 for very desirable. For example, their desirability function for hardness of the tire tread compound was defined by assigning a desirability value of 0.0 to hardness scores below 60 or above 75, a desirability value of 1.0 to mid-point hardness scores of 67.5, a desirability value that increased linearly from 0.0 up to 1.0 for hardness scores between 60 and 67.5 and a desirability value that decreased linearly from 1.0 down to 0.0 for hardness scores between 67.5 and 75.0. More generally, they suggested that procedures for defining desirability functions should accommodate curvature in the "falloff" of desirability between inflection points in the functions.
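The hardness example can be written as a small piecewise-linear function. The function name `hardness_desirability` and its parameter names are assumptions made here; the breakpoints 60, 67.5, and 75 come from the example above, and the sketch does not attempt the curved "falloff" that Derringer and Suich also allow.

```python
def hardness_desirability(hardness, low=60.0, target=67.5, high=75.0):
    """Desirability in [0, 1]: 0 outside (low, high), 1 at target, linear in between."""
    if hardness <= low or hardness >= high:
        return 0.0
    if hardness <= target:
        return (hardness - low) / (target - low)
    return (high - hardness) / (high - target)


for score in (55.0, 63.75, 67.5, 71.25, 80.0):
    print(score, hardness_desirability(score))
```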

After transforming the predicted values of the dependent variables at different combinations of levels of the predictor variables into individual desirability scores, the overall desirability of the outcomes at different combinations of levels of the predictor variables can be computed. Derringer and Suich (1980) suggested that overall desirability be computed as the geometric mean of the individual desirabilities. This makes intuitive sense: the geometric mean takes the product of all of the values and raises the product to the power of the reciprocal of the number of values, so if the individual desirability of any outcome is 0.0, or unacceptable, the overall desirability will be 0.0, or unacceptable, no matter how desirable the other individual outcomes are. Derringer and Suich's procedure provides a straightforward way of transforming predicted values for multiple dependent variables into a single overall desirability score. The problem of simultaneous optimization of several response variables then boils down to selecting the levels of the predictor variables that maximize the overall desirability of the responses on the dependent variables.
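The geometric mean of the individual desirabilities takes only a line or two to compute. The function name `overall_desirability` and the example scores are hypothetical.

```python
import math


def overall_desirability(desirabilities):
    """Geometric mean: product of the values raised to the power 1 / (number of values)."""
    return math.prod(desirabilities) ** (1.0 / len(desirabilities))


# Four individual desirability scores, e.g. one per response variable.
print(overall_desirability([0.9, 0.8, 0.7, 0.95]))
# A single unacceptable response makes the whole outcome unacceptable.
print(overall_desirability([0.9, 0.8, 0.0, 0.95]))
```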

Summary. When one is developing a product whose characteristics are known to depend on the "ingredients" of which it is constituted, producing the best product possible requires determining the effects of the ingredients on each characteristic of the product, and then finding the balance of ingredients that optimizes the overall desirability of the product. In data analytic terms, the procedure that is followed to maximize product desirability is to (1) find adequate models (i.e., prediction equations) to predict characteristics of the product as a function of the levels of the independent variables, and (2) determine the optimum levels of the independent variables for overall product quality. These two steps, if followed faithfully, will likely lead to greater success in product improvement than the fabled, but statistically dubious technique of hoping for accidental breakthroughs and discoveries that radically improve product quality.

Residuals Analysis

Basic Idea. Extended residuals analysis is a collection of methods for inspecting different residual and predicted values, and thus for examining the adequacy of the prediction model, the need for transformations of the variables in the model, and the existence of outliers in the data.

Residuals are the deviations of the observed values on the dependent variable from the predicted values, given the current model. The ANOVA models used in analyzing responses on the dependent variable make certain assumptions about the distributions of residual (but not predicted) values on the dependent variable. These assumptions can be summarized by saying that the ANOVA model assumes normality, linearity, homogeneity of variances and covariances, and independence of residuals. All of these properties of the residuals for a dependent variable can be inspected using Residuals analysis.

Box-Cox Transformations of Dependent Variables

Basic Idea. It is assumed in analysis of variance that the variances in the different groups (experimental conditions) are homogeneous, and that they are uncorrelated with the means. If the distribution of values within each experimental condition is skewed, and the means are correlated with the standard deviations, then one can often apply an appropriate power transformation to the dependent variable to stabilize the variances, and to reduce or eliminate the correlation between the means and standard deviations. The Box-Cox transformation is useful for selecting an appropriate (power) transformation of the dependent variable.

Selecting the Box-Cox transformation option will produce a plot of the Residual Sum of Squares, given the model, as a function of the value of lambda, where lambda is used to define a transformation of the dependent variable,

$$y' = \frac{y^{\lambda} - 1}{\lambda \, g^{\lambda - 1}} \quad \text{if } \lambda \neq 0$$

$$y' = g \, \ln(y) \quad \text{if } \lambda = 0$$

in which g is the geometric mean of the dependent variable and all values of the dependent variable are positive (the logarithm and the negative powers are undefined at zero). The value of lambda for which the Residual Sum of Squares is a minimum is the maximum likelihood estimate for this parameter. It produces the variance stabilizing transformation of the dependent variable that reduces or eliminates the correlation between the group means and standard deviations.
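The transformation itself can be sketched as follows, assuming strictly positive values of the dependent variable. The function names `geometric_mean` and `box_cox` are illustrative. The sketch only transforms the data; fitting the model and plotting the Residual Sum of Squares against lambda is left out.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def box_cox(values, lam):
    """Scaled Box-Cox transformation, with g the geometric mean of the values."""
    g = geometric_mean(values)
    if lam == 0:
        return [g * math.log(v) for v in values]
    return [(v ** lam - 1.0) / (lam * g ** (lam - 1.0)) for v in values]


y = [1.0, 2.0, 4.0]
print(box_cox(y, 1))    # lambda = 1 only shifts the values by 1
print(box_cox(y, 0))    # lambda = 0 gives g * ln(y)
print(box_cox(y, 0.5))  # square-root type transformation
```

Because of the scaling by the geometric mean, the residual sums of squares obtained for different values of lambda are on a comparable scale, which is what makes the plot against lambda meaningful.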

In practice, it is not important that you use the exact estimated value of lambda for transforming the dependent variable. Rather, as a rule of thumb, one should consider the following transformations:

| Approximate lambda | Suggested transformation of y |
|---|---|
| -1 | Reciprocal |
| -0.5 | Reciprocal square root |
| 0 | Natural logarithm |
| 0.5 | Square root |
| 1 | None |

## Monday, 13 August 2007
