Descriptive statistics are used to describe distributions. Three measures of central tendency describe the middle of the distribution: mean, median, and mode. The term
“average” in statistics is typically defined as a synonym of the mean. Occasionally, this term is used to refer to the other measures of central tendency (the median and mode).
When you read a newspaper article, it may say that the average family in a community has two children (this is probably the median, but it could be the mode) or that the average income in the community is $55,218 (also probably the median, but it could be the mean). The article might say that the average SAT score at your university is 1840 (probably the mean). It might say that the average person in a community has a high school diploma (this could be the mean, median, or mode). It is important to know when each central tendency measurement is appropriate.
The mode is the value or the category that occurs most often. We might say that the mode for political party in a parliament is the Labor Party. This would be the mode if there were more members of the parliament who were in the Labor Party than members of any other party. If we said the mode was 17 for age of high school seniors, this means that there are more high school seniors who are 17 than there are seniors who are any other age. This would be a reasonable measure of central tendency because most high school seniors are 17 years old. The mode represents the average in the sense of being the most typical value or category. If there is not a category or value that characterizes a distribution clearly, the mode is not descriptive of the central tendency of a distribution. For example, the heights of eighth-grade class members would not have a descriptive mode because each adolescent might be a different height, and there is no single height that is typical of eighth graders.
When you have unordered categorical variables, such as gender, marital status, or race/ethnicity, the mode is the only measure of central tendency. Even here, the mode is helpful only if one category is much more common than the others. If 79% of the adults in a community are married, saying that the modal marital status is married is a fair description of the typical member of the community. However, if 52% of adults in a community are female and 48% are male, it does not make much sense to say that the modal gender is female because there are nearly as many men as there are women.
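As a minimal illustration (in Python rather than Stata, and using hypothetical samples constructed only to match the percentages quoted above), the mode and its share of the observations can be computed as follows. The function name `modal_category` is an assumption made for this sketch.

```python
from collections import Counter


def modal_category(values):
    """Return the most common category and its share of all observations."""
    counts = Counter(values)
    category, count = counts.most_common(1)[0]
    return category, count / len(values)


# Hypothetical samples of 100 adults, built to match the percentages in the text.
marital = ["married"] * 79 + ["not married"] * 21
gender = ["female"] * 52 + ["male"] * 48

print(modal_category(marital))  # ('married', 0.79)
print(modal_category(gender))   # ('female', 0.52)
```

Reporting the share alongside the mode makes it easy to see when the modal category is typical (79%) and when it is barely more common than the alternative (52%).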
The median is the value or the category that divides a distribution into two equal parts.
Half the observations will have a higher value and half will have a lower value than the median. The median can be applied to categories that are ordered (political liberalism, religiosity, job satisfaction) or to quantitative variables (age, education, income). If we said that the median household income of a community is $55,218, we mean that half the households in that community have an income more than $55,218 and half the households have an income less than $55,218. When we used the summarize, detail command in chapter 4, we saw that Stata refers to the median as the 50th percentile.
5.2 Where is the center of a distribution?

The median is not influenced by extreme cases. If Bill Gates moved to this community, his multibillion dollar income would not influence the median. He would simply be in the half of the distribution that made more than the median income. Because of this property, the median is sometimes used with quantitative variables that are skewed (a distribution is skewed if it trails off in one direction or the other). Income trails off at the high end because relatively few people have huge incomes.
The median is occasionally used with variables that are ordered categories. When there are relatively few ordered categories, there may not be a category that has exactly half the cases above it and half below it. You might ask people about their marital satisfaction and give them response options of a) very dissatisfied, b) somewhat dissatisfied, c) neither satisfied nor dissatisfied, d) somewhat satisfied, and e) very satisfied.
Because we usually code numbers rather than letters, we might code very dissatisfied as 1, somewhat dissatisfied as 2, neither satisfied nor dissatisfied as 3, somewhat satisfied as 4, and very satisfied as 5. The median satisfaction for men might be in the category we coded with a 4, somewhat satisfied. The median satisfaction for women might be 3, neither satisfied nor dissatisfied.
More often, researchers compute the mean for variables like this, and the mean for men might be 4.21 compared with 3.74 for women. These values indicate that men are, on average, a little above the somewhat-satisfied level and women are a little bit below the somewhat-satisfied level.
The mean is what lay people usually think of when they hear the word “average”.
It is the value every case would be, if every case had the same value. It is a fulcrum point that considers both the number of cases above and below it and how far they are above or below it. Although Bill Gates would scarcely change the median income of a community, his moving to a small town would raise the mean by a lot. Some people use M (recommended by the American Psychological Association) to represent the mean, and others use X̄ (recommended by most statisticians). The formula for the mean is
$$\bar{X} = \frac{\sum X}{n}$$
In plain English, this says the mean is the sum of all the values, ΣX (pronounced sigma X or sum of X), divided by the number of observations, n. For example, if you had five college women who weighed 120, 110, 160, 140, and 210 pounds, respectively, the mean would be
$$\bar{X} = \frac{120 + 110 + 160 + 140 + 210}{5} = 148$$
From now on, we will use M instead of X̄ to represent the mean.
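The arithmetic above can be checked with a short Python sketch (Python is used here only for illustration; it is not the Stata workflow of this chapter). The second half uses hypothetical household incomes, invented for this example, to show how one extreme case pulls the mean far more than the median.

```python
import statistics

# Weights in pounds, taken from the worked example in the text.
weights = [120, 110, 160, 140, 210]
mean_weight = sum(weights) / len(weights)
print(mean_weight)  # 148.0

# Hypothetical incomes for a very small town (invented numbers).
incomes = [30_000, 40_000, 55_000, 70_000, 90_000]
mean_before = sum(incomes) / len(incomes)
median_before = statistics.median(incomes)
print(mean_before, median_before)  # 57000.0 55000

# Add one hypothetical household with an extremely large income.
incomes_after = incomes + [1_000_000_000]
mean_after = sum(incomes_after) / len(incomes_after)
median_after = statistics.median(incomes_after)

# The mean jumps to roughly 166.7 million; the median moves only to 62500.0.
print(mean_after, median_after)
```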
What measure of central tendency should you use? This decision depends on the level of measurement you have, how your variable is distributed, and what you are trying to show (see table 5.1).
Chapter 5 Descriptive statistics and graphs for one variable

Table 5.1. Level of measurement and choice of average

Level of measurement                                   Mode   Median   Mean
Categorical, no order (nominal, e.g., gender)          Yes    No       No
Categorical, ordered (ordinal, e.g., social support)   Yes    Yes      Yes*
Quantitative (interval or ratio, e.g., age)            Yes    Yes      Yes

*Many researchers use the mean when there are several ordered categories.
• When you have categories with no order (gender, religion), you can use only the mode. The mode for religion in Saudi Arabia, for example, is “Muslim”.
Unordered categorical variables are called nominal-level variables.
• When you have ordered categories (religiosity, marital satisfaction), the median is often recommended. Such variables are often labeled as ordinal measures. You might read that the median religiosity response in Chicago is “somewhat religious”.
Ordered categories can be ordered along some dimension, such as low to high or negative to positive. When there are several categories, many researchers treat them as quantitative variables and use the mean. If religiosity has seven ordinal categories from 1 for not religious at all to 7 for extremely religious, you might use the mean by treating these numbers from 1 to 7 as if they are an interval-level measure. You might say that the mean is 3.4, for example.
• When you have quantitative data (meaningful numbers), you can use the mean, median, or mode. Quantitative data are often called interval-level variables. You will usually use the mean. If, however, the variable is extremely skewed, you would use the median.
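The guidelines in the list above can be summarized as a small decision rule. The Python sketch below is a simplification for illustration only: the function name `suggest_average`, its parameters, and the level labels are hypothetical, and the choice in practice also depends on what you are trying to show.

```python
def suggest_average(level, several_categories=False, extremely_skewed=False):
    """Suggest a measure of central tendency from the guidelines above.

    level: "nominal", "ordinal", or "quantitative" (labels assumed for this sketch).
    several_categories: for ordinal variables, whether there are enough
        categories that many researchers would treat them as quantitative.
    extremely_skewed: for quantitative variables, whether the distribution
        trails off strongly in one direction.
    """
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "mean" if several_categories else "median"
    if level == "quantitative":
        return "median" if extremely_skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level!r}")


print(suggest_average("nominal"))                              # mode
print(suggest_average("ordinal"))                              # median
print(suggest_average("ordinal", several_categories=True))     # mean
print(suggest_average("quantitative", extremely_skewed=True))  # median
```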
Suppose that we want an average value for the number of children in households that have at least one child. The distribution is highly skewed in a positive direction because it trails off on the positive tail (see figure 5.1).
[Histogram: "Number of Children in Families with at Least One Child". Horizontal axis: number of children (1 to 9). Vertical axis: frequency (0 to 800). Data: descriptive gss.dta]
Figure 5.1. How many children do families have?
In this distribution, the mode is 2, the median (Mdn) is 2, and the mean (M) is 2.5.
Notice how the small number of families with a lot of children drew the mean toward the tail but did not influence either the mode or the median. When a distribution is skewed, the mean will be bigger or smaller than the median, depending on the direction the distribution trails off.
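A small Python sketch can reproduce this pattern. The frequency counts below are hypothetical, invented to give a positively skewed shape similar to figure 5.1; they are not the actual counts from the dataset.

```python
import statistics

# Hypothetical counts of families by number of children (invented for
# illustration; NOT the data behind figure 5.1).
counts = {1: 400, 2: 800, 3: 450, 4: 200, 5: 80, 6: 40, 7: 15, 8: 10, 9: 5}

# Expand the frequency table into one observation per family.
children = [k for k, freq in counts.items() for _ in range(freq)]

mode_children = statistics.mode(children)
median_children = statistics.median(children)
mean_children = sum(children) / len(children)

print(mode_children)    # 2
print(median_children)  # 2.0
print(mean_children)    # 2.51
```

With these assumed counts, the few large families pull the mean above the median, while the mode and median both stay at 2.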
There are two specialized averages you could use. The harmonic mean is useful when you want to average rates. Suppose that you go on a 60-mile bicycle ride where the first half of the course is a major hill climb and the second half of the course is a major descent. You might average just 5 miles per hour for the first half and then average 25 miles per hour for the second half of the ride. If we call your rate a, the harmonic mean (H) is
$$H = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}} = \frac{2}{\frac{1}{5} + \frac{1}{25}} = 8.33$$
This harmonic mean of H = 8.33 miles per hour is a much better estimate of your average speed for the 60-mile ride than the arithmetic mean, which would be (5+25)/2 = 15 miles per hour.
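To see why the harmonic mean is the right average here, the Python sketch below computes it from the formula and compares it with total distance divided by total time, assuming the two halves are 30 miles each. The function name `harmonic_mean` is defined for this sketch only.

```python
def harmonic_mean(rates):
    """Harmonic mean: n divided by the sum of the reciprocals of the rates."""
    return len(rates) / sum(1 / a for a in rates)


speeds = [5, 25]  # miles per hour on each half of the ride
h = harmonic_mean(speeds)

# Check against the definition of average speed: distance / time,
# assuming each half of the 60-mile course is 30 miles long.
hours = 30 / 5 + 30 / 25        # 6 hours uphill + 1.2 hours downhill
overall_speed = 60 / hours

print(round(h, 2))              # 8.33
print(round(overall_speed, 2))  # 8.33
print(sum(speeds) / len(speeds))  # 15.0, the arithmetic mean
```

The rider spends far more time at 5 miles per hour than at 25, which is why the overall speed sits much closer to 5 than the arithmetic mean suggests.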
The geometric mean is useful when you have a growth process where the growth is at a constant rate. This happens with population size or annual income, such as having your income grow at a rate of 3% per year. If you made $52,500 in 2000 and made $73,500 in 2010, what did you make in 2005? The arithmetic mean, (52500 + 73500)/2 = 63000, exaggerates your income in 2005. The geometric mean (G) is
$$G = \sqrt[n]{a_1 \times a_2 \times \cdots \times a_n} = \sqrt{52500 \times 73500} = 62118.84$$
where G = $62,118.84 is a much better estimate of your 2005 income.
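The Python sketch below checks this result in two ways: directly from the formula, and by assuming income grew by the same factor every year from 2000 to 2010 and then compounding that factor for five years. The function name `geometric_mean` is defined for this sketch only.

```python
import math


def geometric_mean(values):
    """Geometric mean: the n-th root of the product of n values."""
    return math.prod(values) ** (1 / len(values))


g = geometric_mean([52500, 73500])
print(round(g, 2))  # 62118.84

# Cross-check, assuming a constant yearly growth factor over the 10 years.
yearly_factor = (73500 / 52500) ** (1 / 10)
income_2005 = 52500 * yearly_factor ** 5
print(round(income_2005, 2))  # 62118.84
```

Both routes agree because, under constant growth, the midpoint value is the starting value multiplied by the square root of the total growth factor, which is exactly the geometric mean of the two endpoints.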
The Stata command ameans varlist computes the arithmetic mean, the geometric mean, and the harmonic mean.