DECISION MAKING
Chapter 2: Describing the Distribution of a Single Variable
Introduction
(slide 1 of 2)
The goal is to present data in a form that
makes sense to people. Tools that are used
to do this include:
Graphs: bar charts, pie charts, histograms,
scatterplots, time series graphs
Numerical summary measures: counts,
percentages, averages, measures of variability
Tables of summary measures: totals, averages, counts, grouped by categories
It is a challenge to summarize data so that the important information stands out clearly.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction
(slide 2 of 2)
There are four steps in data analysis:
1. Recognize a problem that needs to be solved.
2. Gather data to help understand and then solve the problem.
3. Analyze the data.
4. Act on this analysis.
It is up to you to ask good questions—
and then take advantage of the most
appropriate tools to answer them.
Populations and Samples
A population includes all of the entities of
interest in a study (people, households,
machines, etc.).
Examples:
All potential voters in a presidential election
All subscribers to cable television
All invoices submitted for Medicare reimbursement
by nursing homes
A sample is a subset of the population, often
randomly chosen and preferably
representative of the population as a whole.
Examples: Gallup, Harris, other polls today
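Drawing a random sample from a population can be sketched in a few lines of Python. This is only an illustration: the population of subscriber IDs, the sample size, and the fixed seed are all assumptions made for the example, not part of the original slides.

```python
import random

# Hypothetical population: ID numbers for 10,000 cable subscribers.
population = list(range(1, 10_001))

# A fixed seed makes the sketch reproducible; a real survey would not fix it.
rng = random.Random(42)

# Simple random sample without replacement: every subscriber is equally
# likely to be chosen, which is what makes the sample representative on average.
sample = rng.sample(population, k=100)

print(len(sample), sample[:5])
```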
Data Sets, Variables, and Observations
A data set is usually a rectangular array
of data, with variables in columns and
observations in rows.
Example 2.1:
Questionnaire Data.xlsx
Objective: To illustrate variables and observations in a
typical data set.
Solution: Data set includes observations on 30 people who
responded to a questionnaire on the president’s
environmental policies.
Variables include: age, gender, state, children, salary, opinion.
Include a row that lists variable names.
Include a column that shows an index of the observation.
Types of Data
(slide 1 of 5)
A variable is numerical if meaningful arithmetic can be performed on it.
Otherwise, the variable is categorical.
There is also a third data type, a date variable.
Excel® stores dates as numbers, but dates are treated differently from typical numbers.
A categorical variable is ordinal if there is a natural ordering of its possible values.
If there is no natural ordering, it is nominal.
Types of Data
(slide 2 of 5)
A dummy variable is a 0–1 coded variable for a specific category.
It is coded as 1 for all observations in that category and 0 for all observations not in that category.
Categorizing a numerical variable by putting the data into discrete categories (called bins) is called binning or discretizing.
A variable that has been categorized in this way is called a binned or discretized variable.
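Dummy coding and binning are both simple recoding steps. The Python sketch below shows one possible version; the data values, the bin names, and the age cutoffs (35 and 60) are assumptions chosen for illustration, not values taken from the slides.

```python
# Hypothetical responses; the values are invented for illustration only.
genders = ["Male", "Female", "Female", "Male", "Female"]
ages = [23, 35, 47, 61, 58]

# Dummy variable: 1 if the observation is in the category, 0 otherwise.
female_dummy = [1 if g == "Female" else 0 for g in genders]


def bin_age(age):
    """Assign an age to one of three bins (the cutoffs are assumptions)."""
    if age < 35:
        return "Young"
    if age < 60:
        return "Middle-aged"
    return "Elderly"


# Binned (discretized) version of the numerical Age variable.
age_binned = [bin_age(a) for a in ages]

print(female_dummy)
print(age_binned)
```

Note that the binned variable is ordinal (Young, Middle-aged, Elderly have a natural order), while the dummy variable can be summed or averaged like any number.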
Environmental Data
Using a Different Coding (slide 3 of 5)
Types of Data
(slide 4 of 5)
A continuous variable is the result of an
essentially continuous measurement, such
as weight or height.
Cross-sectional data are data on a cross section of a population at a distinct point in time.
Time series data are data collected over time.
Typical Time Series Data Set
(slide 5 of 5)
Descriptive Measures for
Categorical Variables
There are only a few possibilities for
describing a categorical variable, all
based on counting:
Count the number of categories.
Give the categories names.
Count the number of observations in each category (referred to as the count of observations in the category).
Once you have the counts, you can display
them graphically, usually in a column chart or a pie chart.
Example 2.2:
Supermarket Transactions.xlsx (slide 1 of 3)
Objective: To summarize categorical variables in a
large data set.
Solution: Data set contains transactions made by
supermarket customers over a two-year period.
Children, Units Sold, and Revenue are numerical.
Purchase Date is a date variable.
Transaction and Customer ID are used only for identification.
All of the other variables are categorical.
Example 2.2:
Supermarket Transactions.xlsx (slide 2 of 3)
To get the counts in column S, use Excel’s COUNTIF function.
To get the percentages in column T, divide each count by the total number of observations.
When creating charts, be careful to use appropriate scales.
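For readers working outside Excel, the count-and-percentage calculation above can be sketched in Python. The category values here are hypothetical and do not come from the supermarket file; `Counter` simply plays the role that COUNTIF plays in the worksheet.

```python
from collections import Counter

# Hypothetical values of one categorical variable (not the real data file).
gender = ["F", "M", "F", "F", "M", "F", "M", "F"]

# Count per category, analogous to one COUNTIF formula per category.
counts = Counter(gender)

# Percentage = count divided by the total number of observations.
total = len(gender)
percentages = {category: n / total for category, n in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, percent={percentages[category]:.1%}")
```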
Example 2.2:
Supermarket Transactions.xlsx (slide 3 of 3)
Another efficient way to find counts for a categorical variable is
to use dummy (0–1) variables.
Recode each variable so that one category is replaced by 1 and all others by 0.
This can be done using a simple IF formula.
Find the count of that category by summing the 0s and 1s.
Find the percentage of that category by averaging the 0s and 1s.
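The dummy-variable approach can be sketched as follows. The marital-status values are made up for illustration; the list comprehension stands in for the IF formula mentioned above.

```python
from statistics import mean

# Hypothetical marital-status values ("M" = married, "S" = single).
marital_status = ["M", "S", "M", "M", "S", "S", "M", "M"]

# Same idea as an Excel IF formula: 1 for the category of interest, else 0.
married_dummy = [1 if status == "M" else 0 for status in marital_status]

married_count = sum(married_dummy)   # summing the 0s and 1s gives the count
married_share = mean(married_dummy)  # averaging them gives the proportion

print(married_count, married_share)
```

The average works as a proportion because each 1 contributes exactly one unit to the sum, so the sum divided by the number of observations is the fraction of observations in the category.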
Descriptive Measures for
Numerical Variables
There are many ways to summarize numerical
variables, both with numerical summary
measures and with charts.
To learn how the values of a variable are
distributed, ask:
What are the most “typical” values?
How spread out are the values?
What are the “extreme” values on either end?
Is the chart of the values symmetric about some middle value, or is it skewed in some direction? Does it have any other peculiar features besides possible skewness?
Example 2.3:
Objective: To learn how salaries are distributed across all 2011 MLB
players.
Solution: Data set contains data on 843 Major League Baseball
players in the 2011 season.
Variables are player’s name, team, position, and salary.
Create summary measures of baseball salaries using Excel functions.
Example 2.3:
Measures of Central Tendency (slide 1 of 3)
The mean is the average of all values.
If the data set represents a sample from some larger population, this measure is called the sample mean and is denoted by X̄.
If the data set represents the entire population, it is called the population mean and is denoted by μ.
In Excel, the mean can be calculated with the
AVERAGE function.
Measures of Central Tendency (slide 2 of 3)
The median is the middle observation when the data are sorted from smallest
to largest.
If the number of observations is odd, the
median is literally the middle observation.
If the number of observations is even, the median is usually defined as the average of the two middle observations.
In Excel, the median can be calculated
with the MEDIAN function.
Measures of Central Tendency (slide 3 of 3)
The mode is the value that appears
most often.
In most cases where a variable is
essentially continuous, the mode is not
very interesting because it is often the
result of a few lucky ties.
However, it is not always a result of luck
and may reveal interesting information.
In Excel, the mode can be calculated
with the MODE function.
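The three measures of central tendency can be compared on a small example. The sketch below uses Python's standard `statistics` module; the salary figures are hypothetical and were chosen so that one large value pulls the mean well above the median.

```python
import statistics

# Hypothetical salaries in $1000s; the single large value skews the data right.
salaries = [400, 400, 450, 500, 750, 1200, 9000]

mean_salary = statistics.mean(salaries)      # counterpart of Excel's AVERAGE
median_salary = statistics.median(salaries)  # counterpart of Excel's MEDIAN
mode_salary = statistics.mode(salaries)      # counterpart of Excel's MODE

# With an even number of observations, the median is the average of the
# two middle values (here 450 and 500).
median_even = statistics.median([400, 450, 500, 750])

print(mean_salary, median_salary, mode_salary, median_even)
```

In this example the mean is more than three times the median, which is typical of strongly right-skewed data such as salaries.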
Minimum, Maximum,
Percentiles, and Quartiles
For any percentage p, the pth percentile is the value
such that a percentage p of all values are less than it.
The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.
The first, second and third quartiles are the percentiles
corresponding to p = 25%, p = 50%,
and p = 75%.
By definition, the second quartile (p = 50%) is equal to
the median.
The minimum and maximum can be calculated with Excel’s MIN and MAX functions, and the percentiles and quartiles with Excel’s PERCENTILE and QUARTILE functions.
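A minimal Python sketch of these measures follows. It assumes that the "inclusive" interpolation method in Python's `statistics.quantiles` is a reasonable stand-in for Excel's PERCENTILE and QUARTILE; software packages differ slightly in how they interpolate, so small discrepancies are possible. The data values are hypothetical.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

minimum, maximum = min(data), max(data)

# Three cut points that split the data into four groups. The "inclusive"
# method interpolates between sorted values and treats the data's own
# minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# With n=100 there are 99 cut points; index p - 1 holds the p-th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]

print(minimum, q1, q2, q3, maximum, p90)
```

The second quartile equals the median by definition, which gives a quick sanity check on any quartile calculation.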
Measures of Variability
(slide 1 of 3)
The range is the maximum value minus the
minimum value.
The interquartile range ( IQR ) is the third
quartile minus the first quartile.
Thus, it is the range of the middle 50% of the data.
It is less sensitive to extreme values than the
range.
The variance is essentially the average of the squared deviations from the mean.
If $X_i$ is a typical observation, its squared deviation from the mean is $(X_i - \text{mean})^2$.
Measures of Variability
(slide 2 of 3)
If at least a few of the observations are far from the
mean, their squared deviations from the mean—and the variance—will be large.
In Excel, use the VAR function to obtain the sample variance and the VARP function to obtain the
population variance.
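For reference, the standard definitions of the two variances can be written as follows. The slide's own formulas did not survive extraction, so the notation here is an assumption: $n$ is the number of observations, $\bar{X}$ is the sample mean, and $\mu$ is the population mean.

```latex
% Sample variance: divide by n - 1 (what Excel's VAR computes)
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

% Population variance: divide by n (what Excel's VARP computes)
\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}
```

The only difference is the denominator, so for large data sets the two values are nearly identical.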
Measures of Variability
(slide 3 of 3)
A fundamental problem with variance is that it is in squared units (e.g., dollars become squared dollars).
A more natural measure is the standard deviation, which is the square root of variance.
The sample standard deviation, denoted by s, is the square root of the sample variance.
The population standard deviation, denoted by σ, is the square root of the population variance.
In Excel, use the STDEV function to find the sample standard deviation or the STDEVP function to find
the population standard deviation.
Calculating Variance and
Standard Deviation
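The worksheet that accompanied this slide is not recoverable, so the following Python sketch works through the same kind of calculation on a small hypothetical data set. It computes the range, the interquartile range, both variances, and both standard deviations step by step; the "inclusive" quartile method is assumed to approximate Excel's QUARTILE.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Squared deviations from the mean are the building blocks of variance.
mean = statistics.mean(data)
squared_devs = [(x - mean) ** 2 for x in data]

sample_var = sum(squared_devs) / (len(data) - 1)  # counterpart of VAR
pop_var = sum(squared_devs) / len(data)           # counterpart of VARP
sample_sd = sample_var ** 0.5                     # counterpart of STDEV
pop_sd = pop_var ** 0.5                           # counterpart of STDEVP

print(data_range, iqr, sample_var, pop_var, sample_sd, pop_sd)
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data, which is why it is usually the measure reported.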
Empirical Rules for Interpreting Standard Deviation (slide 1 of 3)
The interpretation of the standard
deviation can be stated as three empirical rules.
If the values of a variable are approximately
normally distributed (symmetric and
bell-shaped), then the following rules hold:
Approximately 68% of the observations are
within one standard deviation of the mean.
Approximately 95% of the observations are
within two standard deviations of the mean.
Approximately 99.7% of the observations are
within three standard deviations of the mean.
Empirical Rules for Baseball Salaries (slide 2 of 3)
The empirical rules should be applied
with caution, especially when the data are clearly skewed, as illustrated by the calculations for baseball salaries below.
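The baseball salary calculations themselves are not reproduced here, but the same check can be sketched on any data set. The Python function below counts the share of observations within k standard deviations of the mean; the skewed data set is hypothetical and deliberately extreme.

```python
import statistics


def share_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)


# Hypothetical right-skewed data: many small values and one very large one.
skewed = [1] * 15 + [2] * 4 + [60]

for k in (1, 2, 3):
    print(k, share_within(skewed, k))
```

For this data set, 95% of the observations already fall within one standard deviation of the mean, far from the 68% the first rule suggests, because the single large value inflates the standard deviation.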
Empirical Rules for Interpreting Standard Deviation (slide 3 of 3)
The mean absolute deviation ( MAD ) is the average of the absolute deviations.
In Excel, the AVEDEV function can be used to calculate MAD.
For many variables, the standard deviation
is approximately 25% larger than MAD.
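A short Python sketch of the MAD calculation follows, using hypothetical data. The function averages absolute deviations from the mean, which is also what Excel's AVEDEV function computes.

```python
import statistics


def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = statistics.mean(data)
    return sum(abs(x - mean) for x in data) / len(data)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mad = mean_absolute_deviation(data)
sd = statistics.stdev(data)

# Ratio of standard deviation to MAD; the 25% rule of thumb is approximate
# and will not hold exactly for a small or unusual data set like this one.
print(mad, sd, sd / mad)
```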
Measures of Shape
(slide 1 of 2)
Skewness occurs when there is a lack
of symmetry.
A variable can be skewed to the right (or positively skewed) because of some really large values (e.g., really large baseball salaries).
Or it can be skewed to the left (or negatively skewed) because of some really small values (e.g., temperature lows in Antarctica).
In Excel, a measure of skewness can be calculated with the SKEW function.
Measures of Shape
(slide 2 of 2)
Kurtosis has to do with the “fatness” of the tails of the distribution relative to the tails of a normal distribution.
A distribution with high kurtosis has
many more extreme observations.
In Excel, kurtosis can be calculated with
the KURT function.
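Both shape measures can be sketched in Python. The formulas below are the bias-adjusted sample versions that Excel's SKEW and KURT functions are documented to use; treat the exact match with Excel as an assumption and check against a worksheet if precision matters. The function names and test data are hypothetical.

```python
import statistics


def sample_skewness(data):
    """Adjusted sample skewness: positive for right skew, negative for left."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(data):
    """Adjusted sample kurtosis relative to a normal distribution (needs n >= 4)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric
right_skewed = [1, 1, 1, 2, 10]  # hypothetical, one large value

print(sample_skewness(symmetric), sample_skewness(right_skewed))
print(sample_excess_kurtosis(symmetric))
```

A value near zero on either measure indicates a shape close to the normal distribution in that respect; a large positive kurtosis signals fatter tails.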
Numerical Summary Measures in the
Status Bar and with StatTools
If you select multiple cells, summary
measures appear for the selected cells in the status bar at the bottom of the Excel window.
You can choose the summary measures
that appear by right-clicking the status bar and selecting your favorites.
Although Excel’s built-in functions can be used to calculate a number of summary measures, a much quicker way is to use the StatTools add-in.
Example 2.3 (Continued):
Baseball Salaries 2011.xlsx
Objective: To learn the fundamentals of StatTools and use it to generate summary measures of baseball salaries.
Solution: First, define a StatTools data set by selecting any cell in the data set and clicking the Data Set Manager button.
Then generate summary measures for the Salary variable by selecting One-Variable Summary from the Summary Statistics dropdown list and filling in the dialog box that appears.
Charts for Numerical
Variables
There are many graphical ways to
indicate the distribution of a numerical variable.
For cross-sectional variables:
Histograms
Box plots
For time series variables:
Time series graphs
Histograms
A histogram is the most common type
of chart for showing the distribution of a numerical variable.
It is based on binning the variable—that is, dividing it up into discrete categories.
It is a column chart of the counts in the
various categories (with no gaps between the vertical bars).
A histogram is great for showing the
shape of a distribution—whether the
distribution is symmetric or skewed in
one direction.
Example 2.3 (Continued):
Objective: To see the shape of the salary
distribution through a histogram.
Solution: It is possible to create a histogram with
Excel tools only—but it is a tedious process.
The resulting table of counts is usually called a frequency table.
The counts are called frequencies.
It is much easier to create a histogram with
StatTools.
First, designate a StatTools data set.
Next, select Histogram from the Summary Graphs
dropdown list.
In the dialog box, select the Salary variable and click OK.
Example 2.3 (Continued):
Solution: Data set lists the
number of bags that were either
late or lost for 456 flights.
In the Histogram dialog box,
request 9 bins and set the
minimum and maximum to -0.5
and 8.5.
StatTools divides the range into
9 equal-length bins.
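The bin arithmetic above can be checked with a short Python sketch. With a minimum of -0.5, a maximum of 8.5, and 9 bins, each bin has length 1 and is centered on one of the integers 0 through 8, so every possible bag count gets its own column. The helper functions and the list of bag counts are hypothetical and are not part of StatTools.

```python
def equal_length_bins(minimum, maximum, n_bins):
    """Return n_bins + 1 edges that split [minimum, maximum] into equal bins."""
    width = (maximum - minimum) / n_bins
    return [minimum + i * width for i in range(n_bins + 1)]


def frequency_table(values, edges):
    """Count the values falling in each bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


edges = equal_length_bins(-0.5, 8.5, 9)

# Hypothetical numbers of late or lost bags on ten flights.
bags = [0, 1, 1, 2, 0, 3, 8, 1, 0, 2]
counts = frequency_table(bags, edges)

print(edges)
print(counts)
```

Placing the edges at half-integers is the design choice that matters here: no integer count can land exactly on a bin boundary, so there is no ambiguity about which bin it belongs to.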
Example 2.4: