
Business Analytics: Data Analysis and Decision Making, 5th Edition, by Wayne L. Winston, Chapter 02


Document information

Pages: 55

File size: 5.09 MB


Page 1

DECISION MAKING

Chapter 2

Describing the Distribution of a Single Variable

Page 2

Introduction

(slide 1 of 2)

• The goal is to present data in a form that makes sense to people. Tools that are used to do this include:

  ◦ Graphs: bar charts, pie charts, histograms, scatterplots, time series graphs

  ◦ Numerical summary measures: counts, percentages, averages, measures of variability

  ◦ Tables of summary measures: totals, averages, counts, grouped by categories

• It is a challenge to summarize data so that the important information stands out clearly.

Page 3

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction

(slide 2 of 2)

• There are four steps in data analysis:

1. Recognize a problem that needs to be solved.

2. Gather data to help understand and then solve the problem.

3. Analyze the data.

4. Act on this analysis.

• It is up to you to ask good questions, and then to take advantage of the most appropriate tools to answer them.

Page 4

Populations and Samples

• A population includes all of the entities of interest in a study (people, households, machines, etc.).

  ◦ Examples:

    ◦ All potential voters in a presidential election

    ◦ All subscribers to cable television

    ◦ All invoices submitted for Medicare reimbursement by nursing homes

• A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.

  ◦ Examples: Gallup, Harris, other polls today

Page 5

Data Sets, Variables, and Observations

• A data set is usually a rectangular array of data, with variables in columns and observations in rows.

Page 6

Example 2.1:

Questionnaire Data.xlsx

Objective: To illustrate variables and observations in a typical data set.

Solution: Data set includes observations on 30 people who responded to a questionnaire on the president’s environmental policies.

• Variables include: age, gender, state, children, salary, opinion.

• Include a row that lists variable names.

• Include a column that shows an index of the observation.

Page 7

Types of Data

(slide 1 of 5)

• A variable is numerical if meaningful arithmetic can be performed on it.

• Otherwise, the variable is categorical.

• There is also a third data type, a date variable.

  ◦ Excel® stores dates as numbers, but dates are treated differently from typical numbers.

• A categorical variable is ordinal if there is a natural ordering of its possible values.

• If there is no natural ordering, it is nominal.

Page 8

Types of Data

(slide 2 of 5)

• A dummy variable is a 0–1 coded variable for a specific category.

  ◦ It is coded as 1 for all observations in that category and 0 for all observations not in that category.

• Categorizing a numerical variable by putting the data into discrete categories (called bins) is called binning or discretizing.

  ◦ A variable that has been categorized in this way is called a binned or discretized variable.
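Binning and dummy coding can be sketched in a few lines of Python. This is only an illustration: the ages, the cutoffs (35 and 60), and the category labels are made-up assumptions, not values from the slides or from the book's data files.

```python
from bisect import bisect_right

# Hypothetical ages; the cutoffs 35 and 60 and the labels are assumptions.
ages = [23, 35, 47, 61, 58, 29]
cutoffs = [35, 60]
labels = ["young", "middle-aged", "elderly"]


def bin_value(value, cutoffs, labels):
    """Return the label of the bin that contains value.

    A value equal to a cutoff is placed in the higher bin.
    """
    return labels[bisect_right(cutoffs, value)]


# Binned (discretized) version of the numerical variable.
binned_ages = [bin_value(a, cutoffs, labels) for a in ages]

# Dummy (0-1) variable for the "young" category.
is_young = [1 if label == "young" else 0 for label in binned_ages]
```

How boundary values are assigned (here, 35 counts as "middle-aged") is a design choice that should be stated whenever a variable is binned.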

Page 9

Environmental Data Using a Different Coding (slide 3 of 5)

Page 10

Types of Data

(slide 4 of 5)

• A continuous variable is the result of an essentially continuous measurement, such as weight or height.

• Cross-sectional data are data on a cross section of a population at a distinct point in time.

• Time series data are data collected over time.

Page 11

Typical Time Series Data Set

(slide 5 of 5)

Page 12

Descriptive Measures for Categorical Variables

• There are only a few possibilities for describing a categorical variable, all based on counting:

  ◦ Count the number of categories.

  ◦ Give the categories names.

  ◦ Count the number of observations in each category (referred to as the count of categories).

• Once you have the counts, you can display them graphically, usually in a column chart or a pie chart.

Page 13

Example 2.2:

Supermarket Transactions.xlsx (slide 1 of 3)

Objective: To summarize categorical variables in a large data set.

Solution: Data set contains transactions made by supermarket customers over a two-year period.

• Children, Units Sold, and Revenue are numerical.

• Purchase Date is a date variable.

• Transaction and Customer ID are used only for identification.

• All of the other variables are categorical.

Page 14

Example 2.2:

Supermarket Transactions.xlsx (slide 2 of 3)

• To get the counts in column S, use Excel’s COUNTIF function.

• To get the percentages in column T, divide each count by the total number of observations.

• When creating charts, be careful to use appropriate scales.

Page 15

Example 2.2:

Supermarket Transactions.xlsx (slide 3 of 3)

• Another efficient way to find counts for a categorical variable is to use dummy (0–1) variables.

  ◦ Recode each variable so that one category is replaced by 1 and all others by 0. This can be done using a simple IF formula.

  ◦ Find the count of that category by summing the 0s and 1s.

  ◦ Find the percentage of that category by averaging the 0s and 1s.
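The two counting approaches from Example 2.2 can be compared in a short Python sketch. The `gender` list below is hypothetical data invented for illustration, not taken from Supermarket Transactions.xlsx; `Counter` plays roughly the role of COUNTIF, and the list comprehension plays the role of the IF formula.

```python
from collections import Counter

# Hypothetical values of a categorical variable (not from the actual file).
gender = ["F", "M", "F", "F", "M", "F", "M", "F"]

# Approach 1: count each category directly (the analogue of COUNTIF),
# then divide each count by the total number of observations.
counts = Counter(gender)
proportions = {category: n / len(gender) for category, n in counts.items()}

# Approach 2: recode one category as a dummy (0-1) variable,
# then sum the 0s and 1s for the count and average them for the proportion.
female_dummy = [1 if g == "F" else 0 for g in gender]
female_count = sum(female_dummy)
female_proportion = sum(female_dummy) / len(female_dummy)
```

Both approaches give the same count and proportion for a category; multiply the proportion by 100 to express it as a percentage.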

Page 16

Descriptive Measures for Numerical Variables

• There are many ways to summarize numerical variables, both with numerical summary measures and with charts.

• To learn how the values of a variable are distributed, ask:

  ◦ What are the most “typical” values?

  ◦ How spread out are the values?

  ◦ What are the “extreme” values on either end?

  ◦ Is the chart of the values symmetric about some middle value, or is it skewed in some direction? Does it have any other peculiar features besides possible skewness?

Page 17

Example 2.3:

Baseball Salaries 2011.xlsx

Objective: To learn how salaries are distributed across all 2011 MLB players.

Solution: Data set contains data on 843 Major League Baseball players in the 2011 season.

• Variables are player’s name, team, position, and salary.

• Create summary measures of baseball salaries using Excel functions.

Page 18

Example 2.3:

Page 19

Measures of Central Tendency (slide 1 of 3)

• The mean is the average of all values.

  ◦ If the data set represents a sample from some larger population, this measure is called the sample mean and is denoted by $\bar{X}$.

  ◦ If the data set represents the entire population, it is called the population mean and is denoted by μ.

• In Excel, the mean can be calculated with the AVERAGE function.

Page 20

Measures of Central Tendency (slide 2 of 3)

• The median is the middle observation when the data are sorted from smallest to largest.

  ◦ If the number of observations is odd, the median is literally the middle observation.

  ◦ If the number of observations is even, the median is usually defined as the average of the two middle observations.

• In Excel, the median can be calculated with the MEDIAN function.

Page 21

Measures of Central Tendency (slide 3 of 3)

• The mode is the value that appears most often.

  ◦ In most cases where a variable is essentially continuous, the mode is not very interesting because it is often the result of a few lucky ties.

  ◦ However, it is not always a result of luck and may reveal interesting information.

• In Excel, the mode can be calculated with the MODE function.
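For readers working outside Excel, the three measures of central tendency can be sketched with Python's standard `statistics` module. The salary figures are hypothetical and were chosen only to show how one very large value pulls the mean above the median.

```python
import statistics

# Hypothetical salaries in dollars; the one large value mimics right skew.
salaries = [400_000, 450_000, 450_000, 900_000, 2_500_000, 12_000_000]

mean_salary = statistics.mean(salaries)      # counterpart of AVERAGE
median_salary = statistics.median(salaries)  # counterpart of MEDIAN
mode_salary = statistics.mode(salaries)      # counterpart of MODE
```

With an even number of observations, `statistics.median` averages the two middle values, which matches the usual definition given on the median slide.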

Page 22

Minimum, Maximum, Percentiles, and Quartiles

• For any percentage p, the pth percentile is the value such that a percentage p of all values are less than it.

• The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.

  ◦ The first, second and third quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.

  ◦ By definition, the second quartile (p = 50%) is equal to the median.

• The minimum and maximum can be calculated with Excel’s MIN and MAX functions, and the percentiles and quartiles with Excel’s PERCENTILE and QUARTILE functions.
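A rough Python counterpart of these calculations is sketched below with made-up data. One assumption to flag: software packages interpolate percentiles in slightly different ways, and the `method="inclusive"` option used here is chosen on the understanding that it interpolates between the minimum and maximum in the same way as Excel's QUARTILE; that correspondence should be verified before relying on it.

```python
import statistics

# Hypothetical data, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

minimum, maximum = min(data), max(data)

# n=4 asks for the three cut points that split the data into quarters.
# method="inclusive" treats min and max as the 0th and 100th percentiles
# (assumed here to correspond to Excel's QUARTILE behavior).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The second quartile should equal the median.
median = statistics.median(data)
```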

Page 23

Measures of Variability

(slide 1 of 3)

• The range is the maximum value minus the minimum value.

• The interquartile range (IQR) is the third quartile minus the first quartile.

  ◦ Thus, it is the range of the middle 50% of the data.

  ◦ It is less sensitive to extreme values than the range.

• The variance is essentially the average of the squared deviations from the mean.

  ◦ If $X_i$ is a typical observation, its squared deviation from the mean is $(X_i - \text{mean})^2$.

Page 24

Measures of Variability

(slide 2 of 3)

• If at least a few of the observations are far from the mean, their squared deviations from the mean, and therefore the variance, will be large.

• In Excel, use the VAR function to obtain the sample variance and the VARP function to obtain the population variance.
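For reference, the standard textbook formulas behind the two variances can be written as follows. The symbols $s^2$, $\sigma^2$, and $n$ (the number of observations) are the conventional notation and are an assumption here, since the slide text for these formulas did not survive extraction.

```latex
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
\qquad\qquad
\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}
```

The only difference is the divisor: the sample variance divides the sum of squared deviations by $n - 1$ (as VAR does), while the population variance divides by $n$ (as VARP does). This is why "essentially the average" is the right description of the sample variance.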

Page 25

Measures of Variability

(slide 3 of 3)

• A fundamental problem with variance is that it is in squared units (e.g., dollars become squared dollars).

• A more natural measure is the standard deviation, which is the square root of variance.

  ◦ The sample standard deviation, denoted by s, is the square root of the sample variance.

  ◦ The population standard deviation, denoted by σ, is the square root of the population variance.

• In Excel, use the STDEV function to find the sample standard deviation or the STDEVP function to find the population standard deviation.
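The sample and population versions of variance and standard deviation can be checked side by side in Python. The eight values below are hypothetical and were picked so that the population results come out as whole numbers.

```python
import statistics

# Hypothetical data: mean is 5 and the squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = statistics.variance(values)       # divides by n - 1, like VAR
population_variance = statistics.pvariance(values)  # divides by n, like VARP
sample_sd = statistics.stdev(values)                 # like STDEV
population_sd = statistics.pstdev(values)            # like STDEVP
```

The sample versions are always slightly larger than the population versions for the same data, and the gap shrinks as the number of observations grows.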

Page 26

Calculating Variance and Standard Deviation

Page 27

Empirical Rules for Interpreting Standard Deviation (slide 1 of 3)

• The interpretation of the standard deviation can be stated as three empirical rules.

• If the values of a variable are approximately normally distributed (symmetric and bell-shaped), then the following rules hold:

  ◦ Approximately 68% of the observations are within one standard deviation of the mean.

  ◦ Approximately 95% of the observations are within two standard deviations of the mean.

  ◦ Approximately 99.7% of the observations are within three standard deviations of the mean.
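One way to see the empirical rules in action is to simulate normally distributed values and count how many fall within one, two, and three standard deviations of the mean. The sketch below does this with hypothetical parameters (mean 100, standard deviation 15, 10,000 values, fixed seed); `fraction_within` is an illustrative helper, not a function from Excel or StatTools.

```python
import random
import statistics


def fraction_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


random.seed(0)  # fixed seed so the simulated data are reproducible
normal_values = [random.gauss(100, 15) for _ in range(10_000)]

# Observed fractions for k = 1, 2, 3; compare with 68%, 95%, and 99.7%.
fractions = {k: fraction_within(normal_values, k) for k in (1, 2, 3)}
```

Applying the same helper to clearly skewed data would typically give fractions that depart from 68%, 95%, and 99.7%, which is the reason the rules are tied to approximately normal data.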

Page 28

Empirical Rules for Baseball Salaries (slide 2 of 3)

• The empirical rules should be applied with caution, especially when the data are clearly skewed, as illustrated by the calculations for baseball salaries below.

Page 29

Empirical Rules for Interpreting Standard Deviation (slide 3 of 3)

• The mean absolute deviation (MAD) is the average of the absolute deviations from the mean.

  ◦ In Excel, use the AVEDEV function to calculate MAD.

• For many variables, the standard deviation is approximately 25% larger than MAD.
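MAD is easy to compute directly, and the "about 25% larger" relationship can be checked on simulated data. In the sketch below, `mean_absolute_deviation` is an illustrative helper and the normal data are simulated with hypothetical parameters; for an exactly normal distribution the ratio of the standard deviation to MAD is the square root of π/2, roughly 1.25, which is where the 25% figure comes from.

```python
import random
import statistics


def mean_absolute_deviation(values):
    """Average absolute deviation from the mean (what AVEDEV computes)."""
    mean = statistics.mean(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Small hypothetical data set: mean 5, absolute deviations sum to 12.
small_mad = mean_absolute_deviation([2, 4, 4, 4, 5, 5, 7, 9])

# Simulated, approximately normal data to check the 25% rule of thumb.
random.seed(1)
normal_values = [random.gauss(50, 10) for _ in range(10_000)]
ratio = statistics.stdev(normal_values) / mean_absolute_deviation(normal_values)
```

For skewed or heavy-tailed data the ratio can be noticeably different from 1.25, so the rule of thumb is best read as a rough guide.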

Page 30

Measures of Shape

(slide 1 of 2)

• Skewness occurs when there is a lack of symmetry.

  ◦ A variable can be skewed to the right (or positively skewed) because of some really large values (e.g., really large baseball salaries).

  ◦ Or it can be skewed to the left (or negatively skewed) because of some really small values (e.g., temperature lows in Antarctica).

• In Excel, a measure of skewness can be calculated with the SKEW function.

Page 31

Measures of Shape

(slide 2 of 2)

• Kurtosis has to do with the “fatness” of the tails of the distribution relative to the tails of a normal distribution.

  ◦ A distribution with high kurtosis has many more extreme observations.

• In Excel, kurtosis can be calculated with the KURT function.
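The slides do not give formulas for skewness and kurtosis. As a hedged sketch, the functions below implement the adjusted sample formulas that Excel documents for SKEW and KURT; the function names are hypothetical, and the formulas assume at least three observations for skewness and at least four for kurtosis.

```python
import statistics


def excel_style_skew(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    total = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * total


def excel_style_kurt(values):
    """Adjusted sample excess kurtosis (a normal distribution scores near 0)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((v - mean) / s) ** 4 for v in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction


symmetric_skew = excel_style_skew([1, 2, 3, 4, 5])   # symmetric data
right_skew = excel_style_skew([1, 1, 1, 10])         # one large value
flat_kurt = excel_style_kurt([1, 2, 3, 4, 5])        # thin tails
```

A positive skewness value indicates a longer right tail and a negative value a longer left tail; a negative kurtosis value, as for the evenly spaced data here, indicates thinner tails than a normal distribution.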

Page 32

Numerical Summary Measures in the Status Bar and with StatTools

• If you select multiple cells, summary measures appear for the selected cells in the status bar at the bottom of the Excel window.

• You can choose the summary measures that appear by right-clicking the status bar and selecting your favorites.

• Although Excel’s built-in functions can be used to calculate a number of summary measures, a much quicker way is to use the StatTools add-in.

Page 33

Example 2.3 (Continued):

Baseball Salaries 2011.xlsx

Objective: To learn the fundamentals of StatTools and use it to generate summary measures of baseball salaries.

Solution: First, define a StatTools data set by selecting any cell in the data set and clicking the Data Set Manager button.

• Then generate summary measures for the Salary variable by selecting One-Variable Summary from the Summary Statistics dropdown list and filling in the dialog box that appears.

Page 34

Charts for Numerical Variables

• There are many graphical ways to indicate the distribution of a numerical variable.

• For cross-sectional variables:

  ◦ Histograms

  ◦ Box plots

• For time series variables:

  ◦ Time series graphs

Page 35

Histograms

• A histogram is the most common type of chart for showing the distribution of a numerical variable.

  ◦ It is based on binning the variable, that is, dividing it up into discrete categories.

  ◦ It is a column chart of the counts in the various categories (with no gaps between the vertical bars).

• A histogram is great for showing the shape of a distribution, in particular whether the distribution is symmetric or skewed in one direction.

Page 36

Example 2.3 (Continued):

Objective: To see the shape of the salary distribution through a histogram.

Solution: It is possible to create a histogram with Excel tools only, but it is a tedious process.

  ◦ The resulting table of counts is usually called a frequency table.

  ◦ The counts are called frequencies.

• It is much easier to create a histogram with StatTools.

  ◦ First, designate a StatTools data set.

  ◦ Next, select Histogram from the Summary Graphs dropdown list.

  ◦ In the dialog box, select the Salary variable and click OK.

Page 37

Example 2.3 (Continued):

Page 38

Solution: Data set lists the number of bags that were either late or lost for 456 flights.

• In the Histogram dialog box, request 9 bins and set the minimum and maximum to -0.5 and 8.5.

• StatTools divides the range into 9 equal-length bins.
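The effect of these settings can be reproduced with a small frequency-table sketch in Python. The `frequency_table` helper and the `bags` data are hypothetical (this is not how StatTools is implemented, and the values are not from the actual file); the point is that 9 bins from -0.5 to 8.5 are each one unit wide and centered on the integers 0 through 8.

```python
def frequency_table(values, minimum, maximum, num_bins):
    """Count values in num_bins equal-length bins spanning [minimum, maximum]."""
    width = (maximum - minimum) / num_bins
    counts = [0] * num_bins
    for v in values:
        if not minimum <= v <= maximum:
            continue  # ignore values outside the requested range
        # A value exactly at the maximum is placed in the last bin.
        index = min(int((v - minimum) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical counts of late or lost bags on a handful of flights.
bags = [0, 1, 1, 2, 2, 2, 3, 5, 8]

# Bin i covers [i - 0.5, i + 0.5), so each integer count gets its own bin.
counts = frequency_table(bags, -0.5, 8.5, 9)
```

Choosing bin boundaries at half-integers is a common device for count data, because no observation can fall exactly on a boundary.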

Page 39

Example 2.4:

Date posted: 10/08/2017, 10:35
