Descriptive Statistics: Numerical Methods
4.1 Measures of Central Location
With one data point, the central location is clearly at the point itself.
The central data point reflects the locations of all
the actual data points.
With two data points, the central location should fall in the middle between them (in order
to reflect the location of both of them).
With three data points: if the third data point appears in the center, the measure of central location remains in the center.
But if the third data point appears to the left of the midrange, it should "pull" the central location to the left.
The Arithmetic Mean (average)
• This is the most popular and useful measure of central location
$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$
The Arithmetic Mean
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example 1: The Arithmetic Mean
Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%
Solution
$$\bar{x} = \frac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \frac{59.52}{5} = 11.904\%$$
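The calculation in Example 1 can be checked with a short sketch. The variable names are illustrative assumptions; the five returns are the ones stated in the example, and they sum to 59.52, so the mean is 11.904%.

```python
from statistics import fmean

# Annual rates of return (in percent) from Example 1.
returns = [11.2, 8.07, 5.55, 13.7, 21]

# Arithmetic mean = sum of the measurements / number of measurements.
mean_by_formula = sum(returns) / len(returns)

# The standard library gives the same value.
mean_by_library = fmean(returns)

print(f"mean rate of return = {mean_by_formula:.3f}%")
```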
3. Geometric mean
• A specialized measure, used to find the average growth rate, or rate of change, of a variable over time
• Example:
The number of students attending the music class last Tuesday was 160. This Tuesday, the number is expected to increase by 15%.
How many of them are likely to attend this Tuesday?
The number of students likely to attend this Tuesday: 160 × (1 + 0.15) = 184
Growth rate (rate of change): 15%
(i) Simple geometric mean: applied when each rate of change appears once only
$$R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1$$
Example: What is the average rate of change in the number of employees?

Year              2000   2001   2002   2003   2004   2005   2006
No. of employees  200    220    250    262    284    300    312
(1+R)             -      1.1    1.136  1.048  1.084  1.056  1.04
The average rate of change:
$$R_g = \sqrt[6]{1.1 \times 1.136 \times 1.048 \times 1.084 \times 1.056 \times 1.04} - 1 \approx 0.077 = 7.7\%$$
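As a rough check of the employee example, the sketch below recomputes the yearly growth factors and their geometric mean. The variable names are assumptions made for illustration. Because the product of the factors telescopes to the last count over the first, the same rate can also be obtained from the end points alone.

```python
import math

# Number of employees in 2000..2006, as given in the example.
employees = [200, 220, 250, 262, 284, 300, 312]

# Growth factor (1 + R) for each consecutive pair of years.
growth_factors = [later / earlier
                  for earlier, later in zip(employees, employees[1:])]

# Simple geometric mean: n-th root of the product of the factors, minus 1.
n = len(growth_factors)
avg_rate = math.prod(growth_factors) ** (1 / n) - 1

# The product telescopes to 312 / 200, which gives a shortcut.
shortcut = (employees[-1] / employees[0]) ** (1 / n) - 1

print(f"average rate of change = {avg_rate:.1%}")
```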
[Table: growth rate (%) data, not recoverable]
Characteristics of the mean
• A representative of a data set
• Takes every single value into account, so it is likely to be affected by extreme values
• Used to compare different-sized data sets
The Median
• The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.
• When determining the median, pay attention to the number of observations (k).
• If k is odd: Median = the number at the ((k+1)/2)th location of the ordered array.
• If k is even: Median = the average of the two numbers in the middle (the numbers at the (k/2)th and the [(k/2)+1]th locations of the ordered array).
The Median
Example: Find the median salary (salaries in $1,000s).
There are seven salaries (k = 7), an odd number of observations.
The (k+1)/2th salary of the ordered array is the number at the (7+1)/2 = 4th location.
The median is 29.
Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.
There are eight salaries (k = 8), an even number of observations.
The two salaries in the middle are 29 (in the (k/2)th = 4th location) and 30 (in the [(k/2)+1]th = 5th location).
The median is the average of these two numbers: 29.5.
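The odd/even rule can be sketched as a small function. The salary list below is a hypothetical assumption (in $1,000s): the individual salaries are not listed in the text, so these values were chosen only so that the medians match the ones stated above (29, then 29.5 after adding 31).

```python
def median(values):
    """Median by the location rule: middle value for odd k,
    average of the two middle values for even k."""
    ordered = sorted(values)
    k = len(ordered)
    if k == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = k // 2
    if k % 2 == 1:
        # 1-based location (k + 1) / 2 is 0-based index k // 2.
        return ordered[mid]
    # 1-based locations k/2 and k/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical salaries in $1,000s (assumed, not taken from the slides).
salaries = [28, 60, 26, 32, 30, 26, 29]
median_of_seven = median(salaries)

# Add the $31,000 salary mentioned in the example.
median_of_eight = median(salaries + [31])

print(median_of_seven, median_of_eight)
```

Note that the extreme salary of 60 has no effect on either median, which is the property the slides contrast with the mean.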
The Mode
• The mode of a set of measurements is the value that occurs most frequently.
• A set of data may have one mode (or modal class), or two or more modes.
• The modal class: for large data sets the modal class is much more relevant than a single-value mode.
The Mode
• The mode of this data set is 34 in.
• This information seems to be valuable (for example, for the design of a new display in the store), much more so than "the median is 33.5 in."
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is nonsymmetrical, and skewed to the left or to the right, the three measures differ.
A positively skewed distribution ("skewed to the right"): typically Mode < Median < Mean
A negatively skewed distribution ("skewed to the left"): typically Mean < Median < Mode
Using the Mean, Median, and Mode
• The mean is very sensitive to extreme values
• The median is not affected by extreme values, yet it does not reflect all the values included in the data set, but rather the location of the observation in the middle
• The mode should be used mainly for
4.2 Measures of Variability
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered:
How much are the values of a given set spread
out around the mean value?
• Think of a sample portfolio composed of three stocks:
100 shares, ARR = 10%
200 shares, ARR = 15%
100 shares, ARR = 20%
A central measure for this portfolio's ARR is 15% (the share-weighted mean: (100×10 + 200×15 + 100×20)/400 = 15).
Now observe the following portfolio
• Considering the average ARR only, the two portfolios are equal. But are they really?
• Is the dispersion (variability) of ARR the same for the two portfolios?
• The dispersion is as important as the central location.
The Range
Range = Largest measurement - Smallest measurement
But how do all the measurements spread out?
The range cannot assist in answering this question.
The Variance
• This measure reflects the dispersion of all the measurement values.
The variance of a population of N measurements x1, x2,…,xN having a mean μ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
The variance of a sample of n measurements x1, x2, …,xn having a mean x̄ is defined as
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
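The Variance
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are more dispersed than those in A.
A measure of dispersion should agree with this observation.
Can the sum of deviations from the mean be a good measure of dispersion?
Deviations in B: 4-10 = -6, 7-10 = -3, 13-10 = +3, 16-10 = +6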
The sum of deviations is zero for both populations; therefore it is not a good measure of dispersion, since clearly their dispersion is not equal.
The Variance
Let us calculate the variance of the two populations:
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
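The point about the two populations can be illustrated with a minimal sketch: the sum of deviations is zero for both, while the population variance separates them. The names `population_a`, `population_b` and `sum_of_deviations` are assumptions made for illustration; the values are the two five-number populations used in the variance calculation.

```python
from statistics import pvariance

# The two small populations, both with mean 10.
population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Sum of (x - mean); always zero, so it cannot measure dispersion."""
    mean = sum(values) / len(values)
    return sum(x - mean for x in values)


# Population variance: average squared deviation from the mean (divide by N).
var_a = pvariance(population_a)
var_b = pvariance(population_b)

print(sum_of_deviations(population_a), sum_of_deviations(population_b))
print(var_a, var_b)
```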
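• Example 6
• Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10
Solution
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{-2 + 4 + 5 + 6.9 + 10}{5} = \frac{23.9}{5} = 4.78$$
(i) Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{(-2-4.78)^2 + (4-4.78)^2 + (5-4.78)^2 + (6.9-4.78)^2 + (10-4.78)^2}{5-1} = 19.59\ \text{percent}^2$$
(ii) Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} = \frac{(-2-4.78)^2 + (4-4.78)^2 + (5-4.78)^2 + (6.9-4.78)^2 + (10-4.78)^2}{5} = 15.6736\ \text{percent}^2$$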
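Both parts of Example 6 can be reproduced with Python's standard `statistics` module, where `variance` divides by n - 1 (sample) and `pvariance` divides by N (population). This is only a sketch for checking the arithmetic; the variable names are assumptions.

```python
from statistics import mean, pvariance, variance

# Annual rates of return (percent) for the mutual funds in Example 6.
returns = [-2, 4, 5, 6.9, 10]

x_bar = mean(returns)          # 23.9 / 5
s2 = variance(returns)         # (i) sample: divide by n - 1
sigma2 = pvariance(returns)    # (ii) population: divide by N

print(f"mean = {x_bar:.2f}")
print(f"sample variance = {s2:.2f}")
print(f"population variance = {sigma2:.4f}")
```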
Standard Deviation
The standard deviation is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
The Empirical Rule for a Bell-Shaped Data Set
Approximately 68% of all observations fall
within one standard deviation of the mean.
Approximately 95% of all observations fall
within two standard deviations of the mean.
Approximately 99.7% of all observations fall
within three standard deviations of the mean.
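Chebyshev's Theorem
• The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for any k > 1.
• This theorem is valid for any set of measurements (sample, population) of any shape!
k = 2: at least 3/4 (75%) of the observations lie in $(\bar{x} - 2s,\ \bar{x} + 2s)$
k = 3: at least 8/9 (88.9%) of the observations lie in $(\bar{x} - 3s,\ \bar{x} + 3s)$
k = 4: at least 15/16 (93.75%) of the observations lie in $(\bar{x} - 4s,\ \bar{x} + 4s)$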
If the histogram is not at all bell-shaped, we can say that at least 75% of the marks fell between 60 and 80, and at least 88.9% of the marks fell between 55 and 85. (We can use other values of k.)
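The bound 1 - 1/k² is easy to tabulate. In the sketch below, the mean of 70 and standard deviation of 5 are assumptions inferred from the quoted intervals (60 to 80 for k = 2, 55 to 85 for k = 3); they are not stated explicitly in the text, and the function names are illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations
    of the mean, valid for any distribution shape (requires k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def interval(mean, std, k):
    """Interval (mean - k*std, mean + k*std)."""
    return (mean - k * std, mean + k * std)


# Assumed from the intervals quoted above: mean mark 70, standard deviation 5.
mean_mark, std_mark = 70, 5

for k in (2, 3, 4):
    low, high = interval(mean_mark, std_mark, k)
    print(f"k={k}: at least {chebyshev_lower_bound(k):.1%} "
          f"of marks between {low} and {high}")
```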
The p-th percentile of a set of measurements is the value for which
• At most p% of the measurements are less than that value
• At most (100-p)% of all the measurements are greater than that value.
• Commonly used percentiles:
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1, = 25th percentile
• Second (middle) quartile, Q2, = 50th percentile (the median)
• Third quartile, Q3, = 75th percentile
• Ninth (upper) decile = 90th percentile
Location of percentiles:
$$L_P = (n+1)\frac{P}{100}$$
where $L_P$ is the location of the P-th percentile.
• Example 12 (solution continued)
• Finding the location of the 20th percentile in the ordered data:
• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
$$L_{20} = (n+1)\frac{P}{100} = (10+1)\frac{20}{100} = 2.2$$
• Finding the value of the 20th percentile:
The 20th percentile is located at location 2.2, that is, at 0.2 of the distance from 3.1 (the 2nd value) to 5.2 (the 3rd value).
Therefore, 20th percentile = 3.1 + 0.2(5.2 - 3.1) = 3.52
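A minimal sketch of the location rule follows, assuming linear interpolation between the two neighbouring ordered values when the location is fractional. The function names are illustrative, and statistical packages may use slightly different percentile conventions. Applying L = (n+1)P/100 with n = 10 and P = 20 gives location 2.2.

```python
def percentile_location(n, p):
    """Location L_P = (n + 1) * P / 100 in the ordered array (1-based)."""
    return (n + 1) * p / 100


def percentile(values, p):
    """P-th percentile by the location rule with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    loc = percentile_location(n, p)
    lower = int(loc)          # whole part of the location
    fraction = loc - lower    # fractional distance to the next value
    if lower < 1:
        return ordered[0]     # location falls before the first value
    if lower >= n:
        return ordered[-1]    # location falls after the last value
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9]
loc_20 = percentile_location(len(data), 20)
p20 = percentile(data, 20)

print(f"location = {loc_20:.2f}, 20th percentile = {p20:.2f}")
```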
Quartiles and Variability
• Quartiles can provide an idea about the shape of a histogram
[Figure: a positively skewed histogram and a negatively skewed histogram, each with Q1, Q2 and Q3 marked]
Inter-quartile Range
Interquartile range = Q3 – Q1
• This is a measure of the spread of the middle 50% of the observations
• A large value indicates a large spread of the observations
• A box plot is a pictorial display that provides the main descriptive measures of the measurement set:
• L - The largest measurement
• Q3 - The upper quartile
• Q2 - The median
• Q1 - The lower quartile
• S - The smallest measurement
[Figure: box plot with whiskers extending at most 1.5(Q3 – Q1) beyond each quartile]
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )
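The outliers in this summary are consistent with the 1.5(Q3 – Q1) whisker rule shown on the box plot. The sketch below assumes that rule: any measurement more than 1.5 × IQR below Q1 or above Q3 is flagged. The variable and function names are illustrative.

```python
# Summary values from the box plot output above.
q1, q3 = 512, 575
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(x):
    """True if x lies beyond either fence."""
    return x < lower_fence or x > upper_fence


listed_outliers = [788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675]

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(all(is_outlier(x) for x in listed_outliers))
```

Under this rule the smallest measurement, 449, is inside the lower fence, so every flagged value sits in the upper tail.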
CHAPTER 4
DATA COLLECTION AND SAMPLING
• Methods of collecting data
• Simple Random Sampling
• Stratified Random Sampling
I Methods of collecting data
1. Observation
• The investigator observes characteristics of a subset of members of one or more existing populations.
• Goal: draw conclusions about the corresponding population or about the difference between two or more populations.
• Advantage vs Disadvantage
o Advantage: easy to conduct, relatively inexpensive
o Disadvantage: provides little useful information; impossible to draw cause-and-effect conclusions due to confounding variables
Observation
A researcher for a pharmaceutical company wants to determine whether aspirin reduces the incidence of heart attacks. He selects a sample of men and women and asks each whether he or she has taken aspirin regularly over the past 2 years. Each person is also asked whether he or she has suffered a heart attack over the same period. The proportions reporting heart attacks are compared, and a conclusion can be drawn about whether aspirin is effective in reducing the likelihood of heart attacks.
2. Experiment
• The investigator observes how a response variable behaves when the researcher manipulates one or more explanatory variables (factors).
• Goal: determine the effect of the manipulated factors on the response variable
Experiment
A researcher for a pharmaceutical company wants to determine whether aspirin reduces the incidence of heart attacks. He selects a sample of men and women. The sample is divided into two groups: one group takes aspirin regularly and the other does not. After 2 years, the researcher determines the proportion of people in each group who have suffered a heart attack. It is then possible to draw a conclusion about whether aspirin is effective in reducing the likelihood of heart attacks.
3. Survey
• One of the most familiar methods of collecting data
• Goal: used to solicit information from people concerning such things as income, family size, opinions on various issues…
• The majority of surveys are conducted for private use
• Examples:
o Market researchers conduct a survey to determine the preferences and attitudes of consumers, which will help target a new product;
o A company surveys customers' satisfaction with its products and service.
- Inexpensive
- Low response rate, high number of incorrect answers
Survey Design Steps
• Define the issue
o What are the purpose and objectives of the survey?
o Identify the questions to answer
• Decide what to measure and how to measure it
o Decide what information is needed to answer the questions
o Think about how you intend to tabulate and analyze the responses
• Define the population of interest
Survey Design Steps
• Design the questionnaire
o The questionnaire should be kept as short as possible
o The questions should be short, simple, clear, and unambiguous
o Begin with simple demographic questions
o Use both dichotomous (close-ended) questions and open-ended questions
o Avoid using leading questions
Survey Design Steps
• Pre-test the survey
o Pilot test with a small group of participants
o Assess clarity and length
• Determine the sample size and sampling method
• Select the sample and administer the survey
Types of Questions
• Close-ended Questions
• Select from a short list of defined choices
Example: Major: business / liberal arts / science / other
• Questions about the respondents' personal characteristics
Example: Gender: Female / Male
II SAMPLING METHODS
1/ Why Sampling
- Less time consuming than a census
- Less costly to administer than a census
- It is possible to obtain statistical results of a sufficiently high precision based on samples.
- Sometimes, it's impossible to identify the whole population
• A few parts selected for destructive testing
• selected for audit
Simple Random Samples
• Every individual or item from the population has an equal chance of being selected
• Selection may be with replacement or without replacement
• Samples can be obtained from a table of random numbers or computer random number generators
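A computer random number generator makes both kinds of selection straightforward. In the sketch below, the population of 100 numbered items, the sample size of 10 and the seed are all hypothetical choices made for illustration.

```python
import random

# Seeded generator so the illustration is reproducible (seed is arbitrary).
rng = random.Random(42)

# Hypothetical sampling frame: items numbered 1..100.
population = list(range(1, 101))

# Without replacement: no item can be selected twice.
sample_without_replacement = rng.sample(population, k=10)

# With replacement: the same item may be selected more than once.
sample_with_replacement = rng.choices(population, k=10)

print(sample_without_replacement)
print(sample_with_replacement)
```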
Stratified Random Samples
• Population divided into subgroups (called strata) according to some common characteristic
• Simple random sample selected from each subgroup
• Samples from subgroups are combined into one
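The three steps above can be sketched as follows. The strata, their sizes and the per-stratum sample sizes are hypothetical; here the sample sizes are chosen proportional to stratum size, which is one common choice and not something the text prescribes.

```python
import random


def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample from each stratum and combine them.

    strata: dict mapping stratum name -> list of members
    sizes:  dict mapping stratum name -> number of members to draw
    """
    combined = []
    for name, members in strata.items():
        combined.extend(rng.sample(members, sizes[name]))
    return combined


rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical population of 100 items split into three strata.
strata = {
    "stratum_1": list(range(0, 60)),
    "stratum_2": list(range(60, 90)),
    "stratum_3": list(range(90, 100)),
}
# Assumed proportional allocation for a total sample of 10.
sizes = {"stratum_1": 6, "stratum_2": 3, "stratum_3": 1}

sample = stratified_sample(strata, sizes, rng)
print(sorted(sample))
```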
Systematic Samples
• Decide on sample size: n
• Divide frame of N individuals into groups of k individuals: k = N/n
• Randomly select one individual from the 1st group
• Select every kth individual thereafter
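The procedure above can be sketched in a few lines. The frame of 100 numbered individuals and the sample size of 10 are hypothetical; when N is not a multiple of n, this sketch rounds k down, which is an assumption rather than something stated in the text.

```python
import random


def systematic_sample(frame, n, rng):
    """Pick a random start in the first group of k, then every k-th item."""
    k = len(frame) // n            # group size k = N / n (rounded down)
    if k < 1:
        raise ValueError("sample size n cannot exceed the frame size N")
    start = rng.randrange(k)       # random individual from the 1st group
    return [frame[start + i * k] for i in range(n)]


rng = random.Random(3)             # arbitrary seed for reproducibility
frame = list(range(1, 101))        # hypothetical frame: N = 100
sample = systematic_sample(frame, 10, rng)

print(sample)
```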
Cluster Samples
• Population is divided into several "clusters," each representative of the population
• A simple random sample of clusters is selected
• All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique
[Figure: population divided into 16 clusters; randomly selected clusters form the sample]
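A minimal sketch of cluster sampling, using every item in each selected cluster. The 16 clusters mirror the figure, but the cluster contents, the number of clusters drawn (4) and the seed are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and use every item in them."""
    chosen = rng.sample(range(len(clusters)), n_clusters)
    items = [item for index in chosen for item in clusters[index]]
    return chosen, items


rng = random.Random(11)  # arbitrary seed for reproducibility

# Hypothetical population: 16 clusters of 5 items each.
clusters = [list(range(c * 5, c * 5 + 5)) for c in range(16)]

chosen, sample = cluster_sample(clusters, 4, rng)
print(sorted(chosen))
print(sorted(sample))
```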
1/ Sampling Error
- An error is expected to occur when making a statement
about the population that is based on the observations
contained in a sample taken from the population
- The difference/deviation between the true (unknown)
value of a population parameter (mean, standard
deviation…) and its estimate, the sample statistic is the