FREQUENCY DISTRIBUTION & MEASURES OF DISPERSION
STATISTICS: The study of methods for collecting,
organizing, and analyzing data
o Descriptive Statistics: Procedures used to organize and
present data in a convenient and communicable form
o Inferential Statistics: Procedures employed to
arrive at broader conclusions or inferences about
populations on the basis of samples
POPULATION: The complete set of actual or
potential elements about which inferences are made
SAMPLE: A subset of the population selected using
some sampling method
o Sampling methods
-Cluster sample: A population is divided into
groups called clusters; some clusters are randomly
selected, and every member in them is observed
-Stratified sample: The population is divided into
strata, and a fixed number of elements of each
stratum are selected for the sample
-Simple random sample: A sample selected so that
each possible sample of the same size has an equal
probability of being selected; used for most
elementary inference
VARIABLE: An attribute of elements of a population or sample that can be measured; ex: height, weight, IQ, hair color, and pulse rate are some of the many variables that can be measured for people
DATA: Values of variables that have been
observed
o Types of data
-Qualitative (or "categorical") data are descriptive
but not numeric; ex: your gender, your birthplace,
the color of an automobile
-Quantitative data take numeric values
-Discrete data take counting numbers (0, 1, 2, …) as
values, usually representing things that can be
counted; ex: the number of fleas on a dog, the
number of times a professor is late in a semester
-Continuous data can take a range of numeric
values, not just counting numbers; ex: the height of
a child, the weight of a bag of beans, the amount of
time a professor is late
o Levels of measurement
-Qualitative data can be measured at the:
o Nominal level: Values are just names, without any order; ex: color of a car, major in college
o Ordinal level: Values have some natural order; ex: high school class (freshman/sophomore/junior/senior), military rank
-Quantitative data can be measured at the:
o Interval level: Numeric data with no natural zero point; intervals (differences) are meaningful, but ratios are not; ex: temperature in Fahrenheit degrees; 80°F is 20°F hotter than 60°F, but it is not meaningful to call it "4/3 as hot," since 0°F is not a true zero
o Ratio level: Numeric data for which there is a true zero; both intervals and ratios are meaningful; ex: weight, length, duration, most physical properties
STATISTIC: A numeric measure computed
from sample data, used to describe the sample and to
estimate the corresponding population parameter
PARAMETER: A numeric measure that describes a
population; parameters are usually not computed, but
are inferred from sample statistics
FREQUENCY DISTRIBUTION: Provides the frequency (number of times observed) of each value of a variable
Table #1: Students in a driving class are polled regarding the number of accidents they've had; columns: (# of accidents), (frequency), (relative frequency) [table values not recovered]
GROUPED FREQUENCY DISTRIBUTION: Values of the variable are grouped into classes
Table #2: The scores on a midterm exam are grouped into classes [table values not recovered]
RELATIVE FREQUENCY DISTRIBUTION: Each frequency is divided by the total number of observations to produce the proportion or percentage of the data set having that value; ex: third column of Table 1
CUMULATIVE FREQUENCY DISTRIBUTION: Frequencies count all observations at a particular value or class and all those below it; ex: third column of Table 2
MEAN: Most commonly used measure of central tendency, usually meant by "average"; sensitive to extreme values
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
o Trimmed mean: Computed by discarding some number of the highest and lowest values; less sensitive than the ordinary mean
o Weighted mean: Computed with a weight multiplied into each value, making some values influence the mean more: $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$
MEDIAN: Value that divides the set so the same number of observations lie on each side of it; less sensitive to extreme values; for an odd number of values, it is the middle value; for an even number, it is the average of the middle two; ex: in Table 1, the median is the average of the 28th and 29th observations, or 1.5
MODE: Observation that occurs with the greatest frequency; ex: in Table 1, the mode is 1
SUM OF SQUARES (SS): The sum of squared deviations from the mean
o Population SS: $\sum(x_i-\mu)^2$ or $\sum x_i^2 - \frac{(\sum x_i)^2}{N}$
o Sample SS: $\sum(x_i-\bar{x})^2$ or $\sum x_i^2 - \frac{(\sum x_i)^2}{n}$
VARIANCE: The average of squared differences between observations and their mean
o Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
o Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
o Variances for grouped data (class frequencies $f_i$, class midpoints $m_i$):
-Population: $\sigma^2 = \frac{1}{N}\sum f_i(m_i-\mu)^2$
-Sample: $s^2 = \frac{1}{n-1}\sum f_i(m_i-\bar{x})^2$
STANDARD DEVIATION: The square root of the variance; unlike the variance, it has the same units as the original data and is more commonly used; ex: Pop. S.D.: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$
STANDARD SCORES: Also known as z-scores; the standard score of a value is the directed number of standard deviations from the mean at which the value is found; that is, $z = \frac{x-\mu}{\sigma}$
o A positive z-score indicates a value greater than the mean; a negative z-score indicates a value less than the mean; a z-score of zero indicates the mean value
• Converting every value in a data set or distribution to a z-score is called standardization; once a data set or distribution has been standardized, it has a new mean μ = 0 and a new standard deviation σ = 1
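The mean, standard deviation, and standardization steps above can be sketched in a few lines of Python; this is an illustrative sketch with made-up data, using only the standard library's statistics module:

```python
import statistics

# Hypothetical data (not from the guide's tables)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # x-bar = 5.0
s = statistics.stdev(data)        # sample SD, divides by n - 1
sigma = statistics.pstdev(data)   # population SD, divides by N; here 2.0

# Standardization: convert every value to a z-score
z = [(x - mean) / sigma for x in data]

print(statistics.mean(z))    # 0.0  (standardized mean is 0)
print(statistics.pstdev(z))  # 1.0  (standardized SD is 1)
```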
GRAPHING TECHNIQUES
BAR GRAPH: A graph that uses bars to indicate the frequency of occurrence of observations
o Histogram: A bar graph used with quantitative, continuous variables
FREQUENCY CURVE: A graph representing a frequency distribution in the form of a continuous line that traces a histogram
o Cumulative frequency curve: A continuous line that traces a histogram where bars in all the lower classes are stacked up in the adjacent higher class; cannot have a negative slope
o Symmetric curve: The frequency curve is unchanged if rotated around its center; median = mean
o Normal curve: Bell-shaped curve; symmetric
-Skewed curve: Deviates from symmetry; the frequency curve is shifted, with a longer "tail" to the left (mean < median) or to the right (mean > median)
PROBABILITY: A measure of the likelihood of a random event; the long-term relative frequency with which an outcome or event occurs
• Probability of occurrence of Event A: $P(A) = \frac{\text{number of outcomes favoring Event A}}{\text{total number of possible outcomes}}$
• Sample space: All possible simple outcomes of an
experiment
• Relationships between events
-Exhaustive: 2 or more events are said to be exhaustive if they represent all possible outcomes
• Symbolically, P(A or B or …) = 1
-Non-exhaustive: Two or more events are said to be
non-exhaustive if they do not exhaust all possible
outcomes
-Mutually exclusive: Events that cannot occur simultaneously: P(A and B) = 0, and P(A or B) = P(A) + P(B); ex: males, females
-Non-mutually exclusive: Events that can occur simultaneously: P(A or B) = P(A) + P(B) − P(A and B); ex: males, brown eyes
-Independent: Events whose probability is unaffected by the occurrence or nonoccurrence of each other: P(A|B) = P(A); P(B|A) = P(B); and P(A and B) = P(A)P(B); ex: gender and eye color
-Dependent: Events whose probability changes depending upon the occurrence or non-occurrence of each other: P(A|B) differs from P(A); P(B|A) differs from P(B); and P(A and B) = P(A)P(B|A) = P(B)P(A|B); ex: race and eye color
• Joint probabilities: Probability that 2 or more events occur simultaneously
• Marginal (or unconditional) probabilities: Obtained by summing joint probabilities over the other event
• Conditional probabilities: Probability of A given the occurrence of B, written P(A|B)
• Ex: Given the numbers 1 to 9 as observations in a sample space:
-Events mutually exclusive and complementary; ex: P(all odd numbers); P(all even numbers)
-Events mutually exclusive but not complementary; ex: P(an even number); P(the numbers 7 and 5)
-Events neither mutually exclusive nor exhaustive; ex: P(an even number or a 2)
FREQUENCY TABLE
[Two-way frequency table of Event C / Event D (columns) by Event E / Event F (rows), with totals; the individual counts were not recovered, except that the C & E cell is 52 out of a total of 220]
EX: Joint probability of C and E: P(C & E) = 52/220 ≈ 0.24
JOINT, MARGINAL & CONDITIONAL PROBABILITY TABLE

                         Event C      Event D      Marginal    Conditional
Event E                  0.24         0.16         0.40        (C|E)=0.60, (D|E)=0.40
Event F                  0.28         0.32         0.60        (C|F)=0.47, (D|F)=0.53
Marginal probability     0.52         0.48
Conditional probability  (E|C)=0.46   (E|D)=0.33
                         (F|C)=0.54   (F|D)=0.67
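The table's relationships can be sketched in Python. Only the C & E count (52) and the total (220) are stated in the guide; the other counts below are back-solved from its joint probabilities (0.16, 0.28, 0.32) and are illustrative:

```python
counts = {("C", "E"): 52, ("D", "E"): 36, ("C", "F"): 62, ("D", "F"): 70}
total = sum(counts.values())                        # 220

def joint(a, b):                                    # P(A and B)
    return counts[(a, b)] / total

def marginal(b):                                    # P(B), summing joints
    return sum(v for (a, bb), v in counts.items() if bb == b) / total

def conditional(a, b):                              # P(A | B)
    return joint(a, b) / marginal(b)

print(round(joint("C", "E"), 2))        # 0.24
print(round(marginal("E"), 2))          # 0.40
print(round(conditional("C", "E"), 2))  # 0.59, vs. the table's rounded 0.60
```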
Sampling distribution: A theoretical probability
distribution of a statistic that would result from
drawing all possible samples of a given size from
some population
- A random variable takes numeric values randomly, with probabilities specified by a probability distribution (or density) function
• Discrete random variables: Take only distinct values (as with discrete data)
• Binomial distribution: A model for the number (x) of successes in a series of n independent trials, where each trial results in success with probability p or failure with probability 1 − p; ex: the number (x) of heads ("successes") obtained in 12 (n) tosses of a fair (probability of heads = p = 0.5) coin
$P(x) = \binom{n}{x} p^x (1-p)^{n-x}$, where P(x) is the probability of exactly x successes out of n trials with a constant probability p of success on each trial, and $\binom{n}{x} = {}_nC_x = \frac{n!}{(n-x)!\,x!}$
-Binomial mean: $\mu = np$
-Binomial variance: $\sigma^2 = np(1-p)$
-As n increases, the binomial approaches the normal distribution
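The binomial formula translates directly into Python; this sketch reuses the guide's 12-toss fair-coin example:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The example above: x heads in n = 12 tosses of a fair coin (p = 0.5)
n, p = 12, 0.5
print(binomial_pmf(6, n, p))    # ~0.2256, the most likely count
print(n * p, n * p * (1 - p))   # mean = 6.0, variance = 3.0
```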
• Hypergeometric distribution:
-Represents the number of successes from a series of n trials, where each trial results in success or failure
-Like the binomial, except that each trial is drawn (without replacement) from a small population with N elements split between N₁ successes and N₂ failures
-Then the probability of splitting the n trials between x₁ successes and x₂ failures is: $P(x_1 \text{ and } x_2) = \frac{\binom{N_1}{x_1}\binom{N_2}{x_2}}{\binom{N}{n}}$
-Hypergeometric mean: $\mu_1 = E(x_1) = \frac{nN_1}{N}$, and variance: $\sigma^2 = \frac{N-n}{N-1}\left[\frac{nN_1}{N}\right]\left[\frac{N_2}{N}\right]$
• Poisson distribution: A model for the number of occurrences of an event, x = 0, 1, 2, …, counted over some fixed interval of space or time rather than some fixed number of trials; the parameter λ is the average number of occurrences
$P(x) = \frac{e^{-\lambda}\lambda^x}{x!}$ for x = 0, 1, 2, 3, … and λ > 0; otherwise P(x) = 0
-Poisson mean and variance: both equal λ
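A quick sketch of the Poisson formula in Python (standard library only; the 3-calls-per-hour rate is a made-up example, not from the guide):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(exactly x occurrences) when the average count is lam."""
    return exp(-lam) * lam**x / factorial(x)

# Hypothetical example: on average 3 calls per hour; P(exactly 5 calls)
print(poisson_pmf(5, 3.0))   # ~0.1008

# The mean really is lam: sum x * P(x) over enough terms
print(sum(x * poisson_pmf(x, 3.0) for x in range(100)))  # ~3.0
```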
- A continuous random variable may take on any value along an uninterrupted interval of a number line
- Probabilities are measured only over intervals, never for single values; the probability that a continuous random variable falls between two values is exactly equal to the area under the density curve between those two values
•Normal distribution: Bell curve; a distribution whose values cluster symmetrically around the
mean (also median and mode); common in nature and important in making inferences
- The density curve is the graph of: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$, where
f(x) = frequency (density) at a given value
σ = standard deviation of the normal distribution
μ = the mean of the normal distribution
x = value of the normally distributed variable
• Standard normal distribution: A normal distribution with a mean of 0 and standard deviation of 1; values following a normal distribution can be transformed to the standard normal distribution by using z-scores [see Measures of Dispersion, page 1]
STATISTICAL INFERENCE
• In order to make inferences about a population, which is unobserved, a random sample is drawn
-The sample is used to compute statistics, which are then used to draw probability conclusions about the parameters of the population
[Diagram: a Population (unobserved) is measured by random sampling to give a Sample (observed); the sample's Statistics are used, via statistical inference, to estimate the population's Parameters]
BIASED & UNBIASED ESTIMATORS
• Unbiased estimator of a parameter: An estimator (sample statistic) with an average value equal to the value of the parameter; ex: the sample mean is an unbiased estimator of the population mean; the average value of all possible sample means is the population mean; all other factors being equal, an unbiased estimator is preferable to a biased one
• Biased estimator of a parameter: An estimator (sample statistic) whose average value does not equal the value of the parameter; ex: the median is a biased estimator, since the average of sample medians is not always equal to the population median; the variance calculated from a sample by dividing by n is a biased estimator of the population variance; however, when calculated with n − 1 it is unbiased
-Note: Estimators themselves present only one source of bias; even when an unbiased estimator is used, bias in the sample (elements not all equally likely to be chosen) may still be present
-Elementary methods of inference assume unbiased sampling
-Sampling distribution: The probability distribution of a sample statistic that would result from drawing all possible samples of a given size from some population; because samples are drawn at random, every sample statistic is a random variable, and has a probability distribution that can be described using mean and standard deviation
• Standard error: The standard deviation of the estimator; do not confuse this with the standard deviation of the sample itself; the standard error measures the variability in the estimates around their expected value, while the standard deviation of the sample reflects the variability within the sample around the sample mean
-The standard deviation of all possible sample means of a given sample size, drawn from the same population, is called the standard error of the sample mean
-If the population standard deviation σ is known, the standard error is: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
-Usually, the population standard deviation σ is unknown and is estimated by s; in this case, the estimated standard error is: $s_{\bar{x}} = \frac{s}{\sqrt{n}}$
-Note: In either case, the standard error of the sample mean decreases as sample size is increased; a larger sample provides more reliable information about the population
HYPOTHESIS TESTING
• In a hypothesis test, sample data is used to accept or reject a null hypothesis (H₀) in favor of an alternative hypothesis (H₁); the significance level at which the null hypothesis can be rejected measures the strength of the evidence against the null hypothesis
• Null hypothesis (H₀): Always specifies a value (the null hypothesis value) for a population parameter; the null hypothesis is assumed to be true; this assumption underlies the computations for the hypothesis test; ex: H₀: "a coin is unbiased," that is, the proportion of heads is 0.5: H₀: p = 0.5
• Alternative hypothesis (H₁): Never specifies a value for a parameter; the alternative hypothesis states that a population parameter has some value different from the one specified under the null hypothesis; ex: H₁: a coin is biased; that is, the proportion of heads is not 0.5: H₁: p ≠ 0.5
1. Two-tailed (or nondirectional): An alternative hypothesis (H₁) that states only that the population parameter is simply different from the one specified under H₀; two-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is different from 65, we would use the two-tailed hypothesis test H₀: μ = 65 vs. H₁: μ ≠ 65
2. One-tailed (or directional): An alternative hypothesis (H₁) that states that the population parameter is greater than (right-tailed) or less than (left-tailed) the value specified under H₀; one-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is greater than 65, we would use the right-tailed hypothesis test H₀: μ = 65 vs. H₁: μ > 65
• The alternative hypothesis HI is also
sometimes known as the "research hypothesis,"
as only claims expressed as alternative
hypotheses can be positively asserted
• Level of significance: The probability of observing
sample results as extreme or more extreme than
those actually observed, under the assumption the
null hypothesis is true; if this probability is small
enough, we conclude there is sufficient evidence to
reject the null hypothesis; two basic approaches:
1. Fixed significance level (traditional method): A level of significance α is predetermined; commonly used significance levels are 0.01, 0.05, and 0.10
• The smaller the significance level α, the higher the standard for rejecting H₀; critical value(s) for the test statistic are determined such that the probability of the test statistic being farther from zero than the critical value (in one or two tails, depending on H₁) is α; if the test statistic falls beyond the critical value, in the rejection region, then H₀ can be rejected at that fixed significance level α
2. Observed significance level (p-value method): The test statistic is computed using the sample data, then the appropriate probability distribution is used to find the probability of observing a sample statistic that differs at least that much from the null hypothesis value for the population parameter (the probability value, or p-value); the smaller the p-value, the better the evidence against H₀
• This method is more commonly used by computer applications
• The p-value also represents the smallest significance level α at which H₀ can be rejected; thus, p-value results can be used with a fixed significance level by rejecting H₀ if p-value ≤ α
• Generally, the larger (farther from zero, positive
or negative) the value of the test statistic, the smaller the p-value will be, providing better evidence against the null hypothesis in favor of the alternative
• Notion of indirect proof: Through traditional hypothesis testing, the null hypothesis can never be proven true; ex: if we toss a coin 200 times and tails comes up exactly 100 times, we have no evidence the coin is biased, but we cannot prove the coin is fair; because of the random nature of sampling, it is possible to flip an unfair coin 200 times and get exactly 100 heads, just as it is possible to draw a sample from a population with mean 104.5 and find a sample mean of 101; failing to reject the null hypothesis does not prove it true, and rejecting it does not prove it false
• Two types of errors
-Type I error: Rejecting H₀ when it is actually true; the probability of a type I error is given by the significance level α; type I error is generally more prominent, as it can be controlled
-Type II error: Failing to reject H₀ when it is actually false; the probability of a type II error is denoted β; type II error is often (foolishly) disregarded: it is difficult to measure or control, as β depends on the true value of the parameter in question, which is unknown
True Status of H₀
                    H₀ True           H₀ False
Reject H₀           Type I error      Correct decision
Fail to reject H₀   Correct decision  Type II error
CENTRAL LIMIT THEOREM (for sample mean x̄)
If x₁, x₂, x₃, …, xₙ is a simple random sample of n elements from a large (infinite) population with mean μ and standard deviation σ, then the distribution of x̄ takes on the bell-shaped distribution of a normal random variable as n increases, and the distribution of the ratio $\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$ approaches the standard normal distribution as n goes to infinity; in practice, a normal approximation is acceptable for samples of size 30 or larger
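The theorem can be illustrated with a short simulation; this sketch draws from a uniform population (a made-up choice, not one of the guide's examples) and shows the sample means behaving as the theorem predicts:

```python
import random, statistics

# Sample means from a decidedly non-normal uniform(0, 1) population
# still pile up in a bell shape around mu = 0.5, with spread sigma/sqrt(n)
random.seed(1)
n = 30
means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
         for _ in range(10_000)]

print(statistics.mean(means))   # ~0.5, the population mean
print(statistics.stdev(means))  # ~ (1/sqrt(12)) / sqrt(30) = 0.0527
```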
z-TEST FOR THE MEAN
• Requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• Used when the population standard deviation σ is known: If σ is known (treated as a constant, not random) and the above conditions are met, then the distribution of the sample mean follows a normal distribution, and the test statistic z follows a standard normal distribution; note that this is rarely the case in practice, and the t-distribution is more widely used
• The test statistic is $z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}}$, where μ = population mean (either known or hypothesized under H₀) and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
• Critical region: The portion of the area under the curve that includes those values of the test statistic that provide sufficient evidence for the rejection of the null hypothesis
-The most often used significance levels are 0.01, 0.05, and 0.10; for a one-tailed test using the z-statistic, these correspond to z-values of 2.33, 1.65, and 1.28 respectively; positive values for a right-tailed test, negative for a left-tailed test
• For a two-tailed test, the critical region for α = 0.01 is split into two equal outer areas marked by z-values of ±2.58; for α = 0.05, the critical values of z are ±1.96, and for α = 0.10, the critical values are ±1.65
-Ex 1: Given a population with σ = 50, a simple random sample of n = 100 values is chosen, with a sample mean x̄ of 255; test, using the p-value method, H₀: μ = 250 vs. H₁: μ > 250; is there sufficient evidence to reject the null hypothesis?
• In this case, the test statistic z = (255 − 250)/(50/√100) = 5/5 = 1.00
• Looking at Table A, the area given for z = 1.00 is 0.3413; the area to its right (since H₁ is ">", this is a right-tailed test) is 0.5 − 0.3413 = 0.1587, or 15.87%
• This is the p-value: the probability, if H₀ is true (that is, if μ = 250), of obtaining a sample mean of 255 or greater; it also represents the smallest significance level α at which H₀ can be rejected
• Since, even if H₀ is true, the probability of obtaining a sample mean ≥ 255 from this population with a sample of size n = 100 is about 16%, it is quite plausible that H₀ is true; there is not very good evidence to support the alternative hypothesis that the population mean is greater than 250, so we fail to reject H₀
• It can't even be rejected at the weakest common significance level of α = 0.10, since 0.1587 > 0.10; remember, this doesn't prove the population mean to be equal to 250; we just haven't accumulated sufficient evidence against the claim
-Ex 2: A simple random sample of size n = 25 is taken from a population following a normal distribution with σ = 15; the sample mean x̄ is 95; use the p-value method to test H₀: μ = 100 vs. H₁: μ ≠ 100; is there sufficient evidence to reject the claim that the population mean is 100 at a significance level α of 0.10? At α = 0.05?
• In this case, the test statistic z = (95 − 100)/(15/√25) = −5/3 = −1.67
• Since the normal curve is symmetric, we can look up a z-score of 1.67; the value in Table A is 0.4525, that is, P(0 < z < 1.67) = P(−1.67 < z < 0) = 0.4525; thus, P(z < −1.67) = P(z > 1.67) = 0.5 − 0.4525 = 0.0475
• Since this is a two-tailed test (H₁: μ ≠ 100), the p-value is twice this area, or 0.095
• Since the p-value = 0.095 < 0.10 = α, there is sufficient evidence to reject the null hypothesis at a significance level α of 0.10; but in the second case, the p-value = 0.095 > 0.05 = α, so the sample data are not strong enough to reject at the stricter (0.05) level of significance
Table A: Normal Curve Areas, P(0 < z < z₀) (excerpt; columns give the second decimal of z, .00 through .09)

z    .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
0.0  .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1  .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.3  .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.6  .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.8  .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9  .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0  .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.7  .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8  .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
2.4  .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5  .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
[rows for other values of z not recovered]
t-TEST FOR THE MEAN
• Used when the population standard deviation σ is unknown and is estimated by the sample standard deviation s; requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• The test statistic for testing hypotheses about a population mean is $t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$, with n − 1 df; note: this is not so different from the test statistic z used when σ is known!
-The t-distribution is characterized by its degrees of freedom (df), referring to the number of values that are free to vary after placing certain restrictions on the data; ex: if the mean of 4 numbers is 87, we know that the sum of the numbers is 4 × 87 = 348; this tells us nothing about the individual values in the sample; there are infinitely many possibilities for the first three numbers (for instance, the first number might be 84, the second 98, the third 81); but if the first three numbers are 84, 98, and 81, then the fourth is forced to be 348 − 263 = 85; only 3 values are free to vary, so df = 3
-The t-distribution has heavier tails than the standard normal, producing a larger critical value of t as the boundary for the rejection region; as df increases, the t-distribution approaches the standard normal z
-Ex: A simple random sample of size 25 is taken from a population following a normal distribution, with a sample mean of 42 and a sample standard deviation of 7.5; test at a fixed significance level α = 0.05: H₀: μ = 45 vs. H₁: μ < 45
• Consulting Table B to find the appropriate critical value, with df = n − 1 = 24, produces a critical value of −1.711; the null hypothesis can be rejected at α = 0.05 if the value of the test statistic t < −1.711
• The test statistic t = (42 − 45)/(7.5/√25) = −3/1.5 = −2; since this is less than the critical value of −1.711, H₀ is rejected at α = 0.05
CONFIDENCE INTERVALS
• Confidence interval: Interval within which a population parameter is likely to lie; the confidence level 1 − α corresponds to the level of significance α
• Common confidence levels are 90%, 95%, and 99%, just as common levels of significance are 0.10, 0.05, and 0.01
• (1 − α) confidence interval for μ: $\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$, with n − 1 df; when σ is known, use instead the standard normal value $z_{\alpha/2}$ that puts an area of α/2 in each tail, with $\sigma/\sqrt{n}$
• A confidence interval computed from a sample contains exactly those hypothesized parameter values that would not be rejected by the corresponding two-tailed test on this sample; ex: if a 95% confidence interval for μ runs from 102 to 114, any hypothesized μ below 102 or above 114 would be rejected at 0.05 significance
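A sketch tying the t-test example and the confidence-interval duality together; the critical values 1.711 (one-tailed) and 2.064 (two-tailed) are t-table lookups for df = 24 at α = 0.05, and the CI numbers are hypothetical, chosen only to land near the (102, 114) interval mentioned above:

```python
from math import sqrt

# Left-tailed t-test example above: n = 25, x-bar = 42, s = 7.5
n, xbar, s = 25, 42.0, 7.5
t = (xbar - 45.0) / (s / sqrt(n))
print(t, t < -1.711)   # -2.0, True -> reject H0: mu = 45 at a = 0.05

# 95% CI sketch with hypothetical data (x-bar = 108, s = 15, n = 25)
xbar2, s2 = 108.0, 15.0
margin = 2.064 * s2 / sqrt(n)
print(xbar2 - margin, xbar2 + margin)   # ~ (101.8, 114.2)
```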
COMPARING SAMPLE MEANS
• Sampling distribution of the difference between means: If a number of pairs of samples were taken from the same population or from two different populations, then:
-The distribution of differences between pairs of sample means tends to be normal (z-distribution)
-The mean of these differences between means, $\mu_{\bar{x}_1-\bar{x}_2}$, is equal to the difference between the population means, that is, μ₁ − μ₂
• Independent samples
-We are testing whether or not two samples are drawn from populations with the same mean, that is, H₀: μ₁ = μ₂; the test may be two-tailed (= vs. ≠) or one-tailed (ex: H₀: μ₁ ≥ μ₂ vs. H₁: μ₁ < μ₂)
-When σ₁ and σ₂ are known, the standard error of the difference between means is $\sigma_{\bar{x}_1-\bar{x}_2} = \sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}$
-Where (μ₁ − μ₂) represents the hypothesized difference in means, the following statistic can be used for hypothesis tests: $z = \frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{\sigma_{\bar{x}_1-\bar{x}_2}}$
-When σ₁ and σ₂ are unknown, which is usually the case, substitute s₁ and s₂ for σ₁ and σ₂, respectively, in the above formulas, and use the t-distribution with df = (n₁ − 1) + (n₂ − 1) = n₁ + n₂ − 2
-Requirements: both populations have normal distributions, and σ₁ and σ₂ are not known but assumed equal (homogeneity of variance)
-Use the following formula for the estimated standard error, pooled across the two samples: $s_{\bar{x}_1-\bar{x}_2} = \sqrt{\left[\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\right]\left[\frac{n_1+n_2}{n_1 n_2}\right]}$
-Many statisticians do not recommend the t-distribution with pooled standard error; keeping the sample variances separate is more conservative
-Homogeneity of variances (a criterion for the pooled 2-sample t-test): The condition that the variances of the two populations are equal; to establish homogeneity of variances, test H₀: σ₁² = σ₂² vs. H₁: σ₁² ≠ σ₂² (note that this is equivalent to testing H₀: σ₁²/σ₂² = 1 vs. H₁: σ₁²/σ₂² ≠ 1); under the null hypothesis, the test statistic s₁²/s₂² follows an F-distribution with degrees of freedom (n₁ − 1, n₂ − 1); if the test statistic exceeds the critical value in Table C, then the null hypothesis can be rejected at the indicated level of significance
[Table C: Critical points of the F-distribution; degrees of freedom for the numerator (1 through 10) run across the columns, degrees of freedom for the denominator run down the rows, and each cell pairs the α = .05 point (top row) with the α = .01 point (bottom row); ex: for (1, 1) df the points are 161 and 4052; the remaining rows are not reproduced here]
PAIRED (DEPENDENT) SAMPLES
• Paired observations on the same elements (or on elements matched in some other way) can be used to test for the mean difference; for instance, clients of a weight-loss program might be weighed before and after the program, and a significant mean difference ascribed to the effectiveness of the program
-Standard error of the mean difference: $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$, where $s_d$ is the standard deviation of the observed differences
-We can then test H₀: μ_d = 0 versus a one- or two-tailed alternative by using a t-test statistic
ANALYSIS OF VARIANCE (ANOVA)
• Tests whether a difference exists between more than two group means; consists of obtaining independent estimates of variance from population subgroups; the total sum of squares is partitioned into known components of variation
• Partition of variances
-Between-group variance (BGV): Reflects the magnitude of the difference(s) among the group means: $BGV = \frac{\sum n_i(\bar{x}_i - \bar{x}_{tot})^2}{k-1}$, where $\bar{x}_i$ = mean of the i-th treatment group and $\bar{x}_{tot}$ = mean of all n values
-Within-group variance (WGV): Reflects the dispersion within each treatment group; also referred to as the error term: $WGV = \frac{SS_1 + SS_2 + \dots + SS_k}{n-k}$, where the SS's are the sums of squares [see Measures of Central Tendency, page 1] of each subgroup's values around the subgroup mean
• USING THE F-RATIO: F = BGV/WGV, with k − 1 degrees of freedom for the numerator and n − k for the denominator
-When the BGV is large relative to the WGV, the F-ratio will also be large; if BGV > WGV, the experimental treatments are responsible for the large differences among group means; a large F-ratio indicates the possibility of an overall mean effect of the treatments
-Determine the critical region for rejection by assigning an acceptable level of significance and consulting the F-table (Table C) with (k − 1, n − k) degrees of freedom
-Null hypothesis: The group sample means are all estimates of a common population mean; that is, H₀: μ₁ = μ₂ = μ₃ = … = μ_k for all k treatment groups, vs. H₁: at least one pair of means is different (determining which pair(s) are different requires follow-up testing)
SAMPLE PROPORTIONS
• In random samples of size n, the sample proportion p fluctuates around the population proportion π with variance $\frac{\pi(1-\pi)}{n}$ and standard error $\sqrt{\frac{\pi(1-\pi)}{n}}$
• As the sample size increases, the sample proportion concentrates more around its target mean, and its distribution gets closer to the normal
CORRELATION
• The correlation coefficient r (also known as the "Pearson Product-Moment Correlation Coefficient") is a measure of the linear (straight-line) relationship between two quantitative variables
• Ex: Given observations on two variables x and y, we can compute their corresponding sums of squares $SS_x = \sum(x-\bar{x})^2$ and $SS_y = \sum(y-\bar{y})^2$, along with $SS_{xy} = \sum(x-\bar{x})(y-\bar{y})$
• The formula for the Pearson correlation: $r = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}$
• When r = 1, the data are said to have perfect positive correlation; if plotted, they would form a straight line with positive (upward) slope; when r = −1, the data have perfect negative correlation; if plotted, they would form a straight line with negative (downward) slope; if r = 0, the data are said to have no linear correlation
• Note: it is possible for a sample from a population with zero correlation to produce, by chance, a sample with r ≠ 0 (see the test for linear correlation below)
CHI-SQUARE (χ²) TESTS
• Most widely used non-parametric test
-The χ² mean = its degrees of freedom
-The χ² variance = twice its degrees of freedom
-Can be used to test independence, homogeneity, and goodness-of-fit
-The square of a standard normal variable is a chi-square variable with df = 1
-The shape of the chi-square distribution depends on the value of df
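The two distributional facts above (a squared standard normal is χ² with 1 df; the χ² mean equals its df and its variance is twice the df) can be checked by simulation; df = 4 here is an arbitrary choice:

```python
import random

random.seed(7)
df, trials = 4, 50_000
# A chi-square draw with df = 4: sum of 4 squared standard normals
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(df))
         for _ in range(trials)]

mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / (trials - 1)
print(round(mean, 2), round(var, 2))   # close to 4 and 8
```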
DEGREES OF FREEDOM (df) COMPUTATION
• If chi-square tests for the goodness-of-fit to a hypothesized distribution (uses a frequency distribution), df = g − 1, where g = number of groups, or classes, in the frequency distribution
• If chi-square tests for homogeneity or independence (uses a two-way contingency table), df = (# of rows − 1)(# of columns − 1)
GOODNESS-OF-FIT TEST: To apply the chi-square distribution in this manner, the test statistic is $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where f₀ = observed frequency of the variable and fₑ = expected frequency (based on the hypothesized population distribution)
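A minimal goodness-of-fit sketch with hypothetical die-roll counts (not from the guide); the 11.07 cutoff is the standard χ² critical value for 5 df at α = 0.05:

```python
# H0: the die is fair (all six faces equally likely)
observed = [8, 9, 12, 11, 6, 14]        # counts from 60 rolls
expected = [sum(observed) / 6] * 6      # 10 per face under H0

chi_sq = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1                  # g - 1 = 5
print(chi_sq, df)   # 4.2 with 5 df, well below the 0.05 cutoff of 11.07
```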
TESTS OF CONTINGENCY: Application of chi-square tests to two separate populations to test statistical independence of attributes
TESTS OF HOMOGENEITY: Application of chi-square tests to two samples to test if they came from populations with like distributions
RUNS TEST: Tests whether a sequence (to comprise a sample) is random; the following equations are applied:
$\bar{R} = \frac{2n_1 n_2}{n_1+n_2} + 1$ and $s_R = \sqrt{\frac{2n_1 n_2(2n_1 n_2 - n_1 - n_2)}{(n_1+n_2)^2(n_1+n_2-1)}}$, where
R̄ = mean number of runs
n₁ = number of outcomes of one type
n₂ = number of outcomes of the other type
s_R = standard deviation of the distribution of the number of runs
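The runs-test formulas translate directly into code; the 10-heads/10-tails sequence with 6 runs is a made-up illustration:

```python
from math import sqrt

def runs_test_z(n1, n2, runs):
    """z-statistic for the runs test, using the mean and SD above."""
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mean_r) / sqrt(var_r)

# Hypothetical sequence: 10 heads and 10 tails observed in only 6 runs
print(runs_test_z(10, 10, 6))   # ~ -2.30, suspiciously few runs
```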
TEST FOR LINEAR CORRELATION
With a simple random sample of size n producing a sample correlation coefficient r, it is possible to test for the linear correlation in the population, ρ; that is, we conduct the hypothesis test H₀: ρ = ρ₀, versus a right-, left-, or two-tailed alternative; usually we are interested in determining whether there is any linear correlation at all, that is, ρ₀ = 0
The test statistic is $t = \frac{r}{\sqrt{(1-r^2)/(n-2)}}$, which follows a t-distribution with n − 2 degrees of freedom under H₀; this hypothesis test assumes that the sample is drawn from a population with a bivariate normal distribution
• Ex: A simple random sample of size 27 produces a correlation coefficient r = −0.41; is there sufficient evidence at α = 0.05 of negative linear correlation in the population? We need a left-tailed test: H₀: ρ = 0 vs. H₁: ρ < 0, using the t-distribution with n − 2 = 25 degrees of freedom; the critical value is −1.708
• The test statistic is $t = \frac{-0.41}{\sqrt{(1-(-0.41)^2)/(27-2)}} \approx -2.25$; since −2.25 < −1.708, there is sufficient evidence to reject the null hypothesis of no linear correlation and support the alternative hypothesis of a negative linear correlation at α = 0.05
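The worked example translates directly to code (r = −0.41 and the −1.708 critical value are taken from the example above):

```python
from math import sqrt

n, r = 27, -0.41
t = r / sqrt((1 - r**2) / (n - 2))
print(round(t, 2))    # -2.25
print(t < -1.708)     # True -> reject H0: rho = 0 (left-tailed, a = 0.05)
```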
REGRESSION
Regression is a method for predicting values of one variable (the outcome or dependent variable)
on the basis of the values of one or more
independent or predictor variables; fitting a regression model is the process of using sample data to determine an equation to represent the relationship
In a simple linear regression model, we use only one predictor variable and assume that the relationship to the outcome variable is linear; that is, the graph of the regression equation is that of a straight line; (we often refer to the "regression line"); for the entire population, the model can be expressed as:
y = β₀ + β₁x + e
y is called the dependent variable (or outcome variable), as it is assumed to depend on a linear relationship to x
x is the independent variable, also called the predictor variable
β₀ is the intercept of the regression line; that is, the predicted value for y when x = 0
β₁ is the slope of the regression line: the marginal change in y per unit change in x
e refers to random error; the error term is assumed to follow a normal distribution with a mean of zero and constant variation; that is, there should be no increase or decrease in dispersion for different regions along the regression line; in addition, it is assumed that error terms are independent for different (x, y) observations
On the basis of sample data, we find estimates b₀ and b₁ of the intercept β₀ and slope β₁; this gives us the estimated (or sample) regression equation ŷ = b₀ + b₁x
The parameter estimates b₀ and b₁ can be derived in a variety of ways; one of the most common is known as the method of least squares; least squares estimates minimize the sum of squared differences between predicted and actual values of the dependent variable y
For a simple linear regression model, the least squares estimates of the intercept and slope are:
estimated slope: $b_1 = SS_{xy}/SS_x$
estimated intercept: $b_0 = \bar{y} - b_1\bar{x}$
These estimates, and other calculations in regression, involve sums of squares:
$SS_{xy} = \sum(x-\bar{x})(y-\bar{y}) = \sum xy - (\sum x)(\sum y)/n$
$SS_x = \sum(x-\bar{x})^2 = \sum x^2 - (\sum x)^2/n$
$SS_y = \sum(y-\bar{y})^2 = \sum y^2 - (\sum y)^2/n$
Ex: A simple random sample of 8 cars provides data on engine displacement x and highway mileage y; fit a simple linear regression model
[Data table with columns x (displacement), y (mileage), x², y², xy; only two rows were recovered: x = 4.6, y = 17 (x² = 21.16, y² = 289, xy = 78.2) and x = 1.6, y = 32 (x² = 2.56, y² = 1024, xy = 51.2)]
Fitting the model entails computing the least-squares estimates b₀ and b₁; note that there are 8 observations, that is, n = 8
First, $SS_{xy} = \sum xy - (\sum x)(\sum y)/n = -54.9$, $SS_x = \sum x^2 - (\sum x)^2/n = 17.26$, and $SS_y = \sum y^2 - (\sum y)^2/n = 268$; then the estimated slope is b₁ = SSxy/SSx = −3.18, and the estimated intercept is b₀ = ȳ − b₁x̄ = 32.54; the estimated regression model, then, is: mileage = 32.54 − 3.18 × displacement
We can assess the significance of the model by testing to see if the sample provides sufficient evidence of a linear relationship in the population; that is, we conduct the hypothesis test H₀: β₁ = 0 versus H₁: β₁ ≠ 0; this is exactly equivalent to testing for linear correlation in the population, H₀: ρ = 0 versus H₁: ρ ≠ 0; the test for correlation is somewhat simpler:
The correlation coefficient $r = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}} = -0.8072$
The test statistic $t = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}} = -3.350$
Consulting Table B with degrees of freedom = n − 2 = 6, we obtain a critical value of 3.143 at α = 0.02 and a critical value of 3.707 at α = 0.01; since we have a two-tailed test, we should consider the absolute value of the test statistic, which exceeds 3.143 but does not exceed 3.707; that is, we can reject H₀ at α = 0.02 but not at α = 0.01, so the p-value is between 0.01 and 0.02 (the actual p-value, which can be found using computer applications, is 0.0154); this is a reasonably significant model
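The least-squares formulas above translate into a short sketch; the data are hypothetical (only two rows of the guide's 8-car table survive, and both are included), so these estimates will not match the guide's −3.18 and 32.54:

```python
xs = [1.6, 2.4, 3.0, 3.3, 3.8, 4.6, 5.0, 5.7]   # displacement
ys = [32, 28, 26, 24, 22, 17, 16, 14]           # highway mileage

n = len(xs)
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n

b1 = ss_xy / ss_x                     # estimated slope, ~ -4.57 here
b0 = sum(ys) / n - b1 * sum(xs) / n   # estimated intercept, ~ 39.2 here
print(b1, b0)
```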
LINEAR DETERMINATION
Regression models are also assessed by the coefficient of linear determination, r²; this represents the proportion of total variation in y that is explained by the regression model; the coefficient of linear determination can be calculated in a variety of ways; the easiest is to compute r² = (r)²; that is, the coefficient of determination is the square of the coefficient of correlation
RESIDUALS
The difference between an observed and a fitted value of y, (y − ŷ), is called a residual; examining the residuals is useful to identify outliers (observations far from the regression line, representing unusual values for x and y) and to check the assumptions of the model