Essential Tools for Understanding Statistics & Probability - Rules, Concepts, Variables, Equations, Problems, Helpful Hints & Common Pitfalls
DESCRIPTIVE STATISTICS
Methods used to simply describe a data set that has been observed
quantitative data: data variables that represent some numeric quantity (is a numeric measurement)
categorical (qualitative) data: data variables with values that reflect some quality of the element; one of several categories, not a numeric measurement
population: "the whole"; the entire group of which we wish to speak or that we intend to measure
sample: "the part"; a representative subset of the population
simple random sampling: the most commonly assumed method for selecting a sample; samples are chosen so that every possible sample of the same size is equally likely to be the one that is selected

Sample Problems
1 A student receives the following exam grades in a course: 67, 88, 75, 82, 78
a Compute the mean: $\bar{x} = \frac{\sum x}{n} = \frac{67 + 88 + 75 + 82 + 78}{5} = \frac{390}{5} = 78$
b What is the median exam score? In order, the scores are: 67, 75, 78, 82, 88; middle element = 78
c What is the range? range = maximum - minimum = 88 - 67 = 21
d Compute the standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{(67 - 78)^2 + (88 - 78)^2 + (75 - 78)^2 + (82 - 78)^2 + (78 - 78)^2}{5 - 1}} = \sqrt{\frac{246}{4}} = \sqrt{61.5} = 7.84$
e What is the z score for the exam grade of 88? $z = \frac{x - \bar{x}}{s} = \frac{88 - 78}{7.84} = \frac{10}{7.84} = 1.28$
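The exam-grade problem (grades 67, 88, 75, 82, 78) can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and not part of the guide.

```python
import statistics

grades = [67, 88, 75, 82, 78]

mean = statistics.mean(grades)            # sum of the values divided by n
median = statistics.median(grades)        # middle element after sorting
data_range = max(grades) - min(grades)    # maximum - minimum
s = statistics.stdev(grades)              # sample standard deviation (divides by n - 1)
z_88 = (88 - mean) / s                    # z score of the grade 88

print(mean, median, data_range, round(s, 2), round(z_88, 2))
```

Note that `statistics.stdev` uses the n - 1 divisor, which matches the sample standard deviation used in the worked problem; `statistics.pstdev` would divide by n instead.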
x: the value of an observation
f: the frequency of an observation (i.e., the number of times it occurs)
frequency table: a table that lists the values observed in a data set along with the frequency with which each occurs
(population) parameter: some numeric measurement that describes a population; generally not known, but estimated from sample statistics
EX: population mean: μ; population standard deviation: σ; population proportion: p (sometimes denoted π)
(sample) statistic: some numeric measurement used to describe data in a sample, used to estimate or make inferences about population parameters
EX: sample mean: $\bar{x}$; sample standard deviation: s; sample proportion: $\hat{p}$

2 …unity are surveyed as to how many times they've been married; the results are given in the following frequency table:
x = # of marriages:    0   1   2   3   4
f = # of observations: 13  42  37  12  6   (Σf = 110 = n)
xf:                    0   42  74  36  24  (Σxf = 176)
a Compute the mean: $\bar{x} = \frac{\sum xf}{n} = \frac{176}{110} = 1.6$
b Compute the median: Since n = Σf = 110, an even number, the median is the average of the observations with ranks $\frac{n}{2}$ and $\frac{n}{2} + 1$ (i.e., the 55th and 56th observations)
Hint: While we could count from either side of the distribution (from 0 or from 4), it is easier here to count from the bottom: The first 13 observations in rank order are all 0; the next 42 (the 14th through the 55th) are all 1; the 56th through the 92nd are all 2; since the 55th is a 1 and the 56th is a 2, the median is the average: (1 + 2) / 2 = 1.5
c Compute the IQR: To find the IQR, we must first compute Q1 and Q3; if we divide n in half, we have a lower 55 and an upper 55 observations; the "median" of each would have rank (55 + 1) / 2 = 28; the 28th observation in the lower half is a 1, so Q1 = 1, and the 28th observation in the upper half is a 2, so Q3 = 2; therefore, IQR = Q3 - Q1 = 2 - 1 = 1
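The frequency-table problem (x = 0, 1, 2, 3, 4 with f = 13, 42, 37, 12, 6) can be reproduced in Python as a rough check. The sketch below expands the table into raw observations and takes Q1 and Q3 as the medians of the lower and upper halves, as the worked solution does; other quartile conventions exist and can give different values on other data.

```python
import statistics

# frequency table: value of x -> frequency f
freq = {0: 13, 1: 42, 2: 37, 3: 12, 4: 6}

n = sum(freq.values())                            # n is the sum of the frequencies
mean = sum(x * f for x, f in freq.items()) / n    # sum of x*f divided by n

# expand the table into a sorted list of individual observations
data = sorted(x for x, f in freq.items() for _ in range(f))
median = statistics.median(data)

# quartiles as the "medians" of the lower and upper halves
half = n // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[n - half:])
iqr = q3 - q1

print(n, mean, median, q1, q3, iqr)
```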
measures of center (indicate which value is typical for the data set)
mean: from raw data, $\bar{x} = \frac{\sum x}{n}$; from a frequency table, $\bar{x} = \frac{\sum xf}{n}$; sensitive to extreme values; any outlier will influence the mean
median: the middle element in order of rank; n odd: median has rank $\frac{n + 1}{2}$; n even: median is the average of values with ranks $\frac{n}{2}$ and $\frac{n}{2} + 1$; not sensitive to extreme values; more useful when data are skewed
mode: the observation with the highest frequency; only measure of center appropriate for categorical data
mid-range: $\frac{\text{maximum} + \text{minimum}}{2}$; not often used; highly sensitive to unusual values; easy to compute

measures of variation (measures of dispersion) (reflect the variability of the data, i.e., how different the values are)
sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$; not often used; units are the squares of those for the data
sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
interquartile range (IQR): IQR = Q3 - Q1 (see quartile, below); less sensitive to extreme values

measures of relative standing (measures of relative position) (indicate how a particular value compares to the others in the same data set)
percentile: data divided into 100 equal parts by rank (i.e., the kth percentile is that value greater than k% of the others); important to apply to normal distributions (see probability distributions)
quartile: data divided into 4 equal parts by rank: Q3 (third quartile) is the value greater than 3/4 of the others; Q1 (first quartile) is greater than 1/4; Q2 is identical to the median; used to compute IQR (see IQR, above); Q3 is often viewed as the "median" of the upper half, and Q1 as the "median" of the lower half; Q2 is the median of the data set
z score: $z = \frac{x - \bar{x}}{s}$ (in units of standard deviation); to find the value of some observation, x, when the z score is known: $x = \bar{x} + zs$
KEY TERMS & SYMBOLS
probability experiment: any process with an outcome regarded as random
sample space (S): the set of all possible outcomes from a probability experiment
events (A, B, C, etc.): subsets of the sample space; many problems are best solved by a careful consideration of the defined events
P(A): the probability of event A; for any event A, 0 ≤ P(A) ≤ 1, and for the entire sample space S, P(S) = 1
"equally likely outcomes": a very common assumption in solving problems in probability; if all outcomes in the sample space S are equally likely, then the probability of some event A can be calculated as
$P(A) = \frac{\text{number of simple outcomes in } A}{\text{total number of simple outcomes}}$

EX (probability experiment: sample space):
toss a fair coin twice: {HH, HT, TH, TT}
Hint: there are two ways to get heads just once
roll a fair die: {1, 2, 3, 4, 5, 6}
roll two fair dice: {(1,1), (1,2), (1,3), …, (2,1), (2,2), (2,3), …, (6,4), (6,5), (6,6)}
Hint: a total of 36 outcomes: six for the first die, times another six for the second die
have a baby: {boy, girl} or {B, G}
pick an orange from one of the trees in a grove, and weigh it: {some positive real number, in some unit of weight}
Hint: this would be a continuous sample space
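Under the "equally likely outcomes" assumption, a probability is just a count of outcomes in the event divided by the size of the sample space. The sketch below enumerates two of the sample spaces listed above; the helper `probability` and the event "the two dice sum to 7" are illustrative choices, not taken from the guide.

```python
from fractions import Fraction
from itertools import product

def probability(event, sample_space):
    """P(event) = (# outcomes in the event) / (total # outcomes), assuming equal likelihood."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

# toss a fair coin twice: HH, HT, TH, TT
coin_tosses = ["".join(t) for t in product("HT", repeat=2)]
p_one_head = probability(lambda o: o.count("H") == 1, coin_tosses)

# roll two fair dice: 6 x 6 = 36 ordered pairs
dice = list(product(range(1, 7), repeat=2))
p_sum_7 = probability(lambda o: sum(o) == 7, dice)   # hypothetical event for illustration

print(coin_tosses, p_one_head, len(dice), p_sum_7)
```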
Rule: Formula
addition rule ("or"): P(A or B) = P(A) + P(B) - P(A and B); if A and B are disjoint, P(A or B) = P(A) + P(B)
Hint: Subtract P(A and B) so as not to count twice the elements of both A and B
Hint: Knowing that events are disjoint can make things much easier, since otherwise P(A and B) can be difficult to find
complementary events: the complement of event A (denoted A^C or A') means "not A"; it consists of all simple outcomes in S that are not in A; P(A) + P(A^C) = 1 (any event will either happen, or not); thus, P(A) = 1 - P(A^C), and P(A^C) = 1 - P(A)
Hint: The law of complements is a useful tool, since it's often easier to find the probability that an event does NOT occur
multiplication rule ("and"): P(A and B) = P(A)P(B|A); equivalently, P(A and B) = P(B)P(A|B); if A and B are independent, P(A and B) = P(A)P(B)
Hint: While it doesn't matter whether we "condition on A" (first) or "condition on B" (second), generally the information available will require one or the other
independent events: the occurrence of one event does not affect the probability of the other, and vice versa; P(A|B) = P(A), and P(B|A) = P(B), so P(A and B) = P(A)P(B)
Hint: Events are often assumed to be independent, particularly repeated trials
conditional probability rule ("given that"): $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$; $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
Hint: By multiplying both sides by P(B) or P(A), we see this is a rephrasing of the multiplication rule; conditional probabilities are often difficult to assess; an alternative way of thinking about "P(A|B)" is that it is the …
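The addition, complement, multiplication, and conditional probability rules can be verified by brute-force counting on a small sample space. The sketch below uses two fair dice; the events A ("the first die shows 3") and B ("the dice sum to 7") are hypothetical examples chosen for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    """Probability of an event by counting equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] == 3       # first die shows 3 (hypothetical event)

def B(o):
    return sum(o) == 7     # dice sum to 7 (hypothetical event)

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_or_b = prob(lambda o: A(o) or B(o))
p_not_a = prob(lambda o: not A(o))

addition_ok = p_a_or_b == p_a + p_b - p_a_and_b      # addition rule
complement_ok = p_not_a == 1 - p_a                   # law of complements
p_b_given_a = p_a_and_b / p_a                        # conditional probability rule
multiplication_ok = p_a_and_b == p_a * p_b_given_a   # multiplication rule
independent = p_b_given_a == p_b                     # P(B|A) = P(B) means independence

print(p_a_or_b, p_b_given_a, independent)
```

For this particular pair of events, P(B|A) equals P(B), so A and B happen to be independent; with a different event B (say, "the dice sum to 4") they would not be.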
When some number is derived from a probability experiment, it is called a random variable
Every random variable has a probability distribution that determines the probabilities of particular values
For instance, when you roll a fair, six-sided die, the resulting number (X) is a random variable, with the following uniform probability distribution:
X:    1    2    3    4    5    6
P(X): 1/6  1/6  1/6  1/6  1/6  1/6
In the table above, P(X) is called the probability distribution function (pdf)
Since each value of P(X) represents a probability, pdf's must follow the basic probability rules: P(X) must always be between 0 and 1, and all of the values P(X) sum to 1
Other probability distributions are continuous: They do not assign specific probabilities to specific values, as above in the … a range of values, using the area under the curve of a prob…

total probability rule: To find the probability of an event A if the sample space is partitioned into several disjoint and exhaustive events D1, D2, D3, …, Dk, then, since A must occur along with one and only one of the D's:
P(A) = P(A and D1) + P(A and D2) + … + P(A and Dk)
= P(D1)P(A|D1) + P(D2)P(A|D2) + … + P(Dk)P(A|Dk)
Hint: The total probability rule may look complicated, but it isn't! (see sample problem 3a, next page)

Bayes' Theorem: With two events, A and B, using the total probability rule:
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{P(A \text{ and } B)}{P(A \text{ and } B) + P(A \text{ and } B')} = \frac{P(B)P(A \mid B)}{P(B)P(A \mid B) + P(B')P(A \mid B')}$
Hint: … probability statement, and is the only generally valid method!
… ("expectation") and standard deviation of random variables; if we can characterize a random variable as belonging to some major family (see …) we can find the mean and standard deviation easily; in general, we have:
discrete (… number of specific values): $E(X) = \sum x P(x)$; $\sigma = \sqrt{\sum x^2 P(x) - [E(X)]^2}$
continuous (… possible values, and P(X) can be measured only …): $E(X) = \int x f(x)\,dx$; $\sigma = \sqrt{\int x^2 f(x)\,dx - [E(X)]^2}$
Hint: Fortunately, most useful continuous probability distributions do not require integration in practice; other formulas and tables are used

Sample Problems
1 Discrete random variable, X, follows the following probability distribution: …
X·P(X): 0, 0.25, 0.8, 0.6; sum = 1.65 = E(X)
b What is the standard deviation of X?
$\sigma = \sqrt{3.65 - 1.65^2} = \sqrt{0.9275} = 0.963$
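The expected value and standard deviation of a discrete random variable can be computed directly from its probability table. In the sketch below, the distribution X = 0, 1, 2, 3 with probabilities 0.15, 0.25, 0.40, 0.20 is an assumption: it is one distribution consistent with the surviving numbers in the sample problem (E(X) = 1.65 and a mean of squares of 3.65), not a table confirmed by the guide.

```python
from math import sqrt

# ASSUMED distribution (value -> probability), chosen to match E(X) = 1.65
dist = {0: 0.15, 1: 0.25, 2: 0.40, 3: 0.20}

total = sum(dist.values())                                # must be 1 for a valid pdf
expected = sum(x * p for x, p in dist.items())            # E(X) = sum of x * P(x)
expected_sq = sum(x * x * p for x, p in dist.items())     # E(X^2) = sum of x^2 * P(x)
sd = sqrt(expected_sq - expected ** 2)                    # sqrt(E(X^2) - E(X)^2)

print(round(total, 10), round(expected, 2), round(expected_sq, 2), round(sd, 3))
```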
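Discrete probability distributions (surviving table entries):
uniform: all outcomes are equally likely
binomial: mean = np; standard deviation = $\sqrt{np(1 - p)}$; X = number of times the event occurs; p = probability that the event occurs on a given trial
Poisson: events occur independently, at some average rate per interval; λ = average rate per interval
geometric: $P(X) = (1 - p)^{x - 1} p$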
Sample Problems
1 … without replacement; … taken out; in situations like this, without any other information, we should assume that each sock is equally likely to be chosen
a both socks are black
P(first is black and second is black) = P(first is black)P(second is black | first is black)
$= \frac{9}{20} \times \frac{8}{19} = \frac{9 \times 8}{20 \times 19} = \frac{72}{380} = 0.189$
b both socks are white
… the socks total, after selecting the first:
$\frac{5}{20} \times \frac{4}{19} = \frac{5 \times 4}{20 \times 19} = \frac{20}{380} = 0.053$
c the socks match
$\frac{9}{20} \times \frac{8}{19} + \frac{6}{20} \times \frac{5}{19} + \frac{5}{20} \times \frac{4}{19} = \frac{122}{380} = 0.321$
d the socks DO NOT match
… possibilities, too; it is much safer, as well as easier, to use the rule for complements; common sense dictates that the socks will either match or not match, so:
P(socks do not match) = 1 - P(socks match) = 1 - 0.321 = 0.679
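The sock problem is a direct application of the multiplication rule without replacement, followed by the law of complements. The sketch below assumes a drawer of 20 socks with 9 black, 5 white, and 6 of a third color; these counts are inferred from the fractions in the worked arithmetic, and the label "other" is a placeholder because the third color is not recoverable from the text.

```python
from fractions import Fraction

# ASSUMED counts, inferred from 9/20 x 8/19, 6/20 x 5/19 and 5/20 x 4/19
socks = {"black": 9, "other": 6, "white": 5}
total = sum(socks.values())

def p_both(color):
    """P(first and second sock are both `color`), drawing without replacement."""
    k = socks[color]
    return Fraction(k, total) * Fraction(k - 1, total - 1)

p_black = p_both("black")
p_white = p_both("white")
p_match = sum(p_both(c) for c in socks)   # disjoint cases, so the probabilities add
p_no_match = 1 - p_match                  # law of complements

print(round(float(p_black), 3), round(float(p_white), 3),
      round(float(p_match), 3), round(float(p_no_match), 3))
```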
2 In a particular county, 88% of homes have air conditioning, 27% have a …
a air conditioning OR a pool?
… pool) = 1 - 0.92 = 0.08
… that," a conditional probability is indicated:
… / P(AC) = … / 0.88
d has air conditioning, given that it has a pool?
… than pools … conditioning.]
P(AC | pool) = P(pool and AC) / P(pool) …
3 … of the motors they buy are defective; the defective rate is 6% for supplier A, … random; find the probability that
… P(A) = 0.5, P(B) = 0.4, and P(C) = 0.1
P(D|A) = 0.06, P(D|B) = 0.05, and P(D|C) = 0.3
… tive: P(D^C|A) = 0.94, P(D^C|B) = 0.95, and P(D^C|C) = 0.7
a the motor is defective
P(D) = P(A and D) + P(B and D) + P(C and D) = P(A)P(D|A) + P(B)P(D|B) + P(C)P(D|C) = (0.5)(0.06) + (0.4)(0.05) + (0.1)(0.3) = 0.03 + 0.02 + 0.03 = 0.08
b …
Hint: This is like asking, "What proportion of the defectives come from supplier C?"
Denote this probability as P(C|D); we began with P(D|C) (among other probabilities); we are effectively using Bayes' Theorem to reverse the order; however, we already have P(D), so:
$P(C \mid D) = \frac{P(C \text{ and } D)}{P(D)} = \frac{0.03}{0.08} = 0.375$
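The supplier problem combines the total probability rule with Bayes' Theorem. A minimal sketch, using the supplier shares (0.5, 0.4, 0.1) and defective rates (0.06, 0.05, 0.3) that appear in the worked calculation of P(D); the dictionary names are illustrative.

```python
# P(supplier) and P(defective | supplier), taken from the worked calculation
share = {"A": 0.5, "B": 0.4, "C": 0.1}
defect_rate = {"A": 0.06, "B": 0.05, "C": 0.3}

# total probability rule: P(D) = sum over suppliers of P(supplier) * P(D | supplier)
p_defective = sum(share[s] * defect_rate[s] for s in share)

# Bayes' Theorem: P(supplier | D) = P(supplier) * P(D | supplier) / P(D)
posterior = {s: share[s] * defect_rate[s] / p_defective for s in share}

print(round(p_defective, 2), {s: round(p, 3) for s, p in posterior.items()})
```

The posterior probabilities sum to 1, which is a quick sanity check that the suppliers form a disjoint and exhaustive partition.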
Continuous probability distributions (distribution: symbol; parameters; description)
normal: X (or some other letter); μ = mean, σ = standard deviation; symmetric, unbounded, bell shaped; arises commonly in nature and in statistics, as a result of the central limit theorem
Hint: Many other distributions approach the normal as n (or some other parameter, such as λ or df) increases
standard normal: Z; μ = mean = 0, σ = standard deviation = 1; a special variant of normal, with μ = 0 and σ = 1; represented in "Z tables"
Hint: Used for inference about proportions; the cumulative probability is provided in Z tables: For a particular value z, the cumulative probability is Φ(z) = P(Z < z); i.e., the area under the density curve to the left of z
Student's t: t; df = degrees of freedom, μ = 0 (always!); similar in shape to normal
Hint: Used for inference about means
chi-square: χ²; df = degrees of freedom; not symmetric (skewed right)
Hint: Used for inferences about categorical distributions
Sample Problems
1 For a standard normal random variable Z, find P(Z < 1.5)
Hint: Since, by definition, the values from the standard normal table are Φ(z) = P(Z < z): P(Z < 1.5) = Φ(1.5) = 0.9332
2 For a t distribution with df = 20, which critical value of t has an area of 0.05 in the right tail?
Hint: A t table generally provides the tail area, rather than the cumulative probability, as given in standard normal tables; with the row = df = 20, and the column = tail area = 0.05, a t table produces the value of 1.725
3 The heights of military recruits follow a normal distribution with a mean of 70 inches and a standard deviation of 4 inches; find the probability that a randomly chosen recruit is
a shorter than 60 inches
Hint: First, we must transform values of the variable (height) to the standard normal distribution, by taking z scores; here:
$z = \frac{x - \mu}{\sigma} = \frac{60 - 70}{4} = \frac{-10}{4} = -2.5$
Hint: Since we want the "less than" probability, the solution comes directly from the standard normal z table:
P(X < 60) = P(Z < -2.5) = Φ(-2.5) = 0.0062
b taller than 72 inches
Hint: First, the z score: $z = \frac{x - \mu}{\sigma} = \frac{72 - 70}{4} = \frac{2}{4} = 0.5$
Hint: Since this is a "greater than" probability, subtract the cumulative probability from 1:
P(X > 72) = P(Z > 0.5) = 1 - Φ(0.5) = 1 - 0.6915 = 0.3085
c between 64 and 76 inches tall
Hint: In this case, there are two boundaries: The only way to find the area under the curve between them is to find the cumulative probabilities for each, and then to subtract; this entails finding z scores for both X = 64 and X = 76:
$z = \frac{x - \mu}{\sigma} = \frac{64 - 70}{4} = \frac{-6}{4} = -1.5$ and $z = \frac{x - \mu}{\sigma} = \frac{76 - 70}{4} = \frac{6}{4} = 1.5$
P(64 < X < 76) = P(-1.5 < Z < 1.5) = Φ(1.5) - Φ(-1.5) = 0.9332 - 0.0668 = 0.8664
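The normal-distribution problems above rely on Z tables, but the same cumulative probabilities can be obtained from Python's standard library. This sketch uses `statistics.NormalDist`, whose `cdf` method plays the role of Φ; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()                       # standard normal: mean 0, standard deviation 1
heights = NormalDist(mu=70, sigma=4)   # recruit heights, in inches

p_z_below = Z.cdf(1.5)                               # P(Z < 1.5)
p_shorter_60 = heights.cdf(60)                       # "less than": cumulative probability
p_taller_72 = 1 - heights.cdf(72)                    # "greater than": 1 - cumulative
p_between = heights.cdf(76) - heights.cdf(64)        # difference of two cumulatives

print(round(p_z_below, 4), round(p_shorter_60, 4),
      round(p_taller_72, 4), round(p_between, 4))
```

Working with `NormalDist(mu=70, sigma=4)` directly is equivalent to converting to z scores first, since `heights.cdf(x)` equals `Z.cdf((x - 70) / 4)`.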
Because sample statistics are derived from random samples, they are random
The probability distribution of a statistic is called its sampling distribution
Due to the central limit theorem, some important statistics have sampling distributions that approach a normal distribution as the sample size increases (these are listed in the table below)
sample mean $\bar{x}$: expected value μ; standard error $\frac{\sigma}{\sqrt{n}}$
Hint: … if n ≥ 30, or if the population distribution is normal
sample proportion $\hat{p}$: expected value p; standard error $\sqrt{\frac{p(1 - p)}{n}}$
Knowing the expected value and standard error allows us to find probabilities; then, in turn, we can use the properties of these sampling distributions to make inferences about the parameter values when we …
Sample Problems
1 60% of the registered voters in a large district plan to vote in favor of a referendum; a random sample of 340 of these voters is selected
a What is the expected value of the sample proportion?
$E(\hat{p}) = p = 0.6$
b What is the standard error of the sample proportion?
$SE(\hat{p}) = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.6(1 - 0.6)}{340}} = 0.0266$
c What is the probability that the sample proportion is between 55% and 65%?
Hint: First, find the z scores for those proportions:
$z = \frac{\hat{p} - p}{SE(\hat{p})} = \frac{0.55 - 0.6}{0.0266} = \frac{-0.05}{0.0266} = -1.88$ and $z = \frac{\hat{p} - p}{SE(\hat{p})} = \frac{0.65 - 0.6}{0.0266} = \frac{0.05}{0.0266} = 1.88$
Now, $P(0.55 < \hat{p} < 0.65) = P(-1.88 < Z < 1.88)$ = Φ(1.88) - Φ(-1.88) = 0.9699 - 0.0301 = 0.9398
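The referendum problem can be checked numerically by treating the sample proportion as approximately normal with mean p and standard deviation equal to the standard error. A sketch under that normal-approximation assumption; because the code does not round z to two decimals the way a Z table does, its result differs slightly from 0.9398 in the last digits.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.6, 340

expected = p                          # E(p-hat) = p
se = sqrt(p * (1 - p) / n)            # SE(p-hat) = sqrt(p(1 - p)/n)

# normal approximation to the sampling distribution of the sample proportion
sampling = NormalDist(mu=expected, sigma=se)
prob_between = sampling.cdf(0.65) - sampling.cdf(0.55)

print(round(se, 4), round(prob_between, 3))
```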
2 The standard deviation of the weight of cattle in a certain herd is 160 pounds, but the mean is unknown; a random sample of size 100 is chosen
a Compute the standard error of the sample mean:
$SE(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{160}{\sqrt{100}} = 16$ lbs.
b For an individual animal in this herd, what is the probability that its weight is within 40 lbs. of the mean?
Hint: Since this problem refers to a single observation, not the sample mean, we use the standard deviation, not the standard error
Hint: Not knowing the value of μ, we can only express the boundaries for "within 40 lbs. of the mean" as X = μ + 40 and X = μ - 40
We can still compute z scores:
$z = \frac{x - \mu}{\sigma} = \frac{\mu + 40 - \mu}{160} = \frac{40}{160} = 0.25$ and $z = \frac{x - \mu}{\sigma} = \frac{\mu - 40 - \mu}{160} = \frac{-40}{160} = -0.25$
Hint: That is, "within 40 lbs. of the mean" is the same as within 0.25 standard deviation
P(-0.25 < Z < 0.25) = Φ(0.25) - Φ(-0.25) = 0.5987 - 0.4013 = 0.1974
c What is the probability that the sample mean is within 40 lbs. of the population mean?
Hint: Even though we don't know the population mean, the z score formula will allow us to find this probability
Hint: Since this is the sample mean, we must use the standard error of 16 lbs., rather than the standard deviation, in computing the z scores:
$z = \frac{\bar{x} - \mu}{SE(\bar{x})} = \frac{\mu + 40 - \mu}{16} = \frac{40}{16} = 2.5$ and $z = \frac{\bar{x} - \mu}{SE(\bar{x})} = \frac{\mu - 40 - \mu}{16} = \frac{-40}{16} = -2.5$
Now:
P(-2.5 < Z < 2.5) = Φ(2.5) - Φ(-2.5) = 0.9938 - 0.0062 = 0.9876
Hint: This probability is dramatically higher than the probability for an individual head of cattle!
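The contrast between a single observation and a sample mean in the cattle problem comes down to which spread is used in the z score: the standard deviation (160 lbs.) or the standard error (16 lbs.). A short sketch; the helper `within` is an illustrative name, and the unknown mean cancels out exactly as in the worked solution.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, margin = 160, 100, 40     # population sd (lbs.), sample size, "within 40 lbs."
se = sigma / sqrt(n)                # standard error of the sample mean

Z = NormalDist()

def within(z):
    """P(-z < Z < z) for the standard normal distribution."""
    return Z.cdf(z) - Z.cdf(-z)

p_individual = within(margin / sigma)   # single animal: z = 40/160 = 0.25
p_sample_mean = within(margin / se)     # sample mean:   z = 40/16  = 2.5

print(se, round(p_individual, 4), round(p_sample_mean, 4))
```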
When we want to draw conclusions about a population using data from a sample, we use some method of statistical inference
A hypothesis test is a procedure by which claims about populations (hypotheses) are evaluated on the basis of sample statistics
The procedure begins with a null hypothesis (H0) and an alternative (or "research") hypothesis (H1); if the sample data are too unusual, assuming H0 to be true, then H0 is rejected in favor of H1; otherwise, we fail to reject the null hypothesis, and thereby fail to support the alternative
the null hypothesis (H0): ALWAYS provides a specific value for the parameter, its "null value"; always contains "="; the null value implies a specific sampling …
the alternative hypothesis (H1, or Ha): NEVER provides a specific value for the parameter; instead, contains ">" (right-tailed), "<" (left-tailed), or "≠" (two-tailed); without any specific value for the parameter of interest, …
Hint: The tail(s) of the hypothesis test are determined by the alternative hypothesis (H1); this is one of the most important attributes of the test, regardless of which method is used

Sample Problems
In each of the following cases, formulate hypotheses to test the claim; indicate which hypothesis represents the claim
1 The manager of a bank claims that the average waiting time for customers is less than two minutes
Hint: Since the claim refers to the average, this is a test for μ
As a "less than" claim, it is represented by H1, and the hypothesis test is: H0: μ = 2, vs. H1: μ < 2 (left-tailed)
There are two major methods for carrying out a hypothesis test: the traditional approach (or fixed significance) and the p-value approach (observed significance); the following table lists the steps for each approach:
p-value approach:
formulate null and alternative hypotheses
…
reject the null hypothesis (supporting the alternative) at a significance level α, if the p-value ≤ α; otherwise, fail to reject the null hypothesis
traditional approach:
formulate a null and an alternative hypothesis
…
reject the null hypothesis (supporting the alternative) at the significance level, if the test statistic falls in the rejection region; otherwise, fail to reject the null hypothesis
Hint: With the p-value approach, the final decision is made by comparing probabilities, whereas with the traditional approach, the decision is made by comparing values of random variables; because there is a one-to-one correspondence between the values of the random variables and the probabilities, the two methods will always yield consistent results; we can convert between the two using the following simple (but important!) rule:
reject the null hypothesis (H0) ⇔ p-value ≤ α
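The equivalence between the p-value approach and the traditional (rejection region) approach can be illustrated for a test statistic that follows the standard normal distribution under H0. The sketch below is an assumption-laden illustration: the function names, the tail labels "left", "right", and "two", and the example z values are all hypothetical, and a two-tailed test splits α equally between the two tails.

```python
from statistics import NormalDist

Z = NormalDist()   # assumed distribution of the test statistic under H0

def p_value(z, tail):
    """Observed significance of a z test statistic for the given alternative."""
    if tail == "left":
        return Z.cdf(z)
    if tail == "right":
        return 1 - Z.cdf(z)
    if tail == "two":
        return 2 * (1 - Z.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

def in_rejection_region(z, alpha, tail):
    """Traditional approach: compare z with the critical value(s) for alpha."""
    if tail == "left":
        return z <= Z.inv_cdf(alpha)
    if tail == "right":
        return z >= Z.inv_cdf(1 - alpha)
    if tail == "two":
        return abs(z) >= Z.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'left', 'right', or 'two'")

alpha = 0.05
example_p = p_value(-1.88, "left")                          # hypothetical test statistic
example_reject = in_rejection_region(-1.88, alpha, "left")

# both approaches reach the same decision for a range of test statistics
agree = all(
    (p_value(z, tail) <= alpha) == in_rejection_region(z, alpha, tail)
    for z in (-2.5, -1.88, -1.0, 0.0, 1.7, 2.2)
    for tail in ("left", "right", "two")
)

print(round(example_p, 4), example_reject, agree)
```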
Test Statistics
(columns: Parameter; Test Statistic; Distribution Under H0; Assumptions)
… n(1 - p) ≥ 15
… SE(p) …
… population distribution …
Hint: Since the t distribution approaches the standard normal Z, many teachers and texts advise that it's OK to use Z if n is sufficiently large
… χ² distribution with df = k - 1 … hypothesis and k = # of categories
Hint: χ² tests for categorical data do not have directional alternative hypotheses; rejection regions are always in the right tail
2 Your friend says that a coin you are tossing is not fair
Hint: A fair coin is one that shows heads 50% of the time; the friend states that the coin is NOT fair
This is an H1 claim: H0: p = 0.5, vs. H1: p ≠ 0.5 (two-tailed)
3 A highway patrolman claims that the average speed of cars on a highway is at most 70 mph
Hint: The claim directly refers to the average; since this is an "at most" claim, it is represented by H0
The hypothesis test is: H0: μ = 70, vs. H1: μ > 70 (right-tailed)
4 A motorist claims that more than 80% of the cars on a highway travel at a speed exceeding 70 mph
Hint: Since the claim is really about a proportion, don't be fooled by the "70 mph!"; the hypotheses refer to p
As the motorist makes a "more than" claim, it is the alternative hypothesis, H1
The hypothesis test is: H0: p = 0.8, vs. H1: p > 0.8 (right-tailed)
5 The manager of a snack-food factory states that the average weight of a bag of their potato chips is exactly 5 oz. (no more, no less)
Hint: This is an "is exactly" claim that refers to the average; thus, the claim is H0
The test is: H0: μ = 5, vs. H1: μ ≠ 5 (two-tailed)

Wording of the claim: hypothesis that represents the claim; tail(s) of the test
"… is not equal to …": alternative hypothesis (H1); the hypothesis test is two-tailed
"… is less than …": alternative hypothesis (H1); the hypothesis test is left-tailed
"… is greater than …": alternative hypothesis (H1); the hypothesis test is right-tailed
"… is equal to …" / "… is exactly …": null hypothesis (H0); the hypothesis test is two-tailed
"… is at least …": null hypothesis (H0); the hypothesis test is left-tailed
"… is at most …": null hypothesis (H0); the hypothesis test is right-tailed
Sample Problems
1 In some hypothesis tests, the null hypothesis is … of error is it?
Hint: … hypothesis is rejected is a type I error
2 A researcher conducts a hypothesis test at a … what is her decision? Is it some type of error?
Hint: First, consider her decision: She will reject or fail to reject the null …
Hint: But, since the p-value is less than the significance level, α, H0 is rejected; but also, since H0 is false, this is a type II …

Hint: When the null hypothesis (H0) is rejected, we can support the alternative hypothesis (H1); this is a substantive finding: We have sufficient evidence that H0 is not correct
Hint: If H0 is not rejected, then we cannot support H1 either; this is NOT a substantive finding: We have failed to find evidence against H0, but have not "confirmed" or "proved" it to be true!
… a specific value for the parameter …
… 1 - α can be precisely determined … about the parameter
Hint: It is important to note that these probabilities are conditioned on reality, rather than the decision; that is, given that H0 is true, α is the probability of rejecting H0; it is NOT the probability that H0 …
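Finding Rejection Regions & P-Values
(sign in the alternative hypothesis: tail(s); rejection region)
<: left-tailed; values of the test statistic less than some critical value with area α in the left tail
>: right-tailed; values of the test statistic greater than some critical value with area α in the right tail
≠: two-tailed; values of the test statistic less than some critical value with area α/2 in the left tail, or greater than some critical value with area α/2 in the right tail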
Sample Problems
1 At an aquaculture facility, a large number of eels are kept in a tank; they die independently of each other at an average rate of 2.5 eels per day
a Which distribution is appropriate?
Hint: Since the events are independent, and we're given an average rate per fixed interval, a Poisson distribution can be used, with parameter: λ = 2.5
…
Hint: Since the Poisson distribution has no maximum, there is no alternative but …
d Compute the probability that at least one eel dies in the span of 12 hours:
Hint: This is harder, since the duration of the interval has changed; but, we can scale the Poisson parameter λ proportionally: If the average rate is 2.5 …
2 A cat is hunting some mice; every time she pounces at a mouse, she has …
a Which distribution is appropriate?
Hint: As there is a fixed probability of the event, but the experiment will be …
b What is the probability that she'll catch a mouse on her first attempt?
With a 20% chance of success each time, the probability of succeeding the first time is simply 0.2
Hint: We can also use the geometric pdf, with x = 1: $P(1) = (1 - 0.2)^{0}(0.2) = 0.2$
c What is the probability that she'll catch a mouse on her third attempt?
…
… hitting a bull's-eye, independently of the result for any other dart thrown; he …
… independent events, the total number of successes follows a binomial …
b How many bull's-eyes is John expected to hit? E(X) = np = 5(0.08) = 0.4
c What is the probability that he hits exactly two bull's-eyes?
d What is the probability that he hits at least one bull's-eye?
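The worked answers to several of these Poisson, geometric, and binomial questions did not survive in the text, so the sketch below computes them from the standard formulas. The inputs are assumptions read off the surviving fragments: λ = 2.5 per day scaled to 1.25 per 12 hours, a 20% success chance for the cat, and n = 5, p = 0.08 for the darts (from E(X) = 5(0.08)). The printed values are this sketch's own results, not figures quoted from the guide.

```python
from math import comb, exp

# Poisson: at least one eel dies in 12 hours (rate scaled from 2.5 per day)
lam_12h = 2.5 * 12 / 24
p_at_least_one_eel = 1 - exp(-lam_12h)          # 1 - P(X = 0)

# geometric: first success on attempt x has probability (1 - p)^(x - 1) * p
p_catch = 0.2
p_third_attempt = (1 - p_catch) ** (3 - 1) * p_catch

# binomial: X = number of bull's-eyes in n independent throws
n, p = 5, 0.08

def binomial_pmf(k):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

expected_hits = n * p
p_exactly_two = binomial_pmf(2)
p_at_least_one_hit = 1 - binomial_pmf(0)        # law of complements

print(round(p_at_least_one_eel, 4), round(p_third_attempt, 3),
      round(expected_hits, 1), round(p_exactly_two, 4), round(p_at_least_one_hit, 4))
```

In both "at least one" questions the complement is used because the Poisson distribution has no maximum, and because it avoids summing several binomial terms.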
NOTE TO STUDENT: This guide is intended for informational purposes only. Due to its condensed format, this guide cannot cover every aspect of the subject; rather, it is intended for use in conjunction with course work and assigned texts. Neither BarCharts, Inc., its writers, editors nor design staff, are in any way responsible or liable for the use or misuse of the information contained in this guide.
All rights reserved. No part of this publication may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without written permission from the publisher.
AUTHOR: Stephen V. Kizlik, Ph.D.
© 2009 BarCharts, Inc. 04 09
ISBN-13: 978-142320857-0
ISBN-10: 142320857-9
Customer Hotline # 1.800.230.9522
quickstudy.com