
INTRODUCTORY STATISTICS (QuickStudy Academic)



FREQUENCY DISTRIBUTION • MEASURES OF DISPERSION

STATISTICS: The study of methods for collecting, organizing, and analyzing data
• Descriptive statistics: Procedures used to organize and present data in a convenient and communicable form
• Inferential statistics: Procedures employed to arrive at broader conclusions or inferences about populations on the basis of samples

POPULATION: The complete set of actual or potential elements about which inferences are made

SAMPLE: A subset of the population selected using some sampling method
• Sampling methods
- Cluster sample: The population is divided into groups called clusters; some clusters are randomly selected, and every member in them is observed
- Stratified sample: The population is divided into strata, and a fixed number of elements of each stratum are selected for the sample
- Simple random sample: A sample selected so that each possible sample of the same size has an equal probability of being selected; used for most elementary inference

VARIABLE: An attribute of elements of a population or sample that can be measured; ex: height, weight, IQ, hair color, and pulse rate are some of the many variables that can be measured for people

DATA: Values of variables that have been observed
• Types of data
- Qualitative (or "categorical") data are descriptive but not numeric; ex: your gender, your birthplace, the color of an automobile
- Quantitative data take numeric values
- Discrete data take counting numbers (0, 1, 2, …) as values, usually representing things that can be counted; ex: the number of fleas on a dog, the number of times a professor is late in a semester
- Continuous data can take a range of numeric values, not just counting numbers; ex: the height of a child, the weight of a bag of beans, the amount of time a professor is late
• Levels of measurement
- Qualitative data can be measured at the:
  • Nominal level: Values are just names, without any order; ex: color of a car, major in college
  • Ordinal level: Values have some natural order; ex: high school class (freshman/sophomore/junior/senior), military rank
- Quantitative data can be measured at the:
  • Interval level: Numeric data with no natural zero point; intervals (differences) are meaningful, but ratios are not; ex: temperature in Fahrenheit degrees; 80°F is 20°F hotter than 60°F, but it is not 150% as hot
  • Ratio level: Numeric data for which there is a true zero; both intervals and ratios are meaningful; ex: weight, length, duration, most physical properties

STATISTIC: A numeric measure computed from sample data, used to describe the sample and to estimate the corresponding population parameter

PARAMETER: A numeric measure that describes a population; parameters are usually not computed, but are inferred from sample statistics

FREQUENCY DISTRIBUTION: Provides the frequency (number of times observed) of each value of a variable; ex: Table 1: students in a driving class are polled regarding the number of accidents they have had; the columns give the # of accidents, the frequency, and the relative frequency

GROUPED FREQUENCY DISTRIBUTION: Values of the variable are grouped into classes; ex: Table 2: the scores on a midterm exam are grouped into classes

RELATIVE FREQUENCY DISTRIBUTION: Each frequency is divided by the total number of observations to produce the proportion or percentage of the data set having that value; ex: third column of Table 1

CUMULATIVE FREQUENCY DISTRIBUTION: Frequencies count all observations at a particular value or class and all those less; ex: third column of Table 2
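A minimal Python sketch of all three derived distributions, using a small hypothetical poll (Table 1's actual values are not reproduced on this page):

```python
from collections import Counter

# Hypothetical poll data: number of accidents reported by each student
accidents = [0, 1, 1, 2, 0, 1, 3, 0, 1, 2, 1, 0]

n = len(accidents)
freq = Counter(accidents)                      # frequency of each value

cumulative = 0
print(f"{'value':>5} {'freq':>5} {'rel. freq':>10} {'cum. freq':>10}")
for value in sorted(freq):
    f = freq[value]
    cumulative += f                            # running total: cumulative frequency
    print(f"{value:>5} {f:>5} {f / n:>10.3f} {cumulative:>10}")
```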

MEASURES OF CENTRAL TENDENCY

MEAN: Most commonly used measure of central tendency, usually what is meant by "average"; sensitive to extreme values
  x̄ = (1/n) Σ xᵢ, summing over i = 1, …, n
• Trimmed mean: Computed by discarding some number of the highest and lowest values and averaging the rest; less sensitive than the ordinary mean
• Weighted mean: Computed with a weight multiplied by each value, making some values influence the mean more:
  x̄w = (Σ wᵢxᵢ) / (Σ wᵢ)

MEDIAN: Value that divides the set so the same number of observations lie on each side of it; less sensitive to extreme values; for an odd number of values, it is the middle value; for an even number, it is the average of the middle two; ex: in Table 1, the median is the average of the 28th and 29th observations, or 1.5

MODE: Observation that occurs with the greatest frequency; ex: in Table 1, the mode is 1
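A short sketch of these measures on a hypothetical sample (the value 40 plays the role of an extreme observation):

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 9, 40]            # hypothetical sample; 40 is extreme

mean = statistics.mean(data)                   # pulled upward by the extreme value 40
median = statistics.median(data)               # middle value; far less sensitive
mode = statistics.mode(data)                   # most frequent value

# Trimmed mean: discard the k lowest and k highest values before averaging
def trimmed_mean(values, k=1):
    trimmed = sorted(values)[k:-k]
    return statistics.mean(trimmed)

# Weighted mean: sum(w_i * x_i) / sum(w_i)
def weighted_mean(values, weights):
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(mean, median, mode, trimmed_mean(data))
print(weighted_mean([80, 90, 70], [0.2, 0.5, 0.3]))  # exam scores, grading weights
```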

MEASURES OF DISPERSION

SUM OF SQUARES (SS): The sum of squared deviations from the mean
• Population SS: Σ(xᵢ − μ)²  or, equivalently, Σxᵢ² − (Σxᵢ)²/N
• Sample SS: Σ(xᵢ − x̄)²  or, equivalently, Σxᵢ² − (Σxᵢ)²/n

VARIANCE: The average of squared differences between observations and their mean
• Population variance: σ² = (1/N) Σ(xᵢ − μ)²
• Sample variance: s² = (1/(n − 1)) Σ(xᵢ − x̄)²
• Variances for grouped data (fᵢ = frequency and mᵢ = midpoint of class i):
- Population: σ² = (1/N) Σ fᵢ(mᵢ − μ)²
- Sample: s² = (1/(n − 1)) Σ fᵢ(mᵢ − x̄)²

STANDARD DEVIATION: The square root of the variance; unlike variance, it has the same units as the original data and is more commonly used; ex: population SD: σ = √[(1/N) Σ(xᵢ − μ)²]
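A minimal sketch showing the population/sample split and the computational shortcut, on hypothetical data:

```python
import math

data = [4, 8, 6, 5, 3, 7]

n = len(data)
mean = sum(data) / n

ss = sum((x - mean) ** 2 for x in data)        # sum of squares about the mean

pop_var = ss / n                               # population variance: divide by N
samp_var = ss / (n - 1)                        # sample variance: divide by n - 1

pop_sd = math.sqrt(pop_var)                    # SD: same units as the original data
samp_sd = math.sqrt(samp_var)

# Shortcut form SS = sum(x_i^2) - (sum(x_i))^2 / n gives the same result
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n
assert math.isclose(ss, ss_shortcut)
print(pop_var, samp_var, pop_sd, samp_sd)
```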

STANDARD SCORES: Also known as z-scores; the standard score of a value is the directed number of standard deviations from the mean at which the value is found; that is, z = (x − μ)/σ
• A positive z-score indicates a value greater than the mean; a negative z-score indicates a value less than the mean; a z-score of zero indicates the mean value
• Converting every value in a data set or distribution to a z-score is called standardization; once a data set or distribution has been standardized, it has a new mean μ = 0 and a new standard deviation σ = 1
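A quick sketch of standardization, verifying the new mean of 0 and standard deviation of 1:

```python
import statistics

data = [4, 8, 6, 5, 3, 7]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)                # population standard deviation

z_scores = [(x - mu) / sigma for x in data]    # standardization

# After standardization the mean is 0 and the standard deviation is 1
assert abs(statistics.mean(z_scores)) < 1e-9
assert abs(statistics.pstdev(z_scores) - 1) < 1e-9
```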

GRAPHING TECHNIQUES

BAR GRAPH: A graph that uses bars to indicate the frequency of occurrence of observations
• Histogram: A bar graph used with quantitative, continuous variables

FREQUENCY CURVE: A graph representing a frequency distribution in the form of a continuous line that traces a histogram
• Cumulative frequency curve: A continuous line that traces a histogram where the bars in all the lower classes are stacked up in the adjacent higher class; it cannot have a negative slope
• Symmetric curve: The frequency curve is unchanged if rotated around its center; median = mean
• Normal curve: Bell-shaped curve; symmetric
• Skewed curve: Deviates from symmetry; the frequency curve is shifted, with a longer "tail" to the left (mean < median) or to the right (mean > median)


PROBABILITY: A measure of the likelihood of a random event; the long-term relative frequency with which an outcome or event occurs
• Probability of occurrence of Event A:
  P(A) = (number of outcomes favoring Event A) / (total number of possible outcomes)
• Sample space: All possible simple outcomes of an experiment
• Relationships between events
- Exhaustive: 2 or more events are said to be exhaustive if they represent all possible outcomes; symbolically, P(A or B or …) = 1
- Non-exhaustive: 2 or more events are said to be non-exhaustive if they do not exhaust all possible outcomes
- Mutually exclusive: Events that cannot occur simultaneously: P(A and B) = 0, and P(A or B) = P(A) + P(B); ex: males, females
- Non-mutually exclusive: Events that can occur simultaneously: P(A or B) = P(A) + P(B) − P(A and B); ex: males, brown eyes
- Independent: Events whose probability is unaffected by the occurrence or nonoccurrence of each other: P(A|B) = P(A), P(B|A) = P(B), and P(A and B) = P(A)P(B); ex: gender and eye color
- Dependent: Events whose probability changes depending upon the occurrence or nonoccurrence of each other: P(A|B) differs from P(A), P(B|A) differs from P(B), and P(A and B) = P(A)P(B|A) = P(B)P(A|B); ex: race and eye color
• Joint probabilities: Probability that 2 or more events occur simultaneously
• Marginal (or unconditional) probabilities: Sums of joint probabilities over all values of the other event
• Conditional probabilities: Probability of A given the occurrence of B, written P(A|B)
• Ex: Given the numbers 1 to 9 as observations in a sample space:
- Events mutually exclusive and complementary; ex: P(all odd numbers), P(all even numbers)
- Events mutually exclusive but not complementary; ex: P(an even number), P(the numbers 7 and 5)
- Events neither mutually exclusive nor exhaustive; ex: P(an even number or a 2)

FREQUENCY TABLE

           Event C   Event D   Totals
Event E         52        36       88
Event F         62        70      132
Totals         114       106      220

Ex: Joint probability of C and E: P(C & E) = 52/220 ≈ 0.24

JOINT, MARGINAL & CONDITIONAL PROBABILITY TABLE

           Event C   Event D   Marginal prob.   Conditional probability
Event E       0.24      0.16             0.40   P(C|E) = 0.60; P(D|E) = 0.40
Event F       0.28      0.32             0.60   P(C|F) = 0.47; P(D|F) = 0.53
Marginal      0.52      0.48             1.00
Conditional probability: P(E|C) = 0.46, P(F|C) = 0.54; P(E|D) = 0.33, P(F|D) = 0.67
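A minimal sketch of joint, marginal, and conditional probabilities, using counts consistent with the frequency table above:

```python
# Counts consistent with the frequency table above (total = 220)
counts = {("C", "E"): 52, ("D", "E"): 36,
          ("C", "F"): 62, ("D", "F"): 70}
total = sum(counts.values())

def joint(a, b):                               # P(A and B)
    return counts[(a, b)] / total

def marginal_row(b):                           # P(B): sum joint counts across the row
    return sum(v for (a, bb), v in counts.items() if bb == b) / total

def conditional(a, b):                         # P(A|B) = P(A and B) / P(B)
    return joint(a, b) / marginal_row(b)

print(round(joint("C", "E"), 2))               # 0.24
print(round(marginal_row("E"), 2))             # 0.40
print(round(conditional("C", "E"), 2))         # 0.59, the table's 0.60 after rounding
```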

SAMPLING DISTRIBUTION: A theoretical probability distribution of a statistic that would result from drawing all possible samples of a given size from some population

RANDOM VARIABLES: A random variable takes numeric values randomly, with probabilities specified by a probability distribution (or density) function
• Discrete random variables: Take only distinct values (as with discrete data)

• Binomial distribution: A model for the number (x) of successes in a series of n independent trials, where each trial results in success with probability p or failure with probability 1 − p; ex: the number (x) of heads ("successes") obtained in n = 12 tosses of a fair coin (probability of heads = p = 0.5)
  P(x) = nCx · pˣ(1 − p)ⁿ⁻ˣ, where P(x) is the probability of exactly x successes out of n trials with a constant probability p of success on each trial, and nCx = n!/[(n − x)! x!]
- Binomial mean: μ = np
- Binomial variance: σ² = np(1 − p)
- As n increases, the binomial approaches the normal distribution
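A minimal sketch of the binomial formula, applied to the coin-toss example above:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials, each with success prob p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 7 heads in 12 tosses of a fair coin
print(binomial_pmf(7, 12, 0.5))                # ~0.193

n, p = 12, 0.5
mean = n * p                                   # binomial mean = np = 6
variance = n * p * (1 - p)                     # binomial variance = np(1-p) = 3
```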

• Hypergeometric distribution: Represents the number of successes from a series of n trials where each trial results in success or failure
- Like the binomial, except that each trial is drawn (without replacement) from a small population with N elements split between N₁ successes and N₂ failures
- The probability of splitting the n trials between x₁ successes and x₂ failures is:
  P(x₁ and x₂) = { [N₁!/(x₁!(N₁ − x₁)!)] · [N₂!/(x₂!(N₂ − x₂)!)] } / [N!/(n!(N − n)!)]
- Hypergeometric mean: μ₁ = E(x₁) = nN₁/N, and variance: σ² = [(N − n)/(N − 1)] · [nN₁/N] · [N₂/N]

• Poisson distribution: A model for the number of occurrences of an event, x = 0, 1, 2, …, counted over some fixed interval of space or time rather than over some fixed number of trials; the parameter λ is the average number of occurrences
  P(x) = e⁻ᴸ λˣ / x!  for x = 0, 1, 2, … and λ > 0; otherwise P(x) = 0
- Poisson mean and variance: both equal λ
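A minimal sketch of the Poisson formula; the switchboard scenario is a hypothetical illustration:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x occurrences in a fixed interval, given an average of lam occurrences)."""
    return exp(-lam) * lam ** x / factorial(x)

# If a switchboard averages 3 calls per minute, probability of exactly 5 calls
print(poisson_pmf(5, 3.0))                     # ~0.101
# The Poisson mean and variance are both lam = 3.0
```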

• Continuous random variables: May take on any value along an uninterrupted interval of a number line
- Probabilities are measured only over intervals, never for single values; the probability that a continuous random variable falls between two values is exactly equal to the area under the density curve between those two values
• Normal distribution: Bell curve; a distribution whose values cluster symmetrically around the mean (also the median and mode); common in nature and important in making inferences
- The density curve is the graph of:
  f(x) = [1/(σ√(2π))] · e^(−(x − μ)²/(2σ²))
  where f(x) = frequency at a given value, σ = standard deviation of the normal distribution, μ = mean of the normal distribution, and x = value of the normally distributed variable
• Standard normal distribution: A normal distribution with a mean of 0 and a standard deviation of 1; values following a normal distribution can be transformed to the standard normal distribution by using z-scores [see Measures of Dispersion, page 1]


STATISTICAL INFERENCE

• In order to make inferences about a population, which is unobserved, a random sample is drawn
- The sample is used to compute statistics, which are then used to draw probability conclusions about the parameters of the population

  Population (unobserved) --random sampling--> Sample (observed)
          |                                          |
     measured by                                measured by
          |                                          |
      Parameters <----statistical inference----- Statistics

BIASED & UNBIASED ESTIMATORS

• Unbiased estimator of a parameter: An estimator (sample statistic) with an average value equal to the value of the parameter; ex: the sample mean is an unbiased estimator of the population mean; the average value of all possible sample means is the population mean; all other factors being equal, an unbiased estimator is preferable to a biased one

• Biased estimator of a parameter: An estimator (sample statistic) that does not equal on the average the value of the parameter; ex: the median is a biased estimator, since the average of sample medians is not always equal to the population median; variance calculated from a sample, dividing by n, is a biased estimator of the population variance; however, when calculated with n-I it is unbiased

- Note: Estimators themselves present only one source

of bias; even when an unbiased estimator is used, bias in the sample (elements not all equally likely to be chosen) may still be present

- Elementary methods of in ference assume unbiased sampling

• Sampling distribution: The probability distribution of a sample statistic that would result from drawing all possible samples of a given size from some population; because samples are drawn at random, every sample statistic is a random variable and has a probability distribution that can be described using a mean and standard deviation
• Standard error: The standard deviation of the estimator; do not confuse this with the standard deviation of the sample itself; the standard error measures the variability of the estimates around their expected value, while the standard deviation of the sample reflects the variability within the sample around the sample mean
- The standard deviation of all possible sample means of a given sample size, drawn from the same population, is called the standard error of the sample mean
- If the population standard deviation σ is known, the standard error is: σx̄ = σ/√n
- Usually, the population standard deviation σ is unknown and is estimated by the sample standard deviation s; in this case, the estimated standard error is: sx̄ = s/√n
- Note: In either case, the standard error of the sample mean decreases as sample size increases - a larger sample provides more reliable information about the population


HYPOTHESIS TESTING

• In a hypothesis test, sample data are used to accept or reject a null hypothesis (H0) in favor of an alternative hypothesis (H1); the significance level at which the null hypothesis can be rejected measures the strength of the evidence against it
• Null hypothesis (H0): Always specifies a value (the null hypothesis value) for a population parameter; the null hypothesis is assumed to be true - this assumption underlies the computations for the hypothesis test; ex: H0: "a coin is unbiased," that is, the proportion of heads is 0.5: H0: p = 0.5

• Alternative hypothesis (H1): Never specifies a value for a parameter; the alternative hypothesis states that a population parameter has some value different from the one specified under the null hypothesis; ex: H1: "a coin is biased"; that is, the proportion of heads is not 0.5: H1: p ≠ 0.5
1. Two-tailed (or nondirectional): An alternative hypothesis (H1) that states only that the population parameter is simply different from the one specified under H0; two-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is different from 65, we would use the two-tailed hypothesis test H0: μ = 65 vs H1: μ ≠ 65
2. One-tailed (or directional): An alternative hypothesis (H1) that states that the population parameter is greater than (right-tailed) or less than (left-tailed) the value specified under H0; one-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is greater than 65, we would use the right-tailed hypothesis test H0: μ = 65 vs H1: μ > 65
• The alternative hypothesis H1 is also sometimes known as the "research hypothesis," as only claims expressed as alternative hypotheses can be positively asserted

• Level of significance: The probability of observing sample results as extreme as, or more extreme than, those actually observed, under the assumption that the null hypothesis is true; if this probability is small enough, we conclude there is sufficient evidence to reject the null hypothesis; there are two basic approaches:
1. Fixed significance level (traditional method): A level of significance α is predetermined; commonly used significance levels are 0.01, 0.05, and 0.10
- The smaller the significance level α, the higher the standard for rejecting H0; critical value(s) for the test statistic are determined such that the probability of the test statistic being farther from zero than the critical value (in one or two tails, depending on H1) is α; if the test statistic falls beyond the critical value - in the rejection region - then H0 can be rejected at that fixed significance level α
2. Observed significance level (p-value method): The test statistic is computed using the sample data, then the appropriate probability distribution is used to find the probability of observing a sample statistic that differs at least that much from the null hypothesis value for the population parameter (the probability value, or p-value); the smaller the p-value, the better the evidence against H0
- This method is more commonly used by computer applications
- The p-value also represents the smallest significance level α at which H0 can be rejected; thus, p-value results can be used with a fixed significance level by rejecting H0 if p-value ≤ α
- Generally, the larger (farther from zero, positive or negative) the value of the test statistic, the smaller the p-value will be, providing better evidence against the null hypothesis in favor of the alternative

• Notion of indirect proof: Through traditional hypothesis testing, the null hypothesis can never be proven true; ex: if we toss a coin 200 times and tails comes up exactly 100 times, we have no evidence that the coin is biased, but we cannot prove the coin is fair because of the random nature of sampling - it is possible to flip an unfair coin 200 times and get exactly 100 heads, just as it is possible to draw a sample from a population with mean 104.5 and find a sample mean of 101; failing to reject the null hypothesis does not prove it true, and rejecting it does not prove it false

• Two types of errors
- Type I error: Rejecting H0 when it is actually true; the probability of a type I error is given by the significance level α; type I error is generally more prominent, as it can be controlled
- Type II error: Failing to reject H0 when it is actually false; the probability of a type II error is denoted β; type II error is often (foolishly) disregarded: it is difficult to measure or control, as β depends on the true value of the parameter in question, which is unknown

                        True status of H0
Decision                H0 true           H0 false
Reject H0               Type I error      Correct decision
Fail to reject H0       Correct decision  Type II error

CENTRAL LIMIT THEOREM (for the sample mean x̄)
If x₁, x₂, x₃, …, xₙ is a simple random sample of n elements from a large (infinite) population with mean μ and standard deviation σ, then the distribution of x̄ takes on the bell shape of a normal random variable as n increases, and the distribution of the ratio (x̄ − μ)/(σ/√n) approaches the standard normal distribution as n goes to infinity; in practice, a normal approximation is acceptable for samples of size 30 or larger
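A small simulation sketch of the theorem: samples drawn from a decidedly non-normal (uniform) population still yield sample means whose spread matches σ/√n.

```python
import random
import statistics

random.seed(1)
sigma = (1 / 12) ** 0.5                        # sd of a Uniform(0, 1) population

for n in (5, 30, 100):
    sample_means = [statistics.mean(random.random() for _ in range(n))
                    for _ in range(2000)]
    se = statistics.pstdev(sample_means)       # empirical standard error
    print(n, round(se, 4), round(sigma / n ** 0.5, 4))  # ~sigma/sqrt(n)
```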

z-TEST FOR A POPULATION MEAN
• Requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• Used when the population standard deviation σ is known: if σ is known (treated as a constant, not random) and the above conditions are met, then the distribution of the sample mean follows a normal distribution, and the test statistic z follows a standard normal distribution; note that this is rarely the case in reality, and the t-distribution is more widely used
• The test statistic is z = (x̄ − μ)/σx̄, where μ = the population mean (either known or hypothesized under H0) and σx̄ = σ/√n

• Critical region: The portion of the area under the curve that includes those values of the test statistic that provide sufficient evidence for the rejection of the null hypothesis
- The most often used significance levels are 0.01, 0.05, and 0.10; for a one-tailed test using the z-statistic, these correspond to z-values of 2.33, 1.65, and 1.28 respectively - positive values for a right-tailed test, negative for a left-tailed test
- For a two-tailed test, the critical region for α = 0.01 is split into two equal outer areas marked by z-values of ±2.58; for α = 0.05, the critical values of z are ±1.96; and for α = 0.10, the critical values are ±1.65

• Ex 1: Given a population with σ = 50, a simple random sample of n = 100 values is chosen, with a sample mean x̄ of 255; test, using the p-value method, H0: μ = 250 vs H1: μ > 250; is there sufficient evidence to reject the null hypothesis?
- In this case, the test statistic is z = (255 − 250)/(50/√100) = 5/5 = 1.00
- Looking at Table A, the area given for z = 1.00 is 0.3413; the area to its right (since H1 is ">", this is a right-tailed test) is 0.5 − 0.3413 = 0.1587, or 15.87%
- This is the p-value: the probability, if H0 is true (that is, if μ = 250), of obtaining a sample mean of 255 or greater; it also represents the smallest significance level α at which H0 can be rejected
- Since, even if H0 is true, the probability of obtaining a sample mean ≥ 255 from this population with a sample of size n = 100 is about 16%, it is quite plausible that H0 is true - there is not very good evidence to support the alternative hypothesis that the population mean is greater than 250, so we fail to reject H0
- H0 cannot even be rejected at the weakest common significance level, α = 0.10, since 0.1587 > 0.10; remember, this does not prove the population mean to be equal to 250; we just have not accumulated sufficient evidence against the claim
• Ex 2: A simple random sample of size n = 25 is taken from a population following a normal distribution with σ = 15; the sample mean x̄ is 95; use the p-value method to test H0: μ = 100 vs H1: μ ≠ 100; is there sufficient evidence to reject the claim that the population mean is 100 at a significance level α of 0.10? At α = 0.05?
- In this case, the test statistic is z = (95 − 100)/(15/√25) = −5/3 = −1.67
- Since the normal curve is symmetric, we can look up a z-score of 1.67: the value in Table A is 0.4525; that is, P(0 < z < 1.67) = P(−1.67 < z < 0) = 0.4525; thus, P(z < −1.67) = P(z > 1.67) = 0.5 − 0.4525 = 0.0475
- Since this is a two-tailed test (H1: μ ≠ 100), the p-value is twice this area, or 0.095
- Since the p-value = 0.095 < 0.10 = α, there is sufficient evidence to reject the null hypothesis at a significance level α of 0.10; but in the second case, the p-value = 0.095 > 0.05 = α, so the sample data are not strong enough to reject at the more stringent (0.05) level of significance


TABLE A: NORMAL CURVE AREAS (EXCERPT)
Entries give the area under the standard normal curve between 0 and z, i.e., P(0 ≤ Z ≤ z); columns give the second decimal place of z

  z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 0.0  .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
 0.1  .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
 0.3  .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
 0.6  .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
 0.8  .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
 0.9  .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
 1.0  .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
 1.7  .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
 1.8  .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
 2.4  .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936
 2.5  .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952

t-TEST FOR A POPULATION MEAN
• Used when the population standard deviation σ is unknown and estimated by the sample standard deviation s; requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• The test statistic t used for testing hypotheses about a population mean is t = (x̄ − μ)/(s/√n), with n − 1 df; note: this is not so different from the test statistic z used when σ is known!
• The t-distribution is characterized by its degrees of freedom (df), referring to the number of values that are free to vary after placing certain restrictions on the data; ex: if the mean of 4 numbers is 87, we know that the sum of the numbers is 4 × 87 = 348; this tells us nothing about the individual values in the sample - there are infinitely many possibilities; for instance, the first number might be 84, the second 98, and the third 81; but if the first three numbers are 84, 98, and 81, then the fourth is determined by the fixed sum, so there are n − 1 = 3 degrees of freedom
• The t-distribution has more spread than the standard normal, producing a larger critical value of t as the boundary for the rejection region; as df increases, the t-distribution approaches the standard normal z-distribution
• Ex: A simple random sample of size 25 is taken from a population following a normal distribution, with a sample mean of 42 and a sample standard deviation of 7.5; test at a fixed significance level α = 0.05: H0: μ = 45 vs H1: μ < 45
- Consulting Table B to find the appropriate critical value, with df = n − 1 = 24, produces a critical value of −1.711; the null hypothesis can be rejected at α = 0.05 if the value of the test statistic t < −1.711
- The test statistic t = (42 − 45)/(7.5/√25) = −3/1.5 = −2; since this is less than the critical value of −1.711, H0 is rejected at α = 0.05

CONFIDENCE INTERVALS
• Confidence interval: An interval within which a population parameter is likely to be found; the confidence level (1 − α) gives the probability that the interval contains the parameter ([α] refers to the level of significance)
• Common confidence levels are 90%, 95%, and 99%, just as common levels of significance are 0.10, 0.05, and 0.01
• If σ is known, a (1 − α) confidence interval for μ is x̄ ± z(α/2) · σ/√n, where z(α/2) is the normal variable that puts an area of α/2 in each tail
• Usually σ is unknown, and the (1 − α) confidence interval for μ is x̄ ± t(α/2) · s/√n, with n − 1 df, based on this sample
• A confidence interval can also be read as a family of hypothesis tests; ex: if a 95% confidence interval for μ runs from 102 to 114, any hypothesized μ below 102 or above 114 would be rejected at 0.05 significance
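A minimal sketch of the t statistic and interval for the example above; the critical values −1.711 and 2.064 are taken as given from a t-table such as Table B:

```python
from math import sqrt

# Data from the example above: n = 25, sample mean 42, sample sd 7.5
n, xbar, s = 25, 42.0, 7.5

se = s / sqrt(n)                               # estimated standard error = 1.5

# Test statistic for H0: mu = 45 (left-tailed; reject if t < -1.711 at alpha = 0.05)
t = (xbar - 45) / se
print(t)                                       # -2.0 -> reject H0

# 95% confidence interval: xbar +/- t(alpha/2) * s/sqrt(n); t(0.025, 24 df) = 2.064
t_crit = 2.064
print(xbar - t_crit * se, xbar + t_crit * se)  # about (38.9, 45.1)
```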


COMPARING TWO SAMPLE MEANS
• Sampling distribution of the difference between means: If a number of pairs of samples were taken from the same population or from two different populations, then:
- The distribution of differences between pairs of sample means tends to be normal (z-distribution)
- The mean of these differences between means, μx̄₁₋x̄₂, is equal to the difference between the population means, that is, μ₁ − μ₂
• Independent samples
- We are testing whether or not two samples are drawn from populations with the same mean, that is, H0: μ₁ = μ₂, versus a one- or two-tailed alternative
- When σ₁ and σ₂ are known, the standard error of the difference between means is:
  σx̄₁₋x̄₂ = √(σ₁²/n₁ + σ₂²/n₂)
- Where (μ₁ − μ₂) represents the hypothesized difference in means (zero under the null hypothesis), the following statistic can be used for hypothesis tests:
  z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / σx̄₁₋x̄₂
- When σ₁ and σ₂ are unknown, which is usually the case, substitute s₁ and s₂ for σ₁ and σ₂ in the above formulas and use the t-distribution with df = (n₁ − 1) + (n₂ − 1) = n₁ + n₂ − 2; this requires that both populations have normal distributions and that σ₁ and σ₂, while not known, can be assumed equal
- Use the following formula for the estimated (pooled) standard error:
  sx̄₁₋x̄₂ = √{ [((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)] · [1/n₁ + 1/n₂] }
- Note: many statisticians do not recommend the t-distribution with pooled standard error unless homogeneity of variances can be established; the above approach is more conservative
- Determine the critical region for rejection by assigning an acceptable level of significance and looking at the t-table with df = n₁ + n₂ − 2
- The hypothesis test may be 2-tailed (= vs ≠) or 1-tailed; ex: H0: μ₁ ≤ μ₂ and the alternative is H1: μ₁ > μ₂
- Homogeneity of variances (a criterion for the pooled 2-sample t-test): The condition that the variances of the two populations are equal; to establish homogeneity of variances, test H0: σ₁² = σ₂² vs H1: σ₁² ≠ σ₂² (note that this is equivalent to testing H0: σ₁²/σ₂² = 1 vs H1: σ₁²/σ₂² ≠ 1)
- Under the null hypothesis, the test statistic s₁²/s₂² follows an F-distribution with degrees of freedom (n₁ − 1, n₂ − 1); if the test statistic exceeds the critical value in Table C, the null hypothesis can be rejected at the indicated level of significance

MEAN DIFFERENCE (PAIRED SAMPLES)
• Used to test for the mean difference between paired observations; for instance, clients of a weight-loss program might be weighed before and after the program, and a significant mean difference ascribed to the effectiveness of the program
• Standard error of the mean difference - general formula: sd̄ = sd/√n, where sd is the standard deviation of the differences
• We can then test H0: μd = 0 versus a one- or two-tailed alternative by using a t-test statistic

ANALYSIS OF VARIANCE (ANOVA)
• Tests whether a difference exists between more than two group means; indicates the possibility of an overall mean effect of the experimental treatments
• ANOVA consists of obtaining independent estimates of variance from population subgroups; the total sum of squares is partitioned into known components of variation
• Partition of variances
- Between-group variance (BGV): Reflects the magnitude of the difference(s) among the group means:
  BGV = Σ nᵢ(x̄ᵢ − x̄tot)² / (k − 1), where x̄ᵢ = mean of the i-th treatment group and x̄tot = mean of all n values
- Within-group variance (WGV): Reflects the dispersion within each treatment group around the subgroup mean; also referred to as the error term:
  WGV = (SS₁ + SS₂ + … + SSk) / (n − k), where the SS's are the sums of squares [see Measures of Central Tendency, page 1] of each subgroup's values
• USING THE F-RATIO: F = BGV/WGV, with k − 1 df for the numerator and n − k for the denominator
- When the BGV is large relative to the WGV, the F-ratio will also be large; if BGV > WGV, the experimental treatments are responsible for the large differences among group means
- Null hypothesis: The group sample means are all estimates of a common population mean; that is, H0: μ₁ = μ₂ = μ₃ = … = μk for all k treatment groups, vs H1: at least one pair of means is different (determining which pair(s) are different requires follow-up testing)

SAMPLE PROPORTION
• In random samples of size n, the sample proportion p fluctuates around the population proportion π with variance π(1 − π)/n and standard error √(π(1 − π)/n)
• As the sample size increases, the sample proportion concentrates more around its target mean, and its distribution gets closer to the normal

CORRELATION
• The correlation coefficient r (also known as the "Pearson Product-Moment Correlation Coefficient") is a measure of the linear (straight-line) relationship between two quantitative variables
• Ex: Given paired observations of two variables X and Y, we can compute their corresponding sums of squares SSx = Σ(x − x̄)², SSy = Σ(y − ȳ)², and SSxy = Σ(x − x̄)(y − ȳ)
• Computational formula for the Pearson correlation:
  r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] · [Σy² − (Σy)²/n] }
• When r = 1, the data are said to have perfect positive correlation - if plotted, they would form a straight line with positive (upward) slope; when r = −1, the data have perfect negative correlation - if plotted, they would form a straight line with negative (downward) slope; if r = 0, the data are said to have no linear correlation (though they may be related in some other way)
• Note: it is possible for a sample from a population with zero correlation to produce by chance a sample with r ≠ 0!

CHI-SQUARE (χ²) TESTS
• Most widely used non-parametric test
• The χ² mean = its degrees of freedom; the χ² variance = twice its degrees of freedom; the shape of the distribution depends on the value of df
• Can be used to test independence, homogeneity, and goodness-of-fit
• The square of a standard normal variable is a chi-square variable with df = 1

TABLE C: CRITICAL POINTS OF THE F-DISTRIBUTION (EXCERPT)
Columns give degrees of freedom for the numerator; rows give degrees of freedom for the denominator; for each row, the top line is the critical point for α = .05 and the bottom line for α = .01

 df      1      2      3      4      5      6      7      8      9     10
  1    161    200    216    225    230    234    237    239    241    242
      4052   4999   5403   5625   5764   5859   5928   5981   6022   6056
  2  18.51  19.00  19.16  19.25  19.30  19.33  19.36  19.37  19.38  19.39
     98.49  99.00  99.17  99.25  99.30  99.33  99.34  99.36  99.38  99.40
  3  10.13   9.55   9.28   9.12   9.01   8.94   8.88   8.84   8.81   8.78
     34.12  30.81  29.46  28.71  28.24  27.91  27.67  27.49  27.34  27.23
  4   7.71   6.94   6.59   6.39   6.26   6.16   6.08   6.04   6.00   5.96
     21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.66  14.54
  5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.78   4.74
     16.26  13.27  12.06  11.39  10.97  10.67  10.45  10.27  10.15  10.05
  6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06
     13.74  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87
  7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.63
     12.25   9.55   8.45   7.85   7.46   7.19   7.00   6.84   6.71   6.62
  8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.34
     11.26   8.65   7.59   7.01   6.63   6.37   6.19   6.03   5.91   5.82
  9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.13
     10.56   8.02   6.99   6.42   6.06   5.80   5.62   5.47   5.35   5.26
 10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.97
     10.04   7.56   6.55   5.99   5.64   5.39   5.21   5.06   4.95   4.85
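A minimal sketch of the BGV/WGV partition and F-ratio on three hypothetical treatment groups; the resulting F would be compared against Table C:

```python
# One-way ANOVA by hand on three hypothetical treatment groups
groups = [[23, 25, 21, 24], [30, 32, 29, 31], [22, 20, 24, 23]]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Between-group variance: spread of the group means around the grand mean
bgv = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups) / (k - 1)

# Within-group variance: pooled sums of squares inside each group (error term)
wgv = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups) / (n - k)

f_ratio = bgv / wgv                            # compare to Table C with (k-1, n-k) df
print(round(f_ratio, 2))
```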

CHI-SQUARE (χ²) TESTS (continued)

DEGREES OF FREEDOM (df) COMPUTATION
• If chi-square tests for goodness-of-fit to a hypothesized distribution (uses a frequency distribution), df = g − 1, where g = number of groups, or classes, in the frequency distribution
• If chi-square tests for homogeneity or independence (uses a two-way contingency table), df = (# of rows − 1)(# of columns − 1)

GOODNESS-OF-FIT TEST: To apply the chi-square distribution in this manner, the chi-square test statistic is expressed as: χ² = Σ[(fo − fe)²/fe], where fo = observed frequency of the variable and fe = expected frequency (based on the hypothesized population distribution)
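A minimal sketch of the goodness-of-fit statistic; the die counts are hypothetical:

```python
# Goodness-of-fit: is a die fair? 60 rolls, hypothesized uniform distribution
observed = [5, 8, 9, 8, 10, 20]                # fo for faces 1-6 (hypothetical counts)
expected = [10] * 6                            # fe = 60 * (1/6) under H0

chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1                         # g - 1 = 5
print(round(chi_square, 2))                    # compare to a chi-square table at df = 5
```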

TESTS OF CONTINGENCY: Application of chi-square tests to two separate populations to test statistical independence of attributes

TESTS OF HOMOGENEITY: Application of chi-square tests to two samples to test if they came from populations with like distributions

RUNS TEST: Tests whether a sequence (comprising a sample) is random; the following equations are applied:
  R̄ = 1 + 2n₁n₂/(n₁ + n₂)
  sR = √{ [2n₁n₂(2n₁n₂ − n₁ − n₂)] / [(n₁ + n₂)²(n₁ + n₂ − 1)] }
where
  R̄ = mean number of runs
  n₁ = number of outcomes of one type
  n₂ = number of outcomes of the other type
  sR = standard deviation of the distribution of the number of runs
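A minimal sketch of the runs statistics on a hypothetical coin-flip sequence:

```python
from math import sqrt

seq = "HHTHTTHHHTHTTH"                         # hypothetical sequence
n1 = seq.count("H")
n2 = seq.count("T")
runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)  # observed runs

mean_r = 1 + 2 * n1 * n2 / (n1 + n2)
var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (runs - mean_r) / sqrt(var_r)              # compare to standard normal criticals
print(runs, round(mean_r, 2), round(z, 2))
```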

TESTING THE CORRELATION COEFFICIENT
With a simple random sample of size n producing a sample correlation coefficient r, it is possible to test for linear correlation in the population, ρ; that is, we conduct the hypothesis test H0: ρ = ρ₀ versus a right-, left-, or two-tailed alternative; usually we are interested in determining whether there is any linear correlation at all, that is, ρ₀ = 0
The test statistic is:
  t = r / √[(1 − r²)/(n − 2)]
which follows a t-distribution with n − 2 degrees of freedom under H0; this hypothesis test assumes that the sample is drawn from a population with a bivariate normal distribution
• Ex: A simple random sample of size 27 produces a correlation coefficient r = −0.412; is this sufficient evidence at α = 0.05 of negative linear correlation in the population? We need a left-tailed test, H0: ρ = 0 vs H1: ρ < 0, on a t-distribution with n − 2 = 25 df, which gives a critical value of −1.708
  t = −0.412 / √[(1 − (−0.412)²)/(27 − 2)] = −2.26
Since −2.26 < −1.708, there is sufficient evidence to reject the null hypothesis of no linear correlation and support the alternative hypothesis of a negative linear correlation at α = 0.05
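A minimal sketch of the same calculation:

```python
from math import sqrt

# The example above: n = 27, r = -0.412, left-tailed test of H0: rho = 0
n, r = 27, -0.412

t = r / sqrt((1 - r ** 2) / (n - 2))           # t-distribution with n - 2 = 25 df
print(round(t, 2))                             # ~-2.26, beyond the critical -1.708
```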

REGRESSION
Regression is a method for predicting values of one variable (the outcome or dependent variable) on the basis of the values of one or more independent or predictor variables; fitting a regression model is the process of using sample data to determine an equation to represent the relationship
In a simple linear regression model, we use only one predictor variable and assume that the relationship to the outcome variable is linear; that is, the graph of the regression equation is a straight line (we often refer to the "regression line"); for the entire population, the model can be expressed as:
  y = β₀ + β₁x + e
• y is called the dependent variable (or outcome variable), as it is assumed to depend on a linear relationship to x
• x is the independent variable, also called the predictor variable
• β₀ is the intercept of the regression line; that is, the predicted value of y when x = 0
• β₁ is the slope of the regression line - the marginal change in y per unit change in x
• e refers to random error; the error term is assumed to follow a normal distribution with a mean of zero and constant variation - that is, there should be no increase or decrease in dispersion for different regions along the regression line; in addition, it is assumed that error terms are independent for different (x, y) observations

On the basis of sample data, we find estimates b₀ and b₁ of the intercept β₀ and slope β₁; this gives us the estimated (or sample) regression equation ŷ = b₀ + b₁x
The parameter estimates b₀ and b₁ can be derived in a variety of ways; one of the most common is known as the method of least squares; least squares estimates minimize the sum of squared differences between predicted and actual values of the dependent variable y
For a simple linear regression model, the least squares estimates of the intercept and slope are:
  estimated slope: b₁ = SSxy / SSx
  estimated intercept: b₀ = ȳ − b₁x̄
These estimates - and other calculations in regression - involve sums of squares:
  SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
  SSx = Σ(x − x̄)² = Σx² − (Σx)²/n
  SSy = Σ(y − ȳ)² = Σy² − (Σy)²/n

Ex: A simple random sample of 8 cars provides the following data on engine displacement (x) and highway mileage (y); fit a simple linear regression model

  x (displacement)   y (mileage)     x²       y²       xy
        4.6               17       21.16     289      78.2
        1.6               32        2.56    1024      51.2
        …                  …          …        …        …

Fitting the model entails computing the least-squares estimates b₀ and b₁; note that there are 8 observations, that is, n = 8
First, SSxy = Σxy − (Σx)(Σy)/n = −54.9, SSx = Σx² − (Σx)²/n = 17.26, and SSy = Σy² − (Σy)²/n = 268; then the estimated slope is b₁ = SSxy/SSx = −3.18, and the estimated intercept is b₀ = ȳ − b₁x̄ = 32.54
The estimated regression model, then, is: mileage = 32.54 − 3.18 × displacement
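A minimal sketch of the least-squares computation; since the card's full 8-car data set is not reproduced here, the x and y arrays below are hypothetical stand-ins:

```python
from math import sqrt

def fit_simple_regression(xs, ys):
    """Least-squares estimates b0, b1 and correlation r for y = b0 + b1*x."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    b1 = ss_xy / ss_x                          # estimated slope
    b0 = sum(ys) / n - b1 * sum(xs) / n        # estimated intercept
    r = ss_xy / sqrt(ss_x * ss_y)              # Pearson correlation
    return b0, b1, r

# Hypothetical engine displacement (L) vs highway mileage (mpg) data
xs = [4.6, 1.6, 3.0, 2.4, 5.0, 2.0, 3.5, 4.0]
ys = [17, 32, 24, 28, 16, 30, 22, 20]
b0, b1, r = fit_simple_regression(xs, ys)
print(round(b0, 2), round(b1, 2), round(r, 3))
```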

We can assess the significance of the model by testing whether the sample provides sufficient evidence of a linear relationship in the population; that is, we conduct the hypothesis test H0: β₁ = 0 versus H1: β₁ ≠ 0; this is exactly equivalent to testing for linear correlation in the population, H0: ρ = 0 versus H1: ρ ≠ 0; the test for correlation is somewhat simpler:
The correlation coefficient r = SSxy / √(SSx · SSy) = −0.8072
The test statistic t = (r − 0) / √[(1 − r²)/(n − 2)] = −3.350
Consulting Table B with degrees of freedom n − 2 = 6, we obtain a critical value of 3.143 at α = 0.02 and a critical value of 3.707 at α = 0.01; since we have a two-tailed test, we should consider the absolute value of the test statistic, which exceeds 3.143 but does not exceed 3.707; that is, we can reject H0 at α = 0.02 but not at α = 0.01, so the p-value is between 0.01 and 0.02 (the actual p-value, which can be found using computer applications, is 0.0154); this is a reasonably significant model

LINEAR DETERMINATION
Regression models are also assessed by the coefficient of linear determination, r²; this represents the proportion of the total variation in y that is explained by the regression model; the coefficient of linear determination can be calculated in a variety of ways; the easiest is to square the correlation coefficient: the coefficient of determination is r² = (r)²

RESIDUALS
The difference between an observed and a fitted value of y, (y − ŷ), is called a residual; examining the residuals is useful to identify outliers (observations far from the regression line, representing unusual values for x and y) and to check the assumptions of the model
