Random Variables, Probability Distributions, and Expected Values
Random Variables (RV’s): Numerical values assigned to the outcomes of an experiment. Capital letters X, Y, Z, with or without subscripts, are used to denote RV’s.
Examples
B – V colors of stars
Absolute magnitude of quasars M_i in the i band
Number of electrons emitted from a cathode in a time interval of length t
Two types: Discrete and Continuous. The first two RV’s above are continuous; the third is discrete.
a. Probability distribution of a discrete random variable: a table of the values of the variable and the proportion of times (or probability) each value occurs, which may be expressible in functional form.
b. Probability distribution of a continuous random variable: an idealized curve (perhaps suggested by a histogram) in which the probability that the variable falls in a range of values is represented by the area under the curve over that range.
Example: Discrete Random Variable
Consider observing a sequence of trials, each with exactly two possible outcomes (say, success and failure), until the first success occurs, where the trials are independent of one another and each has the same probability θ of success. Then it can be shown that the probability function of the number Y of trials until the first success occurs is given by
$p(y \mid \theta) = \theta (1 - \theta)^{y-1}$, y = 1, 2, …, and 0 otherwise (geometric distribution).
The parameter θ is the probability of success. For example, suppose we examine astronomical objects at random, looking for some particular type of object, and count the number of objects examined until the first object of that type is found.
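The geometric probability function can be checked numerically. The following is a minimal sketch, not part of the original notes; the success probability 0.1 is a hypothetical value chosen only for illustration.

```python
def geometric_pmf(y: int, theta: float) -> float:
    """P(Y = y): the first success occurs on trial y (y = 1, 2, ...)."""
    if y < 1:
        return 0.0
    return theta * (1.0 - theta) ** (y - 1)

theta = 0.1  # hypothetical success probability (assumption, not from the notes)

# Probabilities for y = 1..500; the mass beyond 500 trials is negligible here.
probs = [geometric_pmf(y, theta) for y in range(1, 501)]
total = sum(probs)

# P(Y <= 10): chance that the object is found within the first 10 examined.
p_within_10 = sum(probs[:10])

print(f"sum of probabilities = {total:.6f}")
print(f"P(Y <= 10)           = {p_within_10:.4f}")
```

The partial sum agrees with the closed form 1 - (1 - θ)^10, which is the probability that the first 10 trials are not all failures.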
Expected Value of a Discrete RV.
The mean µ of a probability distribution or the mean of a random variable or the expected value of X is defined to be
$\mu = E(X) = \sum_k k \, P(X = k)$, and more generally, for a function g(x), $E[g(X)] = \sum_x g(x) \, p(x)$.
In particular, the expected value of the RV $X^2$ is given by
$E(X^2) = \sum_k k^2 \, P(X = k)$.
The variance $\sigma^2$ of a RV X is given by $\sigma^2 = \mathrm{Var}(X) = E(X^2) - [E(X)]^2$, and the standard deviation of X, SD(X), is defined to be $\sigma = \sqrt{\sigma^2}$.
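As a small illustration of these definitions, the sketch below computes E(X), E(X²), Var(X) and SD(X) for a discrete distribution stored as a table of values and probabilities. The fair six-sided die is a hypothetical example, not one taken from the notes.

```python
def expected_value(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x), for a pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die.
die = {k: 1 / 6 for k in range(1, 7)}

mu = expected_value(die)                      # E(X)
ex2 = expected_value(die, lambda x: x ** 2)   # E(X^2)
var = ex2 - mu ** 2                           # Var(X) = E(X^2) - [E(X)]^2
sd = var ** 0.5                               # SD(X) = sqrt(Var(X))

print(f"mean = {mu:.4f}, variance = {var:.4f}, sd = {sd:.4f}")
```

For the die, the mean is 3.5 and the variance is 35/12.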
A special discrete probability distribution we will encounter is the Poisson.
The 'Poisson Distribution'.
Situations in which there are many opportunities for some phenomenon to occur, but the chance that it will occur in any given time interval, region of space, or other unit is very small, lead to the number X of occurrences having a Poisson distribution. The Poisson distribution has a parameter λ measuring the rate at which the phenomenon occurs per unit (time period, interval, area, etc.). Here are some examples:
1. Number X of earthquakes of magnitude greater than 5.0 in a region (for example, California, Indonesia, Iran, Turkey, Mexico) in a specified period (five years?)
2. Number X of times lightning strikes in a 30-minute period in a region (like the state of Colorado)
3. The arrival times of photons from a non-variable astronomical object
4. The spatial distribution of instrumental background photons in an image
5. The number of photons arriving in adjacent bins in a spectrum of a faint continuum source
6. The number of ‘arguments’ married couples have in one year
The probability distribution (frequency function) p(y) of a Poisson random variable with rate parameter λ is given by
$p(y \mid \lambda) = e^{-\lambda} \lambda^{y} / y!$, y = 0, 1, 2, …
Fact: The sum of independent Poisson random variables has a Poisson distribution whose parameter is the sum of the parameters of the individual variables. Assume the $Y_i$, i = 1, 2, …, n, are independent and $Y_i$ has a Poisson distribution with parameter $\lambda_i$. Then $Y = \sum_i Y_i$ has a Poisson distribution with parameter $\lambda = \sum_i \lambda_i$.
The mean and variance of the Poisson distribution are both equal to λ. For ‘large’ values of λ, say λ > 25 (or even smaller), the Poisson distribution is approximately normal. A probability histogram of the Poisson distribution with λ = 25 is given below.
What does the distribution look like? Yeah, normal! So, if λ is large, one can approximate Poisson probabilities using the normal distribution with mean λ and standard deviation $\sqrt{\lambda}$.
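The normal approximation can be checked numerically for λ = 25. The sketch below is illustrative only: the interval 20 to 30 and the continuity correction (using 19.5 and 30.5 as the normal limits) are assumptions added here, not part of the notes.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """p(y | lam) = exp(-lam) * lam**y / y!, evaluated on the log scale for stability."""
    if y < 0:
        return 0.0
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """CDF of the normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

lam = 25.0
ys = range(0, 101)  # truncation point; the mass beyond y = 100 is negligible for lam = 25
pmf = [poisson_pmf(y, lam) for y in ys]

total = sum(pmf)
mean = sum(y * p for y, p in zip(ys, pmf))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, pmf))

# P(20 <= Y <= 30): exact Poisson sum vs normal approximation with continuity correction.
exact = sum(pmf[20:31])
approx = normal_cdf(30.5, lam, math.sqrt(lam)) - normal_cdf(19.5, lam, math.sqrt(lam))

print(f"total = {total:.6f}, mean = {mean:.4f}, variance = {var:.4f}")
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

The computed mean and variance both come out at λ, and the two probabilities should agree closely.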
If a response variable in a regression context has a Poisson distribution, one can perform a ‘Poisson regression’ analogously to what one does if Y has a normal distribution in conventional linear or multiple regression. We will illustrate this later, as an example of ‘generalized linear models’.
[Figure: Scatterplot of p(y | 25) vs y]
Continuous Random Variables
Definition: A continuous random variable X is one for which the outcome can be any number in an interval or collection of intervals.
Examples: Height, weight, time, head circumference, rainfall amounts, lifetime of light bulbs, physical measurements, etc.
Probabilities are obtained as areas under a curve, called the probability density function (pdf) f(x). Below is a graph of the pdf
$f(x \mid 20) = \frac{1}{20} e^{-x/20}$, for x > 0 and 0 elsewhere.
It is called the exponential pdf with mean µ = 20; the standard deviation is also 20. It could represent, e.g., the lifetimes of batteries until recharging. The cumulative distribution function (CDF) gives the total area under the curve to the left of x (the cumulative probability):
$F(x) = \int_{0}^{x} f(y \mid 20) \, dy = 1 - e^{-x/20}$, for x > 0 and 0 elsewhere.
Areas under the curve between two points give the proportion of a population that have values between the two points. For example,
$\mathrm{Prob}(10 < X < 30) = \int_{10}^{30} \frac{1}{20} e^{-y/20} \, dy = e^{-10/20} - e^{-30/20} = e^{-0.5} - e^{-1.5}$.
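The probability above can be evaluated two ways: from the CDF, and by numerically integrating the pdf. This is a minimal sketch; the midpoint rule and the number of subintervals are choices made here for illustration.

```python
import math

MEAN = 20.0  # mean of the exponential distribution used in the example

def exp_pdf(x: float) -> float:
    """f(x | 20) = (1/20) exp(-x/20) for x > 0, and 0 elsewhere."""
    return math.exp(-x / MEAN) / MEAN if x > 0 else 0.0

def exp_cdf(x: float) -> float:
    """F(x) = 1 - exp(-x/20) for x > 0, and 0 elsewhere."""
    return 1.0 - math.exp(-x / MEAN) if x > 0 else 0.0

def midpoint_integral(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the area under f between a and b with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

prob_from_cdf = exp_cdf(30.0) - exp_cdf(10.0)
prob_numeric = midpoint_integral(exp_pdf, 10.0, 30.0)

print(f"Prob(10 < X < 30) from CDF      = {prob_from_cdf:.4f}")
print(f"Prob(10 < X < 30) by integration = {prob_numeric:.4f}")
```

Both values equal e^(-0.5) - e^(-1.5), about 0.383, so roughly 38% of the population falls between 10 and 30.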
[Figure: Scatterplot of f(x | 20) vs x]
The Normal Distribution
The most well-known continuous distribution is probably the normal, with probability density function (pdf) f(x) given by
$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / (2\sigma^2)}$, $-\infty < x < \infty$, and CDF $\Phi(x) = \int_{-\infty}^{x} f(y \mid \mu, \sigma) \, dy$.
The graph of a normal pdf is the (familiar) unimodal, symmetric, bell-shaped curve. The CDF Φ(x) is an elongated S-shaped curve. The mean and variance of a normal distribution are the parameters µ and σ². Many natural phenomena have approximately normal distributions: physical measurements, astronomical variables, etc.
Descriptive Statistics
Types of Data: We classify all ‘data’ about a variable into two types:
a. Categorical: data with ‘names’ as values.
Ex.: type of gamma-ray burst (GRB): short-hard, long-soft.
b. Numerical (or quantitative): data with ‘numerical’ values.
Ex.: mass of black holes, distance to stars, temperature at launch time of a shuttle, brightness of a star.
Numerical (also called quantitative) variables are divided into two types: discrete and continuous.
Parameters and Statistics
Samples: When we obtain a sample from the population, we also say we obtained a sample from the probability distribution.
Statistics are quantities calculated from samples.
Parameters are characteristics computed from the population as a whole or from a probability distribution.
The quantities µ, σ, and σ² are parameters. Statistics are used to estimate parameters. For example, the sample mean is used to estimate the mean of the population from which the sample is obtained.
Graphical and Numerical Summaries of Quantitative Variables
Numerical Summaries:
1. Measures of Location:
Three commonly used measures of the center of a set of numerical values are the mean, median, and trimmed mean.
Mean: $\bar{x}$ = average of the data values.
Trimmed Mean: Delete a (fixed) proportion of the smallest and largest observations (e.g., 5% or 10% each) and then recalculate the mean. Scoring in judged contests, where the highest and lowest scores are dropped, is a familiar example.
Median: Arrange the n observations in order from smallest to largest. If n is odd, the median is the middle number. If n is even, the median is the average of the middle two numbers.
Measures of Position in the Dataset:
The First Quartile Q1 is the median of the numbers below the median, or the 25th percentile.
The Third Quartile Q3 is the median of the numbers above the median, or the 75th percentile.
Quantiles are order statistics expressed as fractions or proportions. For example, the pth quantile Qp (or 100pth percentile) divides the lower fraction p of the data from the upper fraction 1 - p. For example, Q.67 (or 67th percentile) divides the lower 67% of the data from the upper 33% of the data. Q.25 and Q.75 are the first and third quartiles.
The Interquartile Range (IQR) = Q3 – Q1.
Example 1. The body temperatures of 18 adults were measured, resulting in the following values:
98.2 97.8 99.0 98.6 98.2 97.8 98.4 99.7 98.2 97.4 97.6
98.4 98.0 99.2 98.6 97.1 97.2 98.5
Data Display (Sorted, from smallest to largest):
97.1 97.2 97.4 97.6 97.8 97.8 98.0 98.2 98.2 98.2 98.4
98.4 98.5 98.6 98.6 99.0 99.2 99.7
Descriptive Statistics: BodyTemp
Variable  N   Mean    SE Mean  StDev  Minimum  Q1      Median  Q3      Maximum
BodyTemp  18  98.217  0.161    0.684  97.100   97.750  98.200  98.600  99.700
Five-Number Summary: the last five quantities in the descriptive statistics above:
Minimum  Q1      Median  Q3      Maximum
97.100   97.750  98.200  98.600  99.700
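The summary table can be reproduced from the 18 body temperatures. The sketch below uses Python's standard `statistics` module. Choosing `method="exclusive"` is an assumption made here: it interpolates at position (n + 1)/4 and happens to reproduce the tabulated Q1 = 97.750, whereas the "median of the numbers below the median" rule gives 97.8 for these data.

```python
import statistics

body_temp = [98.2, 97.8, 99.0, 98.6, 98.2, 97.8, 98.4, 99.7, 98.2, 97.4, 97.6,
             98.4, 98.0, 99.2, 98.6, 97.1, 97.2, 98.5]

n = len(body_temp)
mean = statistics.mean(body_temp)
sd = statistics.stdev(body_temp)   # sample SD, divisor n - 1
se_mean = sd / n ** 0.5            # standard error of the mean

# Quartiles by linear interpolation at positions i * (n + 1) / 4.
q1, median, q3 = statistics.quantiles(body_temp, n=4, method="exclusive")
five_number = (min(body_temp), q1, median, q3, max(body_temp))

print(f"N = {n}, mean = {mean:.3f}, SE mean = {se_mean:.3f}, SD = {sd:.3f}")
print("five-number summary:", tuple(round(v, 3) for v in five_number))
```

Different software packages use slightly different quartile conventions, so small discrepancies in Q1 and Q3 between tools are expected.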
A Boxplot (simple, no unusual observations) is a graphical display of the five-number summary. The ‘box’ is drawn from Q1 to Q3 with the median marked inside the box. Lines are drawn from the minimum value to the bottom of the box (at Q1) and from the top of the box (at Q3) to the maximum value.
[Figure: Boxplot of BodyTemp]
Outliers:
An observation is a mild outlier if it is more than 1.5 IQR’s below Q1 or more than 1.5 IQR’s above Q3. It is an extreme outlier if it is more than 3 IQR’s below Q1 or above Q3.
Software packages often identify outliers in some fashion; e.g., Minitab puts an ‘*’ for outliers (not necessarily all of them though).
Example: Number of CD’s owned by statistics students at Penn State University:
Variable  N    Mean   SE Mean  StDev  Min  Q1  Median  Q3   Max
CDs       236  78.08  5.57     85.59  0    25  50.00   100  500
Mild Outliers: IQR = Q3 – Q1 = 100 – 25 = 75; (1.5)(IQR) = (1.5)(75) = 112.5.
Mild outliers are #CDs < 25 – 112.5 = –87.5 (impossible for a count) or > 100 + 112.5 = 212.5. There are 17 values > 212.5 (with repeats at some values).
Extreme Outliers: 3 IQR = (3)(75) = 225. Extreme outliers are #CDs < 25 – 225 = –200 (again impossible) or > 100 + 225 = 325. By this rule, there are several extreme outliers. See the boxplot below.
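The mild and extreme outlier cutoffs follow directly from Q1 and Q3. The sketch below applies the 1.5 IQR and 3 IQR rules to the CDs summary (Q1 = 25, Q3 = 100). The function names and the sample values passed to `classify` are illustrative choices, not part of the notes.

```python
def outlier_fences(q1: float, q3: float) -> dict:
    """Return the IQR and the (lower, upper) cutoffs for mild and extreme outliers."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(value: float, q1: float, q3: float) -> str:
    """Label a value as 'extreme', 'mild', or 'none' under the IQR rules."""
    fences = outlier_fences(q1, q3)
    low_x, high_x = fences["extreme"]
    low_m, high_m = fences["mild"]
    if value < low_x or value > high_x:
        return "extreme"
    if value < low_m or value > high_m:
        return "mild"
    return "none"

fences = outlier_fences(25, 100)
print(fences)
for cds in (50, 250, 500):  # illustrative values only
    print(cds, classify(cds, 25, 100))
```

An observation labelled "extreme" also satisfies the mild criterion; the function reports only the stronger label.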
Stem-and-Leaf Plots: A stem-and-leaf plot is a graphical display of data consisting of a stem, the most important part of a number, and leaves, the second most important part of a number.
Example Stem and Leaf diagram of CDs:
Stem-and-leaf of CDs N = 236; Leaf Unit = 10
109 0 00000001111111111111111111111112222222222222222222222222222+
(56) 0 55555555555555555555555555555556666666667777777888899999
71 1 00000000000000000000000001122
42 1 555555555555555
27 2 00000000001
16 2 55555
11 3 000000
5 3 5
4 4 0
3 4 55
1 5 0
[Figure: Boxplot of CDs]
Resistant Statistics: A statistic is said to be ‘resistant’ if its value is relatively unaffected by outliers.
Example 1. The salaries of employees in a small company are as follows:
$20K, $20K, $20K, $20K, $20K, $500K, and $800K.
The average salary is $200K. Delete the highest salary and the mean becomes $100K. Delete the two highest salaries and the mean becomes $20K. The median is $20K in all three situations. The median is a resistant statistic and the average is not.
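The salary example is easy to verify. The sketch below, with salaries in thousands of dollars, recomputes the mean and median as the largest salaries are removed.

```python
import statistics

salaries = [20, 20, 20, 20, 20, 500, 800]  # in $K, sorted from smallest to largest

results = []
for dropped in range(3):  # drop 0, 1, then 2 of the highest salaries
    kept = salaries[: len(salaries) - dropped]
    results.append((dropped, statistics.mean(kept), statistics.median(kept)))

for dropped, mean, median in results:
    print(f"highest {dropped} removed: mean = ${mean}K, median = ${median}K")
```

The mean moves from 200 to 100 to 20 while the median stays at 20 throughout, which is what resistance to outliers means in practice.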
Example 2. Remove the 5 extreme outliers in the CDs dataset and redo the descriptive statistics.
Descriptive Statistics: CD’s with and without the extreme outliers:
Variable            N    Mean   SE Mean  StDev  Min  Q1  Med  Q3   Max
CDs (outliers out)  231  70.46  4.50     68.39  0    25  50   100  300
CDs (outliers in)   236  78.08  5.57     85.59  0    25  50   100  500
Note that the only statistic in the five-number summary that changed was the Max (which had to change!). Note also that the mean and the standard deviation decreased.
Examples of Resistant Statistics: Median, 1st and 3rd quartiles, and IQR (for moderate samples, roughly n = 10 or more).
Non-Resistant Statistic: Mean (average).
Measures of Spread (Variability):
Interquartile Range (IQR), Standard Deviation, Range = Maximum – Minimum, Mean Absolute Deviation, and Median Absolute Deviation.
The IQR measures the spread of the middle 50% of the data.
The Standard Deviation (SD) is roughly the average distance of the values from the mean. The actual definition of the standard deviation, denoted by s, is the square root of the sample variance $s^2$, where
$s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
is the sum of squared deviations of the values from the mean, divided by n – 1. The SD is not a resistant statistic.
The mean absolute deviation = $\sum_i |x_i - \bar{x}| / n$.
The median absolute deviation = $\mathrm{median}_i \, |x_i - \mathrm{median}(x)|$.
Example Body Temperature
The interquartile range IQR = Q3 – Q1 = 98.600 - 97.750 = 0.850
The sample range = Max – Min = 99.7 – 97.1 = 2.60
The sample variance $s^2$ = 0.467353; the SD = s = $\sqrt{s^2}$ = 0.6836
The mean absolute deviation = 0.516667 (deviations taken about 98.2, the median; deviations about the unrounded mean 98.217 give 0.5185)
The median absolute deviation = 0.400
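The measures of spread for the body temperature data can be recomputed as follows. This is a sketch using Python's standard `statistics` module. The quartile method is an assumption chosen to match the tabulated Q1 and Q3, and the mean absolute deviation is computed about both the mean and the median so the two conventions can be compared.

```python
import statistics

body_temp = [98.2, 97.8, 99.0, 98.6, 98.2, 97.8, 98.4, 99.7, 98.2, 97.4, 97.6,
             98.4, 98.0, 99.2, 98.6, 97.1, 97.2, 98.5]
n = len(body_temp)

mean = statistics.mean(body_temp)
med = statistics.median(body_temp)

variance = statistics.variance(body_temp)  # s^2, divisor n - 1
sd = statistics.stdev(body_temp)           # s = sqrt(s^2)
sample_range = max(body_temp) - min(body_temp)

q1, _, q3 = statistics.quantiles(body_temp, n=4, method="exclusive")
iqr = q3 - q1

# Mean absolute deviation, about the mean and about the median.
mad_about_mean = sum(abs(x - mean) for x in body_temp) / n
mad_about_median = sum(abs(x - med) for x in body_temp) / n

# Median absolute deviation: median of the distances from the median.
median_abs_dev = statistics.median(abs(x - med) for x in body_temp)

print(f"IQR = {iqr:.3f}, range = {sample_range:.2f}")
print(f"variance = {variance:.6f}, SD = {sd:.4f}")
print(f"mean abs dev: about mean = {mad_about_mean:.6f}, about median = {mad_about_median:.6f}")
print(f"median abs dev = {median_abs_dev:.3f}")
```

For these data the two versions of the mean absolute deviation differ only in the third decimal place, because the mean (98.217) and the median (98.2) are nearly equal.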
Astronomy Example. We use data from Mukherjee, Feigelson, Babu, et al., “Three Types of Gamma-Ray Bursts” (The Astrophysical Journal, 508, pp. 314-327, 1998), in which there are 11 variables, including 2 measures of burst duration, T50 and T90 (the times in which 50% and 90% of the flux arrives), and total fluence (flu_tot), the sum of 4 time-integrated fluences. Descriptive statistics for the variables ‘flu_tot’ and ln(flu_tot) are given below.
Variable     N    Mean       SE Mean     StDev
flu_tot      802  0.0000125  0.00000164  0.0000465
ln(flu_tot)  802  -12.955    0.0632      1.789
Five-Number Summaries:
Variable     Min           Q1           Median      Q3          Max
flu_tot      0.0000000159  0.000000720  0.00000234  0.00000734  0.000781
ln(flu_tot)  -17.957       -14.144      -12.968     -11.823     -7.155
Empirical Rule: If the data are symmetric and bell-shaped (unimodal), indicative of a normal distribution, then
About 68% of the observations will be within 1 SD of the mean.
About 95% of the observations will be within 2 SDs of the mean.
Almost all (99.7%) of the observations will be within 3 SDs of the mean.
For the variable ‘ln(flu_tot)’, we find that the intervals and the percentages are as follows:
Mean ± 1 StDev   -12.955 ± 1.789   (-14.744, -11.166)   554/802 = 0.6908 or 69%
Mean ± 2 StDev   -12.955 ± 3.578   (-16.533, -9.377)    763/802 = 0.9514 or 95%
Mean ± 3 StDev   -12.955 ± 5.367   (-18.322, -7.588)    800/802 = 0.9975 or 99.75%
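The percentages in the Empirical Rule come from the normal distribution, and the observed fractions for ln(flu_tot) can be set beside them. The sketch below uses only the summary numbers quoted above (mean, SD, and the counts 554, 763, 800 out of 802); the raw burst data are not used.

```python
import math

def normal_coverage(k: float) -> float:
    """P(|Z| < k) for a standard normal Z, i.e. the area within k SDs of the mean."""
    return math.erf(k / math.sqrt(2.0))

mean, sd, n = -12.955, 1.789, 802
observed_counts = {1: 554, 2: 763, 3: 800}  # counts quoted in the table above

rows = []
for k, count in observed_counts.items():
    interval = (round(mean - k * sd, 3), round(mean + k * sd, 3))
    rows.append((k, interval, count / n, normal_coverage(k)))

for k, interval, observed, theory in rows:
    print(f"within {k} SD: {interval}  observed = {observed:.4f}  normal theory = {theory:.4f}")
```

The observed fractions are close to the normal-theory values, which is consistent with ln(flu_tot) being roughly bell-shaped.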
A similar dataset on gamma-ray bursts included a categorical variable, gmark, with four values. Box plots of flu_tot and log(flu_tot) for the four values of gmark are given below. They dramatically illustrate how transforming the data (here, using a log transformation) reduces or eliminates outliers, gives a visual comparison of the five-number summaries, and enables one to compare the values (medians) for the four types of gamma-ray bursts.
[Figure: Boxplot of lf_tot vs gmark]
[Figure: Boxplot of flutot vs gmark]