ADVANCED ECONOMETRIC MODELS with MATLAB
Stepwise Regression to Select Appropriate Models
LinearModel.stepwise creates a linear model and automatically adds to or trims the model. To create a small model, start from a constant model. To create a large model, start with a model containing many terms. A large model usually has lower error as measured by the fit to the original data, but might not have any advantage in predicting new data.
LinearModel.stepwise can use all the name-value options from LinearModel.fit, with additional options relating to the starting and bounding models. In particular:
• For a small model, start with the default lower bounding model: 'constant' (a model that has no predictor terms).
• The default upper bounding model has linear terms and interaction terms (products of pairs of predictors). For an upper bounding model that also includes squared terms, set the Upper name-value pair to 'quadratic'.
Compare large and small stepwise models
This example shows how to compare the models that LinearModel.stepwise returns starting from a constant model and starting from a full interaction model.
Load the carbig data and create a dataset array from some of the data.
load carbig
ds = dataset(Acceleration,Displacement,Horsepower,Weight,MPG);
Create a mileage model stepwise starting from the constant model.
mdl1 = LinearModel.stepwise(ds,'constant','ResponseVar','MPG')
1 Adding Weight, FStat = 888.8507, pValue = 2.9728e-103
2 Adding Horsepower, FStat = 3.8217, pValue = 0.00049608
3 Adding Horsepower:Weight, FStat = 64.8709, pValue = 9.93362e-15
mdl1 =
Linear regression model: MPG ~ 1 + Horsepower*Weight
Estimated Coefficients:
                         Estimate      SE            tStat      pValue
    (Intercept)            63.558        2.3429      27.127     1.2343e-91
    Horsepower           -0.25084      0.027279     -9.1952     2.3226e-18
    Weight              -0.010772    0.00077381     -13.921     5.1372e-36
    Horsepower:Weight   5.3554e-05    6.6491e-06      8.0542     9.9336e-15
Number of observations: 392, Error degrees of freedom: 388
Root Mean Squared Error: 3.93
R-squared: 0.748, Adjusted R-Squared 0.746
F-statistic vs constant model: 385, p-value = 7.26e-116
Create a mileage model stepwise starting from the full interaction model.
mdl2 = LinearModel.stepwise(ds,'interactions','ResponseVar','MPG')
1 Removing Acceleration:Displacement, FStat = 0.024186, pValue = 0.8765
2 Removing Displacement:Weight, FStat = 0.33103, pValue = 0.56539
3 Removing Acceleration:Horsepower, FStat = 1.7334, pValue = 0.18876
4 Removing Acceleration:Weight, FStat = 0.93269, pValue = 0.33477
5 Removing Horsepower:Weight, FStat = 0.64486, pValue = 0.42245
mdl2 =
Linear regression model: MPG ~ 1 + Acceleration + Weight + Displacement*Horsepower
Estimated Coefficients:
                               Estimate      SE           tStat      pValue
    (Intercept)                  61.285       2.8052      21.847     1.8593e-69
    Acceleration               -0.34401      0.11862        -2.9      0.0039445
    Displacement              -0.081198     0.010071     -8.0623     9.5014e-15
    Horsepower                 -0.24313     0.026068     -9.3265     8.6556e-19
    Weight                   -0.0014367   0.00084041     -1.7095       0.088166
    Displacement:Horsepower  0.00054236   5.7987e-05      9.3531     7.0527e-19
Number of observations: 392, Error degrees of freedom: 386
Root Mean Squared Error: 3.84
R-squared: 0.761, Adjusted R-Squared 0.758
F-statistic vs constant model: 246, p-value = 1.32e-117

Notice that:
• mdl1 has four coefficients (the Estimate column), and mdl2 has six coefficients.
• The adjusted R-squared of mdl1 is 0.746, which is slightly less (worse) than that of mdl2, 0.758.
Create a mileage model stepwise with a full quadratic model as the upper bound, starting from the full quadratic model:
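The call itself is missing from the excerpt; based on the Upper name-value pair described above, it presumably had the form (the variable name mdl3 is illustrative):

mdl3 = LinearModel.stepwise(ds,'quadratic','ResponseVar','MPG','Upper','quadratic')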
The models have similar residuals. It is not clear which fits the data better. Interestingly, the more complex models have larger maximum deviations of the residuals:
Rrange1 = [min(mdl1.Residuals.Raw),max(mdl1.Residuals.Raw)];
Rrange2 = [min(mdl2.Residuals.Raw),max(mdl2.Residuals.Raw)];
“What Is Robust Regression?” on page 9-116
“Robust Regression versus Standard Least-Squares Fit” on page 9-116
What Is Robust Regression?
The models described in “What Are Linear Regression Models?” on page 9-7 are based on certain assumptions, such as a normal distribution of errors in the observed responses. If the distribution of errors is asymmetric or prone to outliers, model assumptions are invalidated, and parameter estimates, confidence intervals, and other computed statistics become unreliable. Use LinearModel.fit with the RobustOpts name-value pair to create a model that is not much affected by outliers. The robust fitting method is less sensitive than ordinary least squares to large changes in small parts of the data.
Robust regression works by assigning a weight to each data point. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares. In the first iteration, each point is assigned equal weight and model coefficients are estimated using ordinary least squares. At subsequent iterations, weights are recomputed so that points farther from model predictions in the previous iteration are given lower weight. Model coefficients are then recomputed using weighted least squares. The process continues until the values of the coefficient estimates converge within a specified tolerance.
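To make the process concrete, here is a minimal sketch of iteratively reweighted least squares with a bisquare weight function, assuming a design matrix X (including a column of ones) and a response vector y. This is only an illustration, not the implementation used by LinearModel.fit:

b = X \ y;                                % iteration 1: equal weights (ordinary least squares)
for iter = 1:50
    r = y - X*b;                          % residuals from the previous iteration
    s = mad(r,1)/0.6745;                  % robust estimate of the residual scale
    u = r./(4.685*s);                     % scaled residuals (bisquare tuning constant)
    w = (abs(u) < 1).*(1 - u.^2).^2;      % points far from the fit receive low weight
    bnew = lscov(X,y,w);                  % weighted least-squares refit
    if norm(bnew - b) < 1e-8, break, end  % stop when the coefficient estimates converge
    b = bnew;
end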
Robust Regression versus Standard Least-Squares Fit
This example shows how to use robust regression. It compares the results of a robust fit to a standard least-squares fit.
Step 1 Prepare data.
Load the moore data. The data is in the first five columns, and the response in the sixth.

load moore
X = moore(:,1:5);
y = moore(:,6);
Step 2 Fit robust and nonrobust models.
Fit two linear models to the data, one using robust fitting, one not.

mdl = LinearModel.fit(X,y);                     % not robust
mdlr = LinearModel.fit(X,y,'RobustOpts','on');  % robust
Step 3 Examine model residuals.
Examine the residuals of the two models.

subplot(1,2,1); plotResiduals(mdl,'probability')
subplot(1,2,2); plotResiduals(mdlr,'probability')

The residuals from the robust fit (right half of the plot) are nearly all closer to the straight line, except for the one obvious outlier.
Step 4 Remove the outlier from the standard model.
Find the index of the outlier. Examine the weight of the outlier in the robust fit.

[~,outlier] = max(mdlr.Residuals.Raw);
mdlr.Robust.Weights(outlier)
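The refit itself is not shown in the excerpt; one way to exclude the outlier and refit the standard model uses the Exclude name-value pair (the variable name mdl2 here is illustrative):

mdl2 = LinearModel.fit(X,y,'Exclude',outlier);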
“Introduction to Ridge Regression” on page 9-119 “Ridge Regression” on page 9-119
Introduction to Ridge Regression
Coefficient estimates for the models described in “Linear Regression” on page 9-11 rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the matrix (X'X)^(-1) becomes close to singular. As a result, the least-squares estimate

    b = (X'X)^(-1) X'y

becomes highly sensitive to random errors in the observed response y, producing a large variance. Ridge regression addresses the problem by estimating the coefficients as

    b = (X'X + kI)^(-1) X'y

where k is the ridge parameter and I is the identity matrix. Small positive values of k improve the conditioning of the problem and reduce the variance of the estimates. While biased, the reduced variance of ridge estimates often results in a smaller mean square error when compared to least-squares estimates. The Statistics Toolbox function ridge carries out ridge regression.
xlabel('x2'); ylabel('x3'); grid on; axis square

Note the correlation between x1 and the other two predictor variables.
Use ridge and x2fx to compute coefficient estimates for a multilinear model with interaction terms, for a range of ridge parameters:
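The fitting code is missing from the excerpt; a sketch using the acetylene data (assumed, since the predictors are named x1, x2, and x3 above) might look like this:

load acetylene                  % assumed data set with predictors x1, x2, x3 and response y
X = [x1 x2 x3];
D = x2fx(X,'interaction');      % design matrix with linear and interaction terms
D(:,1) = [];                    % drop the constant column; ridge adds its own
k = 0:1e-5:5e-3;                % range of ridge parameters
b = ridge(y,D,k);               % one column of coefficient estimates per value of k
figure
plot(k,b,'LineWidth',2)
xlabel('Ridge parameter'); ylabel('Standardized coefficient');
title('Ridge Trace')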
The estimates stabilize to the right of the plot. Note that the coefficient of the x2x3 interaction term changes sign at a value of the ridge parameter of approximately 5×10^-4.
Lasso and Elastic Net
In this section
“What Are Lasso and Elastic Net?” on page 9-123 “Lasso Regularization” on page 9-123
“Lasso and Elastic Net with Cross Validation” on page 9-126 “Wide Data via Lasso and Parallel
Computing” on page 9-129 “Lasso and Elastic Net Details” on page 9-134 “References” on page 9-136
What Are Lasso and Elastic Net?
Lasso is a regularization technique. Use lasso to:
• Reduce the number of predictors in a regression model.
• Identify important predictors.
• Select among redundant predictors.
• Produce shrinkage estimates with potentially lower predictive errors than ordinary least squares.
Elastic net is a related technique. Use elastic net when you have several highly correlated variables. lasso provides elastic net regularization when you set the Alpha name-value pair to a number strictly between 0 and 1.
See “Lasso and Elastic Net Details” on page 9-134.
For lasso regularization of regression ensembles, see regularize.
Lasso Regularization
To see how lasso identifies and discards unnecessary predictors:
1 Generate 200 samples of five-dimensional artificial data X from exponential distributions with various means:

rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
    X(:,ii) = exprnd(ii,200,1);
end
2 Generate response data Y = X*r + eps, where r has just two nonzero components, and the noise eps is normal with standard deviation 0.1:
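The data-generation and fitting code is missing from the excerpt. A sketch consistent with the output shown later (rhat is close to [0; 2; 0; -3; 0], and the later steps reference b, fitinfo, and 10-fold cross validation) might be:

r = [0;2;0;-3;0];               % two nonzero components, consistent with rhat below
Y = X*r + randn(200,1)*0.1;     % noise with standard deviation 0.1

[b,fitinfo] = lasso(X,Y,'CV',10);                          % 10-fold cross-validated lasso fit
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');   % trace plot described below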
The plot shows the nonzero coefficients in the regression for various values of the Lambda regularization parameter. Larger values of Lambda appear on the left side of the graph, meaning more regularization, resulting in fewer nonzero regression coefficients.
The dashed vertical lines represent the Lambda value with minimal mean squared error (on the right), and the Lambda value with minimal mean squared error plus one standard deviation. This latter value is a recommended setting for Lambda. These lines appear only when you perform cross validation. Cross validate by setting the 'CV' name-value pair. This example uses 10-fold cross validation.
The upper part of the plot shows the degrees of freedom (df), meaning the number of nonzero coefficients in the regression, as a function of Lambda. On the left, the large value of Lambda causes all but one coefficient to be 0. On the right all five coefficients are nonzero, though the plot shows only two clearly. The other three coefficients are so small that you cannot visually distinguish them from 0. For small values of Lambda (toward the right in the plot), the coefficient values are close to the least-squares estimate. See step 5 on page 9-126.
4 Find the Lambda value of the minimal cross-validated mean squared error plus one standard deviation. Examine the MSE and coefficients of the fit at that Lambda:

lam = fitinfo.Index1SE;
fitinfo.MSE(lam)
lasso did a good job finding the coefficient vector r.
5 For comparison, find the least-squares estimate of r:
rhat = X\Y
rhat =
   -0.0038
    1.9952
    0.0014
   -2.9993
    0.0031
The estimate b(:,lam) has slightly more mean squared error than the mean squared error of rhat:
res = X*rhat - Y;      % calculate residuals
MSEmin = res'*res/200  % b(:,lam) value is 0.1398
MSEmin =
0.0088
But b(:,lam) has only two nonzero components, and therefore can provide better predictive estimates on new data.
Lasso and Elastic Net with Cross Validation
Consider predicting the mileage (MPG) of a car based on its weight, displacement, horsepower, and acceleration. The carbig data contains these measurements. The data seem likely to be correlated, making elastic net an attractive choice.
1 Load the data:
load carbig
2 Extract the continuous (noncategorical) predictors (lasso does not handle categorical predictors):
X = [Acceleration Displacement Horsepower Weight];
3 Perform a lasso fit with 10-fold cross validation:
[b fitinfo] = lasso(X,MPG,'CV',10);
4 Plot the result:
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
5 Calculate the correlation of the predictors:
% Eliminate NaNs so corr runs
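The code for this step, and for step 6, is missing from the excerpt. A sketch that removes rows with NaNs, computes the correlations, and then performs the elastic net fit assumed by step 7 (the Alpha value of 0.5 is an assumption) might be:

nonan = ~any(isnan([X MPG]),2);     % keep only rows with no missing values
Xnonan = X(nonan,:);
corr(Xnonan)                        % correlation matrix of the predictors

% Step 6 (elastic net fit producing ba and fitinfoa used in step 7):
[ba,fitinfoa] = lasso(X,MPG,'CV',10,'Alpha',0.5);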
7 Plot the result. Name each predictor so you can tell which curve is which:
pnames = {'Acceleration','Displacement', 'Horsepower','Weight'};
lassoPlot(ba,fitinfoa,'PlotType','Lambda', 'XScale','log','PredictorNames',pnames);
When you activate the data cursor and click the plot, you see the name of the predictor, the coefficient, the value of Lambda, and the index of that point, meaning the column in b associated with that fit.
Here, the elastic net and lasso results are not very similar. Also, the elastic net plot reflects a notable qualitative property of the elastic net technique. The elastic net retains three nonzero coefficients as Lambda increases (toward the left of the plot), and these three coefficients reach 0 at about the same Lambda value. In contrast, the lasso plot shows two of the three coefficients becoming 0 at the same value of Lambda, while another coefficient remains nonzero for higher values of Lambda.
This behavior exemplifies a general pattern. In general, elastic net tends to retain or drop groups of highly correlated predictors as Lambda increases. In contrast, lasso tends to drop smaller groups, or even individual predictors.
Wide Data via Lasso and Parallel Computing
Lasso and elastic net are especially well suited to wide data, meaning data with more predictors than observations. Obviously, there are redundant predictors in this type of data. Use lasso along with cross validation to identify important predictors.
Cross validation can be slow. If you have a Parallel Computing Toolbox license, speed the computation using parallel computing.
1 Load the spectra data:

load spectra
Description
Description =
== Spectral and octane data of gasoline ==
NIR spectra and octane numbers of 60 gasoline samples
NIR:      NIR spectra, measured in 2 nm intervals from 900 nm to 1700 nm
octane:   octane numbers
spectra:  a dataset array containing variables for NIR and octane
Reference: Kalivas, John H., "Two Data Sets of Near Infrared Spectra," Chemometrics and Intelligent Laboratory Systems, v.37 (1997), pp. 255-259.
2 Compute the default lasso fit:
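The fitting code for steps 2 through 4 is missing from the excerpt; based on the cross-validated call shown later with parallel options, the serial version was presumably:

[b fitinfo] = lasso(NIR,octane);          % default lasso fit

% Steps 3-4 (assumed): refit with 10-fold cross validation and time it
tic
[b fitinfo] = lasso(NIR,octane,'CV',10);
toc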
Elapsed time is 226.876926 seconds
5 Plot the result:
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
You can see the suggested value of Lambda is over 1e-2, and the Lambda with minimal MSE is under 1e-2. These values are in the fitinfo structure:
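The code for this step is missing; a plausible reconstruction, given that lambdaindex is used just below and the reported MSE is 0.0532, is:

lambdaindex = fitinfo.Index1SE;    % index of the suggested (one-standard-deviation) Lambda
fitinfo.MSE(lambdaindex)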
ans =
0.0532
fitinfo.DF(lambdaindex)
ans = 11
The fit uses just 11 of the 401 predictors, and achieves a cross-validated MSE of 0.0532.
7 Examine the plot of cross-validated MSE:
lassoPlot(b,fitinfo,'PlotType','CV');
% Use a log scale for MSE to see small MSE values better
set(gca,'YScale','log');
As Lambda increases (toward the left), MSE increases rapidly. The coefficients are reduced too much and they do not adequately fit the responses.
As Lambda decreases, the models are larger (have more nonzero coefficients). The increasing MSE suggests that the models are overfitted.
The default set of Lambda values does not include values small enough to include all predictors. In this case, there does not appear to be a reason to look at smaller values. However, if you want smaller values than the default, use the LambdaRatio parameter, or supply a sequence of Lambda values using the Lambda parameter. For details, see the lasso reference page.
8 To compute the cross-validated lasso estimate faster, use parallel computing (available with a Parallel Computing Toolbox license):
matlabpool open
Starting matlabpool using the 'local' configuration connected to 4 labs
opts = statset('UseParallel',true);
tic;
[b fitinfo] = lasso(NIR,octane,'CV',10,'Options',opts);
toc
Elapsed time is 107.539719 seconds
Computing in parallel is more than twice as fast on this problem using a quad-core processor.
Lasso and Elastic Net Details
Overview of Lasso and Elastic Net
Lasso is a regularization technique for performing linear regression. Lasso includes a penalty term that constrains the size of the estimated coefficients. Therefore, it resembles ridge regression. Lasso is a shrinkage estimator: it generates coefficient estimates that are biased to be small. Nevertheless, a lasso estimator can have smaller mean squared error than an ordinary least-squares estimator when you apply it to new data.
Unlike ridge regression, as the penalty term increases, lasso sets more coefficients to zero. This means that the lasso estimator is a smaller model, with fewer predictors. As such, lasso is an alternative to stepwise regression and other model selection and dimensionality reduction techniques.
Elastic net is a related technique. Elastic net is a hybrid of ridge regression and lasso regularization. Like lasso, elastic net can generate reduced models by generating zero-valued coefficients. Empirical studies have suggested that the elastic net technique can outperform lasso on data with highly correlated predictors.
Definition of Lasso
The lasso technique solves this regularization problem. For a given value of λ, a nonnegative parameter, lasso solves the problem

    min over β0 and β of   (1/(2N)) * Σ_{i=1..N} (y_i - β0 - x_i'β)^2  +  λ * Σ_{j=1..p} |β_j|

where:
• N is the number of observations.
• y_i is the response at observation i.
• x_i is the data, a vector of p values at observation i.
• λ is a positive regularization parameter corresponding to one value of Lambda.
• The parameters β0 and β are a scalar and a p-vector, respectively.
As λ increases, the number of nonzero components of β decreases.
The lasso problem involves the L1 norm of β, as contrasted with the elastic net algorithm.
Definition of Elastic Net
The elastic net technique solves this regularization problem. For an α strictly between 0 and 1, and a nonnegative λ, elastic net solves the problem

    min over β0 and β of   (1/(2N)) * Σ_{i=1..N} (y_i - β0 - x_i'β)^2  +  λ * P_α(β)

where the penalty term is

    P_α(β) = Σ_{j=1..p} ( ((1 - α)/2) * β_j^2 + α * |β_j| )

Elastic net is the same as lasso when α = 1. As α shrinks toward 0, elastic net approaches ridge regression. For other values of α, the penalty term P_α(β) interpolates between the L1 norm of β and the squared L2 norm of β.
References
[1] Tibshirani, R. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society, Series B, Vol. 58, No. 1, pp. 267–288, 1996.
[2] Zou, H., and T. Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society, Series B, Vol. 67, No. 2, pp. 301–320, 2005.
[3] Friedman, J., R. Tibshirani, and T. Hastie. "Regularization paths for generalized linear models via coordinate descent." Journal of Statistical Software, Vol. 33, No. 1, 2010. http://www.jstatsoft.org/v33/i01
[4] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd edition. Springer, New York, 2008.
In this section
“Introduction to Partial Least Squares” on page 9-137 “Partial Least Squares” on page 9-138
Introduction to Partial Least Squares
Partial least-squares (PLS) regression is a technique used with data that contain correlated predictor variables. This technique constructs new predictor variables, known as components, as linear combinations of the original predictor variables. PLS constructs these components while considering the observed response values, leading to a parsimonious model with reliable predictive power.
The technique is something of a cross between multiple linear regression and principal component analysis:
• Multiple linear regression finds a combination of the predictors that best fits a response.
• Principal component analysis finds combinations of the predictors with large variance, reducing correlations. The technique makes no use of response values.
• PLS finds combinations of the predictors that have a large covariance with the response values.
PLS therefore combines information about the variances of both the predictors and the responses, while also considering the correlations among them.
PLS shares characteristics with other regression and feature transformation techniques. It is similar to ridge regression in that it is used in situations with correlated predictors. It is similar to stepwise regression (or more general feature selection techniques) in that it can be used to select a smaller set of model terms. PLS differs from these methods, however, by transforming the original predictor space into the new component space.
The Statistics Toolbox function plsregress carries out PLS regression. For example, consider the data on biochemical oxygen demand in moore.mat, padded with noisy versions of the predictors to introduce correlations:
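The data-preparation and plotting code is missing from the excerpt. A sketch that pads the five moore predictors with noisy copies (giving the ten predictors referenced below) and plots the percent variance explained in y might be (the noise scale is an assumption):

load moore
y = moore(:,6);                    % response
X0 = moore(:,1:5);                 % original predictors
X1 = X0 + 10*randn(size(X0));      % noisy copies to introduce correlations
X = [X0 X1];                       % ten correlated predictors

[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-o')
xlabel('Number of PLS components'); ylabel('Percent variance explained in y');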
Choosing the number of components in a PLS model is a critical step. The plot gives a rough indication, showing nearly 80% of the variance in y explained by the first component, with as many as five additional components making significant contributions.
The following computes the six-component model:
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);
yfit = [ones(size(X,1),1) X]*beta;
plot(y,yfit,'o')
The scatter shows a reasonable correlation between fitted and observed responses, and this is confirmed by the R^2 statistic:
TSS = sum((y-mean(y)).^2);
RSS = sum((y-yfit).^2);
Rsquared = 1 - RSS/TSS
Rsquared =
0.8421
A plot of the weights of the ten predictors in each of the six components shows that two of the components (the last two computed) explain the majority of the variance in X:
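The plotting code is missing from the excerpt; a sketch using the predictor weights returned in stats.W might be (the legend labels are assumptions):

plot(1:10,stats.W,'o-')
legend({'c1','c2','c3','c4','c5','c6'},'Location','NW')   % one curve per component
xlabel('Predictor'); ylabel('Weight');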
The calculation of mean-squared errors by plsregress is controlled by optional parameter name/value pairs specifying cross-validation type and the number of Monte Carlo repetitions.
Generalized Linear Models
In this section
“What Are Generalized Linear Models?” on page 9-143 “Prepare Data” on page 9-144
“Choose Generalized Linear Model and Link Function” on page 9-146 “Choose Fitting Method and Model” on page 9-150
“Fit Model to Data” on page 9-155
“Examine Quality and Adjust the Fitted Model” on page 9-156 “Predict or Simulate Responses to New Data” on page 9-168 “Share Fitted Models” on page 9-171
“Generalized Linear Model Workflow” on page 9-173
What Are Generalized Linear Models?
Linear regression models describe a linear relationship between a response and one or more predictive terms. Many times, however, a nonlinear relationship exists. “Nonlinear Regression” on page 9-198 describes general nonlinear models. A special class of nonlinear models, called generalized linear models, uses linear methods.
Recall that linear models have these characteristics:
• At each set of values for the predictors, the response has a normal distribution with mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• The model is μ = Xb.
In generalized linear models, these characteristics are generalized as follows:
• At each set of values for the predictors, the response has a distribution that can be normal, binomial, Poisson, gamma, or inverse Gaussian, with parameters including a mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• A link function f defines the model as f(μ) = Xb.
Prepare Data
To begin fitting a regression, put your data into a form that fitting functions expect. All regression techniques begin with input data in an array X and response data in a separate vector y, or input data in a dataset array ds and response data as a column in ds. Each row of the input data represents one observation. Each column represents one predictor (variable).
For a dataset array ds, indicate the response variable with the 'ResponseVar' name-value pair:
mdl = LinearModel.fit(ds,'ResponseVar','BloodPressure');
% or
mdl = GeneralizedLinearModel.fit(ds,'ResponseVar','BloodPressure');
The response variable is the last column by default.
You can use numeric categorical predictors. A categorical predictor is one that takes values from a fixed set of possibilities.
• For a numeric array X, indicate the categorical predictors using the 'Categorical' name-value pair. For example, to indicate that predictors 2 and 3 out of six are categorical:
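The example call itself is missing from the excerpt; it presumably had the form (using the generic X and y from this section):

mdl = LinearModel.fit(X,y,'Categorical',[2,3]);
% or
mdl = GeneralizedLinearModel.fit(X,y,'Categorical',[2,3]);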
• For a dataset array ds, fitting functions treat these data types as categorical:
- Categorical (nominal or ordinal)
- String or character array
If you want to indicate that a numeric predictor is categorical, use the 'Categorical' name-value pair. Represent missing numeric data as NaN. To represent missing data for other data types, see “Missing Group Values” on page 2-53.
• For a 'binomial' model with data matrix X, the response y can be:
- Binary column vector — Each entry represents success (1) or failure (0).
- Two-column matrix of integers — The first column is the number of successes in each observation, the
second column is the number of trials in that observation
• For a 'binomial' model with dataset ds:
- Use the ResponseVar name-value pair to specify the column of ds that gives the number of successes in each observation.
ds = dataset(MPG,Weight);
ds.Year = ordinal(Model_Year);
Numeric Matrix for Input Data, Numeric Vector for Response
For example, to create numeric arrays from workspace variables:
load carsmall
X = [Weight Horsepower Cylinders Model_Year];
y = MPG;
To create numeric arrays from an Excel spreadsheet:
[X Xnames] = xlsread('hospital.xls');
y = X(:,4);     % response y is systolic pressure
X(:,4) = [];    % remove y from the X matrix

Notice that the nonnumeric entries, such as sex, do not appear in X.
Choose Generalized Linear Model and Link Function
Often, your data suggests the distribution type of the generalized linear model.

Response Data Type                                         Suggested Model Distribution Type
Any real number                                            'normal'
Any positive number                                        'gamma' or 'inverse gaussian'
Any nonnegative integer                                    'poisson'
Integer from 0 to n, where n is a fixed positive value     'binomial'
Set the model distribution type with the Distribution name-value pair. After selecting your model type, choose a link function to map between the mean μ and the linear predictor Xb.
Value : Description
'comploglog' : log(-log(1 - μ)) = Xb
'identity', default for the distribution 'normal' : μ = Xb
'log', default for the distribution 'poisson' : log(μ) = Xb
'logit', default for the distribution 'binomial' : log(μ/(1 - μ)) = Xb
'loglog' : log(-log(μ)) = Xb
'probit' : norminv(μ) = Xb, where norminv is the inverse of the standard normal cumulative distribution
p (a number), default for the distribution 'inverse gaussian' (with p = -2) : μ^p = Xb
Cell array of the form {FL FD FI}, containing three function handles, created using @, that define the link (FL), the derivative of the link (FD), and the inverse link (FI); equivalently, a structure of function handles with field Link containing FL, field Derivative containing FD, and field Inverse containing FI : User-specified link function (see “Custom Link Function” on page 9-147)
The nondefault link functions are mainly useful for binomial models. These nondefault link functions are 'comploglog', 'loglog', and 'probit'.
Custom Link Function
The link function defines the relationship f(μ) = Xb between the mean response μ and the linear combination Xb = X*b of the predictors. You can choose one of the built-in link functions or define your own by specifying the link function FL, its derivative FD, and its inverse FI:
• The link function FL calculates f(μ).
• The derivative of the link function FD calculates df(μ)/dμ.
• The inverse function FI calculates g(Xb) = μ.
You can specify a custom link function in either of two equivalent ways. Each way contains function handles that accept a single array of values representing μ or Xb, and returns an array of the same size. The function handles are either in a cell array or a structure:
• Cell array of the form {FL FD FI}, containing three function handles,
created using @, that define the link (FL), the derivative of the link (FD), and the inverse link (FI)
• Structure s with three fields, each containing a function handle created using @:
- s.Link — Link function
- s.Derivative — Derivative of the link function
- s.Inverse — Inverse of the link function
For example, to fit a model using the 'probit' link function:
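The fitting call is missing from the excerpt; given the custom-link version shown just below, which performs identically, the probit fit was presumably:

g = GeneralizedLinearModel.fit(x,[y n],'linear','distr','binomial','link','probit')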
Chi^2-statistic vs constant model: 241, p-value = 2.25e-54
You can perform the same fit using a custom link function that performs identically to the 'probit' link function:

s = {@norminv,@(x)1./normpdf(norminv(x)),@normcdf};
g = GeneralizedLinearModel.fit(x,[y n],'linear','distr','binomial','link',s)

Generalized Linear regression model: link(y) ~ 1 + x1
Chi^2-statistic vs constant model: 241, p-value = 2.25e-54
Choose Fitting Method and Model
There are two ways to create a fitted model.
• Use GeneralizedLinearModel.fit when you have a good idea of your generalized linear model, or when you want to adjust your model later to include or exclude certain terms.
• Use GeneralizedLinearModel.stepwise when you want to fit your model using stepwise regression. GeneralizedLinearModel.stepwise starts from one model, such as a constant, and adds or subtracts terms one at a time, choosing an optimal term each time in a greedy fashion, until it cannot improve further. Use stepwise fitting to find a good model, one that has only relevant terms.
The result depends on the starting model. Usually, starting with a constant model leads to a small model. Starting with more terms can lead to a more complex model, but one that has lower mean squared error.
In either case, provide a model to the fitting function (which is the starting model for GeneralizedLinearModel.stepwise).
Specify a model using one of these methods.
• “Brief String” on page 9-150
• “Terms Matrix” on page 9-151
'constant'       Model contains only a constant (intercept) term.
'linear'         Model contains an intercept and linear terms for each predictor.
'interactions'   Model contains an intercept, linear terms, and all products of pairs of distinct predictors (no squared terms).
'purequadratic'  Model contains an intercept, linear terms, and squared terms.
'quadratic'      Model contains an intercept, linear terms, interactions, and squared terms.
'polyijk'        Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, and so on. Use numerals 0 through 9. For example, 'poly2111' has a constant plus all linear and product terms, and also contains terms with predictor 1 squared.
Terms Matrix
A terms matrix is a T-by-(P + 1) matrix specifying terms in a model, where T is the number of terms, P is the number of predictor variables, and the extra column is for the response variable. The value of T(i,j) is the exponent of variable j in term i. For example, if there are three predictor variables A, B, and C:

[0 0 0 0] % constant term or intercept
[0 1 0 0] % B; equivalently, A^0 * B^1 * C^0
[1 0 1 0] % A*C
[2 0 0 0] % A^2
[0 1 2 0] % B*(C^2)
The 0 at the end of each term represents the response variable. In general:
• If you have the variables in a dataset array, then a 0 must represent the response variable, depending on the position of the response variable in the dataset array. For example:
Load sample data and define the dataset array.

Now, the response variable is the first term in the dataset array. Specify the same linear model, 'BloodPressure ~ 1 + Sex + Age + Smoker', using a terms matrix.
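The terms matrix itself is missing from the excerpt; a sketch, assuming the dataset array ds has its variables ordered as BloodPressure, Sex, Age, Smoker, might be:

% Columns correspond to [BloodPressure Sex Age Smoker]; the response column is always 0.
T = [0 0 0 0     % intercept
     0 1 0 0     % Sex
     0 0 1 0     % Age
     0 0 0 1];   % Smoker
mdl = LinearModel.fit(ds,T)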
• If you have the predictor and response variables in a matrix and column vector, then you must include a 0 for the response variable at the end of each term. For example:
Load sample data and define the matrix of predictors.
load carsmall
X = [Acceleration,Weight];
Specify the model 'MPG ~ Acceleration + Weight + Acceleration:Weight + Weight^2' using a terms matrix and fit the model to data. This model includes the main effect and two-way interaction terms for the variables, Acceleration and Weight, and a second-order term for the variable, Weight.
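The terms matrix and fitting call are missing from the excerpt; a sketch consistent with the model described above might be:

% Columns correspond to [Acceleration Weight MPG]; the response column is always 0.
T = [0 0 0     % intercept
     1 0 0     % Acceleration
     0 1 0     % Weight
     1 1 0     % Acceleration:Weight
     0 2 0];   % Weight^2
mdl = LinearModel.fit(X,MPG,T)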
Number of observations: 94, Error degrees of freedom: 89 Root Mean Squared Error: 4.1
R-squared: 0.751, Adjusted R-Squared 0.739
F-statistic vs constant model: 67, p-value = 4.99e-26
Only the intercept and x2 term, which corresponds to the Weight variable, are significant at the 5% significance level.
Now, perform a stepwise regression with a constant model as the starting model and a linear model with interactions as the upper model.
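The call itself is missing from the excerpt; it presumably had the form (the variable name mdl2 is illustrative):

mdl2 = LinearModel.stepwise(X,MPG,'constant','Upper','interactions')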
Number of observations: 94, Error degrees of freedom: 92 Root Mean Squared Error: 4.13
R-squared: 0.738, Adjusted R-Squared 0.735
F-statistic vs constant model: 259, p-value = 1.64e-28
The results of the stepwise regression are consistent with the results of LinearModel.fit in the previous step.
Formula
A formula is a string specifying the model in the form 'Y ~ terms', where the terms are defined using these operators:
- + to include the next variable
- - to exclude the next variable
- : to define an interaction, a product of terms
- * to define an interaction and all lower-order terms
- ^ to raise the predictor to a power, exactly as in * repeated, so ^ includes lower-order terms as well
- () to group terms
Tip Formulas include a constant (intercept) term by default. To exclude a constant term from the model, include -1 in the formula.
Examples:
'Y ~ A + B + C' is a three-variable linear model with intercept.
'Y ~ A + B + C - 1' is a three-variable linear model without intercept.
'Y ~ A + B + C + B^2' is a three-variable model with intercept and a B^2 term.
'Y ~ A + B^2 + C' is the same as the previous example, since B^2 includes a B term.
'Y ~ A + B + C + A:B' includes an A*B term.
'Y ~ A*B + C' is the same as the previous example, since A*B = A + B + A:B.
'Y ~ A*B*C - A:B:C' has all interactions among A, B, and C, except the three-way interaction.
'Y ~ A*(B + C + D)' has all linear terms, plus products of A with each of the other variables.
Fit Model to Data
Create a fitted model using GeneralizedLinearModel.fit or GeneralizedLinearModel.stepwise. Choose between them as in “Choose Fitting Method and Model” on page 9-150. For generalized linear models other than those with a normal distribution, give a Distribution name-value pair as in “Choose Generalized Linear Model and Link Function” on page 9-146. For example,

mdl = GeneralizedLinearModel.fit(X,y,'linear','Distribution','poisson')
% or
mdl = GeneralizedLinearModel.fit(X,y,'quadratic','Distribution','binomial')
Examine Quality and Adjust the Fitted Model
After fitting a model, examine the result.
• “Model Display” on page 9-156
• “Diagnostic Plots” on page 9-157
• “Residuals — Model Quality for Training Data” on page 9-160
• “Plots to Understand Predictor Effects and How to Modify a Model” on page 9-163
Model Display
A linear regression model shows several diagnostics when you enter its name or enter disp(mdl). This display gives some of the basic information to check whether the fitted model represents the data adequately.
For example, fit a Poisson model to data constructed with two out of five predictors not affecting the response, and with no intercept term:
rng('default') % for reproducibility
X = randn(100,5);
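The rest of the construction is missing from the excerpt; a sketch consistent with the true coefficient values [0;.4;0;0;.2;.3] noted below might be:

mu = exp(X(:,[1 4 5])*[.4;.2;.3]);   % predictors 2 and 3 do not affect the response
y = poissrnd(mu);
mdl = GeneralizedLinearModel.fit(X,y,'linear','Distribution','poisson')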
100 observations, 94 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs constant model: 44.9, p-value = 1.55e-08
Notice that:
• The display contains the estimated values of each coefficient in the Estimate column. These values are reasonably near the true values [0;.4;0;0;.2;.3], except possibly the coefficient of x3 is not terribly near 0.
• There is a standard error column for the coefficient estimates.
• The reported pValues (which are derived from the t statistics under the assumption of normal errors) for predictors 1, 4, and 5 are small. These are the three predictors that were used to create the response data y.
• The pValues for (Intercept), x2 and x3 are larger than 0.01. These three predictors were not used to create the response data y. The pValue for x3 is just over .05, so might be regarded as possibly significant.
It is reasonable to assume that the values of poor follow binomial distributions, with the number of trials given by total and the percentage of successes depending on w. This distribution can be accounted for in the context of a logistic model by using a generalized linear model with link function log(μ/(1 - μ)) = Xb. This link function is called 'logit'.
12 observations, 10 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs constant model: 242, p-value = 1.3e-54
See how well the model fits the data.
This is typical of a regression with points ordered by the predictor variable. The leverage of each point on the fit is higher for points with relatively extreme predictor values (in either direction) and low for points with average predictor values. In examples with multiple predictors and with points not ordered by predictor value, this plot can help you identify which observations have high leverage because they are outliers as measured by their predictor values.
Residuals — Model Quality for Training Data
There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.
This example shows residual plots for a fitted Poisson model. The data construction has two out of five predictors not affecting the response, and no intercept term:
rng('default') % for reproducibility
X = randn(100,5);
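The rest of the construction and the first plotting call are missing from the excerpt; a sketch (the coefficient values are assumptions) might be:

mu = exp(X(:,[1 4 5])*[2;1;.5]);     % predictors 2 and 3 do not affect the response
y = poissrnd(mu);
mdl = GeneralizedLinearModel.fit(X,y,'linear','Distribution','poisson');
plotResiduals(mdl)                   % default histogram of raw residuals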
Trang 32While most residualscluster near 0, there are several near ±18 So examine a different residuals plot.
Now it is clear. The residuals do not follow a normal distribution. Instead, they have fatter tails, much as an underlying Poisson distribution would.
Plots to Understand Predictor Effects and How to Modify a Model
This example shows how to understand the effect each predictor has on a regression model, and how to modify the model to remove unnecessary terms.
1 Create a model from some predictors in artificial data. The data do not use the second and third columns in X. So you expect the model not to show much dependence on those predictors.

rng('default') % for reproducibility
X = randn(100,5);
mu = exp(X(:,[1 4 5])*[2;1;.5]);
y = poissrnd(mu);
mdl = GeneralizedLinearModel.fit(X,y,'linear','Distribution','poisson');

2 Examine a slice plot of the responses. This displays the effect of each predictor separately.

plotSlice(mdl)
The scale of the first predictor is overwhelming the plot. Disable it using the Predictors menu.
Now it is clear that predictors 2 and 3 have little to no effect.
You can drag the individual predictor values, which are represented by dashed blue vertical lines. You can also choose between simultaneous and non-simultaneous confidence bounds, which are represented by dashed red curves. Dragging the predictor lines confirms that predictors 2 and 3 have little to no effect.
3 Remove the unnecessary predictors using either removeTerms or step. Using step can be safer, in case there is an unexpected importance to a term that becomes apparent after removing another term. However, sometimes removeTerms can be effective when step does not proceed. In this case, the two give identical results.
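The removal step itself is missing from the excerpt; a sketch (the variable name mdl1 is illustrative) might be:

mdl1 = removeTerms(mdl,'x2 + x3')
% or remove the terms one at a time with step, for example:
% mdl1 = step(mdl,'Lower','y ~ x1 + x4 + x5','NSteps',5)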
Generalized Linear regression model:
    log(y) ~ 1 + x1 + x4 + x5
    Distribution = Poisson

Estimated Coefficients:
                   Estimate    SE          tStat     pValue
    (Intercept)     0.17604    0.062215    2.8295       0.004662
    x1               1.9122    0.024638    77.614              0
    x4              0.98521    0.026393    37.328    5.6696e-305
    x5              0.61321    0.038435    15.955     2.6473e-57

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs constant model: 4.97e+04, p-value = 0
Predict or Simulate Responses to New Data
There are three ways to use a linear model to predict the response to new data: the predict method, the feval method, and the random method.
1 Create a model from some predictors in artificial data. The data do not use the second and third columns in X. So you expect the model not to show much dependence on these predictors. Construct the model stepwise to include the relevant predictors automatically.
rng('default') % for reproducibility
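The rest of the construction is missing from the excerpt; based on the identical example in “Share Fitted Models” below, it was presumably:

X = randn(100,5);
mu = exp(X(:,[1 4 5])*[2;1;.5]);
y = poissrnd(mu);
mdl = GeneralizedLinearModel.stepwise(X,y, ...
    'constant','upper','linear','Distribution','poisson');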
1 Adding x1, Deviance = 2515.02869, Chi2Stat = 47242.9622, PValue = 0
2 Adding x4, Deviance = 328.39679, Chi2Stat = 2186.6319, PValue = 0
3 Adding x5, Deviance = 96.3326, Chi2Stat = 232.0642, PValue = 2.114384e-52
2 Generate some new data, and evaluate the predictions from the data.

Xnew = randn(3,5) + repmat([1 2 3 4 5],[3,1]); % new data
[ynew,ynewci] = predict(mdl,Xnew)
ynew =
1.0e+04 *
This example shows how to predict mean responses using the feval method.
1 Create a model from some predictors in artificial data. The data do not use the second and third columns in X. So you expect the model not to show much dependence on these predictors. Construct the model stepwise to include the relevant predictors automatically.
rng('default') % for reproducibility
1 Adding x1, Deviance = 2515.02869, Chi2Stat = 47242.9622, PValue = 0
2 Adding x4, Deviance = 328.39679, Chi2Stat = 2186.6319, PValue = 0
3 Adding x5, Deviance = 96.3326, Chi2Stat = 232.0642, PValue = 2.114384e-52
2 Generate some new data, and evaluate the predictions from the data.
Xnew = randn(3,5) + repmat([1 2 3 4 5],[3,1]); % new data
ynew = feval(mdl,Xnew(:,1),Xnew(:,4),Xnew(:,5)) % only need predictors 1, 4, and 5
1.7375
3.7471
random
The random method generates new random response values for specified predictor values. The distribution of the response values is the distribution used in the model. random calculates the mean of the distribution from the predictors, estimated coefficients, and link function. For distributions such as normal, the model also provides an estimate of the variance of the response. For the binomial and Poisson distributions, the variance of the response is determined by the mean; random does not use a separate "dispersion" estimate.
This example shows how to simulate responses using the random method.
1 Create a model from some predictors in artificial data. The data do not use the second and third columns in X. So you expect the model not to show much dependence on these predictors. Construct the model stepwise to include the relevant predictors automatically.
rng('default') % for reproducibility
1 Adding x1, Deviance = 2515.02869, Chi2Stat = 47242.9622, PValue = 0
2 Adding x4, Deviance = 328.39679, Chi2Stat = 2186.6319, PValue = 0
3 Adding x5, Deviance = 96.3326, Chi2Stat = 232.0642, PValue = 2.114384e-52
2 Generate some new data, and evaluate the predictions from the data.

Xnew = randn(3,5) + repmat([1 2 3 4 5],[3,1]); % new data
ysim = random(mdl,Xnew)
ysim =
1111
17121
37457
The predictions from random are Poisson samples, so they are integers.
3 Evaluate the random method again; the result changes.
Share Fitted Models
The model display contains enough information to enable someone else to recreate the model in a theoretical sense. For example,
rng('default') % for reproducibility
X = randn(100,5);
mu = exp(X(:,[1 4 5])*[2;1;.5]);
y = poissrnd(mu);
mdl = GeneralizedLinearModel.stepwise(X,y, ...
    'constant','upper','linear','Distribution','poisson')
1 Adding x1, Deviance = 2515.02869, Chi2Stat = 47242.9622, PValue = 0
2 Adding x4, Deviance = 328.39679, Chi2Stat = 2186.6319, PValue = 0
3 Adding x5, Deviance = 96.3326, Chi2Stat = 232.0642, PValue = 2.114384e-52
mdl =
Generalized Linear regression model: log(y) ~ 1 + x1 + x4 + x5 Distribution = Poisson
Estimated Coefficients:
                   Estimate    SE          tStat     pValue
    (Intercept)     0.17604    0.062215    2.8295       0.004662
    x1               1.9122    0.024638    77.614              0
    x4              0.98521    0.026393    37.328    5.6696e-305
    x5              0.61321    0.038435    15.955     2.6473e-57

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs constant model: 4.97e+04, p-value = 0
You can access the model description programmatically, too. For example,
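The example is missing from the excerpt; a sketch of programmatic access using properties of the fitted-model object might be:

mdl.Coefficients.Estimate     % vector of coefficient estimates
mdl.Formula                   % the model formula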
Generalized Linear Model Workflow
This example shows a typical workflow: import data, fit a generalized linear model, test its quality, modify it to improve the quality, and make predictions based on the model. It computes the probability that a flower is in one of two classes, based on the Fisher iris data.
Step 1 Load the data.
Load the Fisher iris data. Extract the rows that have classification versicolor or virginica. These are rows 51 to 150. Create logical response variables that are true for versicolor flowers.
load fisheriris
X = meas(51:end,:);                        % versicolor and virginica
y = strcmp('versicolor',species(51:end));
Step 2 Fit a generalized linear model.
Fit a binomial generalized linear model to the data.
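The fitting call is missing from the excerpt; it presumably had the form:

mdl = GeneralizedLinearModel.fit(X,y,'linear','Distribution','binomial')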
Dispersion: 1
Chi^2-statistic vs constant model: 127, p-value = 1.95e-26
Step 3 Examine the result, consider alternative models.
Some p-values in the pValue column are not very small. Perhaps the model can be simplified.
See if some 95% confidence intervals for the coefficients include 0. If so, perhaps these model terms could be removed.
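The code is missing here; the confidence intervals can be obtained with coefCI:

confint = coefCI(mdl)    % 95% confidence intervals for the coefficients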
Only two of the predictors have coefficients whose confidence intervals do not include 0.
The coefficients of 'x1' and 'x2' have the largest p-values. Test whether both coefficients could be zero.
M = [0 1 0 0 0 % picks out coefficient for column 1
0 0 1 0 0]; % picks out coefficient for column 2
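The test itself is missing from the excerpt; it can be carried out with coefTest:

p = coefTest(mdl,M)      % p-value for the hypothesis that both coefficients are zero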
Perhaps it would have been better to have GeneralizedLinearModel.stepwise identify the model initially.
mdl2 = GeneralizedLinearModel.stepwise(X,y,
'constant','Distribution','binomial','upper','linear')
1 Adding x4, Deviance = 33.4208, Chi2Stat = 105.2086, PValue = 1.099298e-24
2 Adding x3, Deviance = 20.5635, Chi2Stat = 12.8573, PValue = 0.000336166
3 Adding x2, Deviance = 13.2658, Chi2Stat = 7.29767, PValue = 0.00690441
mdl2 =
Generalized Linear regression model: logit(y) ~ 1 + x2 + x3 + x4 Distribution = Binomial
Estimated Coefficients:
                   Estimate    SE        tStat      pValue
    (Intercept)      50.527    23.995     2.1057    0.035227
    x2               8.3761    4.7612     1.7592    0.078536
    x3              -7.8745    3.8407    -2.0503    0.040334
    x4               -21.43    10.707    -2.0014     0.04535

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs constant model: 125, p-value = 5.4e-27
GeneralizedLinearModel.stepwise included 'x2' in the model, because it neither adds nor removes terms
with p-values between 0.05 and 0.10.
Step 4 Look for outliers and exclude them.
Examine a leverage plot to look for influential outliers.
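The plotting call is missing from the excerpt; the leverage plot can be produced with plotDiagnostics:

plotDiagnostics(mdl2,'leverage')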