
SPSS® Data Analysis for Univariate, Bivariate and Multivariate Statistics (2019).pdf


DOCUMENT INFORMATION

Basic information

Title: SPSS Data Analysis for Univariate, Bivariate and Multivariate Statistics
Author: Daniel J. Denis
Publisher: John Wiley & Sons, Inc.
Subject area: Statistics and Data Analysis
Document type: Book
Year of publication: 2019
City: Hoboken, NJ

Format

Pages: 206
File size: 9 MB


Contents



Daniel J. Denis

SPSS Data Analysis for Univariate, Bivariate, and Multivariate Statistics


This edition first published 2019

© 2019 John Wiley & Sons, Inc.

Printed in the United States of America

Set in 10/12pt Warnock by SPi Global, Pondicherry, India

Library of Congress Cataloging‐in‐Publication Data

Names: Denis, Daniel J., 1974– author.

Title: SPSS data analysis for univariate, bivariate, and multivariate statistics / Daniel J. Denis.

Description: Hoboken, NJ : Wiley, 2019. | Includes bibliographical references and index.

Identifiers: LCCN 2018025509 (print) | LCCN 2018029180 (ebook) | ISBN 9781119465805 (Adobe PDF) | ISBN 9781119465782 (ePub) | ISBN 9781119465812 (hardcover)

Subjects: LCSH: Analysis of variance–Data processing. | Multivariate analysis–Data processing. | Mathematical statistics–Data processing. | SPSS (Computer file)

Classification: LCC QA279 (ebook) | LCC QA279 .D45775 2019 (print) | DDC 519.5/3–dc23

LC record available at https://lccn.loc.gov/2018025509


Preface ix

1 Review of Essential Statistical Principles 1

1.1 Variables and Types of Data 2

1.2 Significance Tests and Hypothesis Testing 3

1.3 Significance Levels and Type I and Type II Errors 4

1.4 Sample Size and Power 5

1.5 Model Assumptions 6

2.1 How to Communicate with SPSS 9

2.2 Data View vs Variable View 10

2.3 Missing Data in SPSS: Think Twice Before Replacing Data! 12

3 Exploratory Data Analysis, Basic Statistics, and Visual Displays 19

3.1 Frequencies and Descriptives 19

3.2 The Explore Function 23

3.3 What Should I Do with Outliers? Delete or Keep Them? 28

3.4 Data Transformations 29

4.1 Computing a New Variable 33

4.2 Selecting Cases 34

4.3 Recoding Variables into Same or Different Variables 36

4.4 Sort Cases 37

4.5 Transposing Data 38

5 Inferential Tests on Correlations, Counts, and Means 41

5.1 Computing z‐Scores in SPSS 41


5.7 Two‐sample t‐Test for Means 59

6.1 Example Using G*Power: Estimating Required Sample Size for Detecting Population Correlation 64

6.2 Power for Chi‐square Goodness of Fit 66

6.3 Power for Independent‐samples t‐Test 66

6.4 Power for Paired‐samples t‐Test 67

7 Analysis of Variance: Fixed and Random Effects 69

7.1 Performing the ANOVA in SPSS 70

7.2 The F‐Test for ANOVA 73

7.3 Effect Size 74

7.4 Contrasts and Post Hoc Tests on Teacher 75

7.5 Alternative Post Hoc Tests and Comparisons 78

7.6 Random Effects ANOVA 80

7.7 Fixed Effects Factorial ANOVA and Interactions 82

7.8 What Would the Absence of an Interaction Look Like? 86

7.9 Simple Main Effects 86

7.10 Analysis of Covariance (ANCOVA) 88

7.11 Power for Analysis of Variance 90

8.1 One‐way Repeated Measures 91

8.2 Two‐way Repeated Measures: One Between and One Within Factor 99

9.1 Example of Simple Linear Regression 103

9.2 Interpreting a Simple Linear Regression: Overview of Output 105

9.3 Multiple Regression Analysis 107

9.4 Scatterplot Matrix 111

9.5 Running the Multiple Regression 112

9.6 Approaches to Model Building in Regression 118

9.7 Forward, Backward, and Stepwise Regression 120

9.8 Interactions in Multiple Regression 121

9.9 Residuals and Residual Plots: Evaluating Assumptions 123

9.10 Homoscedasticity Assumption and Patterns of Residuals 125

9.11 Detecting Multivariate Outliers and Influential Observations 126

9.12 Mediation Analysis 127

9.13 Power for Regression 129

10.1 Example of Logistic Regression 132

10.2 Multiple Logistic Regression 138

10.3 Power for Logistic Regression 139


11.1 Example of MANOVA 142

11.2 Effect Sizes 146

11.3 Box’s M Test 147

11.4 Discriminant Function Analysis 148

11.5 Equality of Covariance Matrices Assumption 152

11.6 MANOVA and Discriminant Analysis on Three Populations 153

11.7 Classification Statistics 159

11.8 Visualizing Results 161

11.9 Power Analysis for MANOVA 162

12.1 Example of PCA 163

12.2 Pearson’s 1901 Data 164

12.3 Component Scores 166

12.4 Visualizing Principal Components 167

12.5 PCA of Correlation Matrix 170

13.1 The Common Factor Analysis Model 175

13.2 The Problem with Exploratory Factor Analysis 176

13.3 Factor Analysis of the PCA Data 176

13.4 What Do We Conclude from the Factor Analysis? 179

13.5 Scree Plot 180

13.6 Rotating the Factor Solution 181

13.7 Is There Sufficient Correlation to Do the Factor Analysis? 182

13.8 Reproducing the Correlation Matrix 183

13.9 Cluster Analysis 184

13.10 How to Validate Clusters? 187

13.11 Hierarchical Cluster Analysis 188

14.1 Independent‐samples: Mann–Whitney U 192

14.2 Multiple Independent‐samples: Kruskal–Wallis Test 193

14.3 Repeated Measures Data: The Wilcoxon Signed‐rank Test and Friedman Test 194

14.4 The Sign Test 196

Closing Remarks and Next Steps 199

References 201

Index 203


The goals of this book are to present a very concise, easy-to-use introductory primer of a host of computational tools useful for making sense out of data, whether those data come from the social, behavioral, or natural sciences, and to get you started doing data analysis fast. The emphasis of the book is on data analysis and drawing conclusions from empirical observations. The emphasis of the book is not on theory. Formulas are given where needed in many places, but the focus of the book is on concepts rather than on mathematical abstraction. We emphasize computational tools used in the discovery of empirical patterns and feature a variety of popular statistical analyses and data management tasks that you can immediately apply as needed to your own research. The book features analyses and demonstrations using SPSS. Most of the data sets analyzed are very small and convenient, so entering them into SPSS should be easy. If desired, however, one can also download them from www.datapsyc.com. Many of the data sets were also first used in a more theoretical text written by the same author (see Denis, 2016), which should be consulted for a more in-depth treatment of the topics presented in this book. Additional references for readings are also given throughout the book.

Target Audience and Level

This is a “how-to” book and will be of use to undergraduate and graduate students, along with researchers and professionals who require a quick go-to source to help them perform essential statistical analyses and data management tasks. The book assumes only minimal prior knowledge of statistics, providing you with the tools you need right now to help you understand and interpret your data analyses. A prior introductory course in statistics at the undergraduate level would be helpful, but is not required for this book. Instructors may choose to use the book either as a primary text for an undergraduate or graduate course or as a supplement to a more technical text, referring to this book primarily for the “how to's” of data analysis in SPSS. The book can also be used for self-study. It is suitable for use as a general reference in all social and natural science fields and may also be of interest to those in business who use SPSS for decision-making. References to further reading are provided where appropriate should the reader wish to follow up on these topics or expand his or her knowledge base as it pertains to theory and further applications. An early chapter reviews essential statistical and research principles usually covered in an introductory statistics course, which should be sufficient for understanding the rest of the book and interpreting analyses. Brief sample write-ups are also provided for select analyses in places to give the reader a starting point for writing up his or her own results for a thesis, dissertation, or publication. The book is meant to be an […]ing their implementation in SPSS. Please contact me at daniel.denis@umontana.edu or through datapsyc.com with any comments or corrections.

Glossary of Icons and Special Features

When you see this symbol, it means a brief sample write-up has been provided for the accompanying output. These brief write-ups can be used as starting points for writing up your own results for your thesis/dissertation or even publication.

When you see this symbol, it means a special note, hint, or reminder has been provided, or it signifies extra insight into something not thoroughly discussed in the text.

When you see this symbol, it means a special WARNING has been issued that, if not followed, may result in a serious error.

Acknowledgments

Thanks go out to Wiley for publishing this book, especially to Jon Gurstelle for presenting the idea to Wiley and securing the contract for the book, and to Mindy Okura-Marszycki for taking over the project after Jon left. Thank you, Kathleen Pagliaro, for keeping in touch about this project and the former book. Thanks also go out to everyone (far too many to mention) who has influenced me in one way or another in my views and philosophy about statistics and science, including the undergraduate and graduate students whom I have had the pleasure of teaching (and learning from) in my courses taught at the University of Montana.

This book is dedicated to all military veterans of the United States of America, past, present, and future, who teach us that all problems are relative.


1 Review of Essential Statistical Principles

Big Picture on Statistical Modeling and Inference

The purpose of statistical modeling is to both describe sample data and make inferences from that sample data to the population from which the data were drawn. We compute statistics on samples (e.g. the sample mean) and use such statistics as estimators of population parameters (e.g. the population mean). When we use the sample statistic to estimate a parameter in the population, we are engaged in the process of inference, which is why such statistics are referred to as inferential statistics, as opposed to descriptive statistics, where we are typically simply describing something about a sample or population. All of this usually occurs in an experimental design (e.g. where we have a control vs. treatment group) or a nonexperimental design (where we exercise little or no control over variables).

As an example of an experimental design, suppose you wanted to learn whether a pill was effective in reducing symptoms from a headache. You could sample 100 individuals with headaches, give them a pill, and compare their reduction in symptoms to 100 people suffering from a headache but not receiving the pill. If the group receiving the pill showed a decrease in symptomology compared with the nontreated group, it may indicate that your pill is effective. However, to estimate whether the effect observed in the sample data is generalizable and inferable to the population from which the data were drawn, a statistical test could be performed to indicate whether it is plausible that such a difference between groups could have occurred simply by chance. If it were found that the difference was unlikely to be due to chance, then we may indeed conclude a difference in the population from which the data were drawn. The probability of the data occurring under some assumption of (typically) equality is the infamous p-value, with the cutoff usually set at 0.05. If the probability of such data is relatively low (e.g. less than 0.05) under the null hypothesis of no difference, we reject the null and infer the statistical alternative hypothesis of a difference in population means.
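To make the mechanics concrete, here is a minimal SPSS syntax sketch of how such a two-group comparison could be run; the data file and the variable names symptom_reduction and pill (0 = no pill, 1 = pill) are hypothetical and not taken from the book:

T-TEST GROUPS=pill(0 1)
  /VARIABLES=symptom_reduction
  /CRITERIA=CI(.95).

In the resulting output, the “Sig. (2-tailed)” column reports the p-value that is compared against the 0.05 level discussed above.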

Much of statistical modeling follows a logic similar to that featured above: sample some data, apply a model to the data, and then estimate how well the model fits and whether there is inferential evidence to suggest an effect in the population from which the data were drawn. The actual model you will fit to your data usually depends on the type of data you are working with. For instance, if you have collected sample means and wish to test differences between means, then t-test and ANOVA techniques are appropriate. On the other hand, if you have collected data in which you would like to see if there is a linear relationship between continuous variables, then correlation and regression are usually appropriate. If you have collected data on numerous dependent variables and believe these variables, taken together as a set, represent some kind of composite variable, and wish to determine mean differences on this composite dependent variable, then a multivariate analysis of variance (MANOVA) technique may be useful. If you wish to predict group membership into two or more categories based on a set of predictors, then discriminant analysis or logistic regression would be an option. If you wished to take many variables and reduce them down to fewer dimensions, then principal components analysis or factor analysis may be your technique of choice. Finally, if you are interested in hypothesizing networks of variables and their interrelationships, then path analysis and structural equation modeling may be your model of choice (not covered in this book). There are numerous other possibilities as well, but overall, you should heed the following principle in guiding your choice of statistical analysis:

The type of statistical model or method you select often depends on the types of data you have and your purpose for wanting to build a model. There is usually not one and only one method that is possible for a given set of data. The method of choice will often be dictated by the rationale of your research. You must know your variables very well, along with the goals of your research, to diligently select a statistical model.

1.1 Variables and Types of Data

Recall that variables are typically of two kinds: dependent or response variables and independent or predictor variables. The terms “dependent” and “independent” are most common in ANOVA-type models, while “response” and “predictor” are more common in regression-type models, though their usage is not uniform to any particular methodology. The classic function statement Y = f(X) tells the story: input a value for X (the independent variable), and observe the effect on Y (the dependent variable). In an independent-samples t-test, for instance, X is a variable with two levels, while the dependent variable is a continuous variable. In a classic one-way ANOVA, X has multiple levels. In a simple linear regression, X is usually a continuous variable, and we use the variable to make predictions of another continuous variable Y. Most of statistical modeling is simply observing an outcome based on something you are inputting into an estimated equation (estimated based on the sample data).

Data come in many different forms. Though there are rather precise theoretical distinctions between different forms of data, for applied purposes we can summarize the discussion into the following types for now: (i) continuous and (ii) discrete. Variables measured on a continuous scale can, in theory, achieve any numerical value on the given scale. For instance, length is typically considered to be a continuous variable, since we can measure length to any specified numerical degree. That is, the distance between 5 and 10 in. on a scale contains an infinite number of measurement possibilities (e.g. 6.1852, 8.341364, etc.). The scale is continuous because it assumes an infinite number of possibilities between any two points on the scale and has no “breaks” in that continuum. On the other hand, if a scale is discrete, it means that between any two values on the scale, only a select number of possibilities can exist. As an example, the number of coins in my pocket is a discrete variable, since I cannot have 1.5 coins. I can have 1 coin, 2 coins, 3 coins, etc., but between those values there does not exist an infinite number of possibilities. Sometimes data are also categorical, which means values of the variable are mutually exclusive categories, such as A or B or C, or “boy” or “girl.” Other times, data come in the form of counts, where instead of measuring something like IQ, we are only counting the number of occurrences of some behavior (e.g. the number of times I blink in a minute). Depending on the type of data you have, different statistical methods will apply. As we survey what SPSS has to offer, we identify variables as continuous, discrete, or categorical as we discuss the given method. However, do not get too caught up with definitions here; there is always a bit of “fuzziness” in learning about the nature of the variables you have. For example, if I count the number of raindrops in a rainstorm, we would be hard pressed to call this “count data.” We would instead just accept it as continuous data and treat it as such. Many times you have to compromise a bit between data types to best answer a research question. Surely, the average number of people per household does not make sense, yet census reports often give us such figures on “count” data. Always remember, however, that the software does not recognize the nature of your variables or how they are measured. You have to be certain of this information going in; know your variables very well, so that you can be sure SPSS is treating them as you had planned.

Scales of measurement are also distinguished between nominal, ordinal, interval, and ratio. A nominal scale is not really measurement in the first place, since it is simply assigning labels to objects we are studying. The classic example is that of numbers on football jerseys. That one player has the number 10 and another the number 15 does not mean anything other than labels to distinguish between two players. If differences between numbers do represent magnitudes, but the differences between the magnitudes are unknown or imprecise, then we have measurement at the ordinal level. For example, that a runner finished first and another second constitutes measurement at the ordinal level. Nothing is said of the time difference between the first and second runner, only that there is a “ranking” of the runners. If differences between numbers on a scale represent equal lengths, but an absolute zero point still cannot be defined, then we have measurement at the interval level. A classic example of this is temperature in degrees Fahrenheit: the difference between 10 and 20° represents the same amount of temperature distance as that between 20 and 30°; however, zero on the scale does not represent an “absence” of temperature. When we can ascribe an absolute zero point in addition to inferring the properties of the interval scale, then we have measurement at the ratio scale. The number of coins in my pocket is an example of ratio measurement, since zero on the scale represents a complete absence of coins. The number of car accidents in a year is another variable measurable on a ratio scale, since it is possible, however unlikely, that there were no accidents in a given year.

The first step in choosing a statistical model is knowing what kind of data you have: whether they are continuous, discrete, or categorical, with some attention also devoted to whether the data are nominal, ordinal, interval, or ratio. Making these decisions can be a lot trickier than it sounds, and you may need to consult with someone for advice before selecting a model. Other times, it is very easy to determine what kind of data you have. But if you are not sure, check with a statistical consultant to help confirm the nature of your variables, because making an error at this initial stage of analysis can have serious consequences and jeopardize your data analyses entirely.

1.2 Significance Tests and Hypothesis Testing

In classical statistics, a hypothesis test is about the value of a parameter we are wishing to estimate with our sample data. Consider our previous example of the two-group problem of trying to establish whether taking a pill is effective in reducing headache symptoms. If there were no difference between the group receiving the treatment and the group not receiving the treatment, then we would expect the parameter difference to equal 0. We state this as our null hypothesis:

Null hypothesis: The mean difference in the population is equal to 0.

The alternative hypothesis is that the mean difference is not equal to 0. Now, if our sample means come out to be 50.0 for the control group and 50.0 for the treated group, then it is obvious that we do not have evidence to reject the null, since the difference of 50.0 − 50.0 = 0 aligns directly with expectation under the null. On the other hand, if the means were 48.0 vs. 52.0, could we reject the null? Yes, there is definitely a sample difference between groups, but do we have evidence for a population difference? It is difficult to say without asking the following question:

What is the probability of observing a difference such as 48.0 vs. 52.0 under the null hypothesis of no difference?

When we evaluate a null hypothesis, it is the parameter we are interested in, not the sample statistic. The fact that we observed a difference of 4 (i.e. 52.0 − 48.0) in our sample does not by itself indicate that in the population the parameter is unequal to 0. To be able to reject the null hypothesis, we need to conduct a significance test on the mean difference of 48.0 vs. 52.0, which involves computing (in this particular case) what is known as a standard error of the difference in means to estimate how likely such differences occur in theoretical repeated sampling. When we do this, we are comparing an observed difference to a difference we would expect simply due to random variation. Virtually all test statistics follow the same logic. That is, we compare what we have observed in our sample(s) to the variation we would expect under a null hypothesis or, crudely, what we would expect under simple “chance.” Virtually all test statistics have the following form:

Test statistic = observed / expected

If the observed difference is large relative to the expected difference, then we garner evidence that such a difference is not simply due to chance and may represent an actual difference in the population from which the data were drawn.
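As a concrete instance of this “observed over expected” form, the independent-samples t-statistic for the two-group mean comparison used as a running example can be written as follows (a standard textbook formula, stated here for illustration rather than reproduced from this book):

$$ t = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}}, \qquad s_{\bar{y}_1 - \bar{y}_2} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} $$

Here $\bar{y}_1 - \bar{y}_2$ is the observed mean difference, $s_p^2$ is the pooled variance estimate, and $n_1$ and $n_2$ are the group sample sizes; the denominator is the standard error of the difference, that is, the variation expected under chance alone.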

As mentioned previously, significance tests are not performed only on mean differences, however. Whenever we wish to estimate a parameter, whatever the kind, we can perform a significance test on it. Hence, when we perform t-tests, ANOVAs, regressions, etc., we are continually computing sample statistics and conducting tests of significance about parameters of interest. Whenever you see output such as “Sig.” in SPSS with a probability value underneath it, it means a significance test has been performed on that statistic, which, as mentioned already, contains the p-value. When we reject the null at, say, p < 0.05, however, we do so with a risk of either a type I or type II error. We review these next, along with significance levels.

1.3 Significance Levels and Type I and Type II Errors

Whenever we conduct a significance test on a parameter and decide to reject the null hypothesis, we do not know for certain that the null is false. We are rather hedging our bet that it is false. For instance, even if the mean difference in the sample is large, though it probably means there is a difference in the corresponding population parameters, we cannot be certain of this and thus risk falsely rejecting the null hypothesis. How much risk are we willing to tolerate for a given significance test? Historically, a probability level of 0.05 is used in most settings, though the setting of this level should depend individually on the given research context. The infamous “p < 0.05” means that the probability of the observed data under the null hypothesis is less than 5%, which implies that if such data are so unlikely under the null, perhaps the null hypothesis is actually false, and the data are more probable under a competing hypothesis, such as the statistical alternative hypothesis. The point to make here is that whenever we reject a null and conclude something about the population parameters, we could be making a false rejection of the null hypothesis. Rejecting a null hypothesis when in fact the null is not false is known as a type I error, and we usually try to limit the probability of making a type I error to 5% or less in most research contexts. On the other hand, we risk another type of error, known as a type II error. These occur when we fail to reject a null hypothesis that in actuality is false. More practically, this means that there may actually be a difference or effect in the population but that we failed to detect it. In this book, by default, we usually set the significance level at 0.05 for most tests. If the p-value for a given significance test dips below 0.05, then we will typically call the result “statistically significant.” It needs to be emphasized, however, that a statistically significant result does not necessarily imply a strong practical effect in the population.

For reasons discussed elsewhere (see Denis (2016), Chapter 3, for a thorough discussion), one can potentially obtain a statistically significant finding (i.e. p < 0.05) even if, to use our example about the headache treatment, the difference in means is rather small. Hence, throughout the book, when we note that a statistically significant finding has occurred, we often couple this with a measure of effect size, which is an indicator of just how much mean difference (or other effect) is actually present. The exact measure of effect size is different depending on the statistical method, so we explain how to interpret the given effect size in each setting as we come across it.
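For the two-group setting used as a running example, one widely used effect size measure (named here for illustration only; the book introduces the appropriate measure for each method as it arises) is the standardized mean difference, often denoted d:

$$ d = \frac{\bar{y}_1 - \bar{y}_2}{s_p} $$

Because d scales the mean difference by the pooled standard deviation $s_p$ rather than by the standard error, it does not automatically drift toward “significance” as the sample size increases, which is exactly the distinction between statistical and practical significance drawn above.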

1.4 Sample Size and Power

Power is reviewed in Chapter 6, but an introductory note about it, and how it relates to sample size, is in order. Crudely, the statistical power of a test is the probability of detecting an effect if there is an effect to be detected. A microscope analogy works well here: there may be a virus strain present under the microscope, but if the microscope is not powerful enough to detect it, you will not see it. It still exists, but you just do not have the eyes for it. In research, an effect could exist in the population, but if you do not have a powerful test to detect it, you will not spot it. Statistically, power is the probability of rejecting a null hypothesis given that it is false. What makes a test powerful? The determinants of power are discussed in Chapter 6, but for now, consider only the relation between effect size and sample size as it relates to power. All else being equal, if the effect you are trying to detect is small, you will need a larger sample size to detect it and obtain sufficient power. On the other hand, if the effect you are trying to detect is large, you can get away with a smaller sample size and achieve the same degree of power. So long as there is at least some effect in the population, then by increasing sample size indefinitely, you assure yourself of gaining as much power as you like. That is, increasing sample size all but guarantees a rejection of a null hypothesis! So, how big do you want your samples? As a rule, larger samples are better than smaller ones, but at some point, collecting more subjects increases power only minimally, and the expense associated with increasing sample size is no longer worth it. Some techniques are inherently large-sample techniques and require relatively large sample sizes. How large? For factor analysis, for instance, samples upward of 300–500 are often recommended, but the exact guidelines depend on things like the sizes of communalities and other factors (see Denis (2016) for details). Other techniques require smaller samples (e.g. t-tests and nonparametric tests). If in doubt, however, collecting larger samples is preferred, and you need never worry about having “too much” power. Remember, you are only collecting smaller samples because you cannot get a collection of the entire population, so theoretically and pragmatically speaking, larger samples are typically better than smaller ones across the board of statistical methodologies.
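In symbols, using the standard definitions rather than anything specific to this book, the quantities discussed in this and the previous section are:

$$ \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad \beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad \text{power} = 1 - \beta $$

Holding the effect size and $\alpha$ fixed, increasing the sample size decreases $\beta$ and therefore raises power, which is the formal counterpart of the microscope analogy.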


1.5 Model Assumptions

The majority of statistical tests in this book are based on a set of assumptions about the data that, if violated, compromise the validity of the inferences made. What this means is that if certain assumptions about the data are not met, or are questionable, the validity with which p-values and other inferential statistics can be interpreted is compromised. Some authors also include such things as adequate sample size as an assumption of many multivariate techniques, but we do not include such things when discussing assumptions, for the reason that large sample sizes for procedures such as factor analysis are something we see more as a requirement of good data analysis than as something assumed by the theoretical model.

We must at this point distinguish between the platonic theoretical ideal and pragmatic reality. In theory, many statistical tests assume data were drawn from normal populations, whether univariate, bivariate, or multivariate, depending on the given method. Further, multivariate methods usually assume that linear combinations of variables also arise from normal populations. But are data ever drawn from truly normal populations? No! Never! We know this right off the start because perfect normality is a theoretical ideal. In other words, the normal distribution does not “exist” in the real world in a perfect sense; it exists only in formulae and theoretical perfection. So, you may ask, if normality in real data is likely never to truly exist, why are so many inferential tests based on the assumption of normality? The answer usually comes down to convenience and desirable properties when innovators devise inferential tests. That is, it is much easier to say, “Given the data are multivariate normal, then this and that should be true.” Hence, assuming normality makes theoretical statistics a bit easier, and results are more tractable. However, when we are working with real data in the real world, samples or populations, while perhaps approximating this ideal, will never truly attain it. Hence, if we face reality up front and concede that we will never truly satisfy the assumptions of a statistical test, the quest becomes that of not violating the assumptions to any significant degree such that the test is no longer interpretable. That is, we need ways to make sure our data behave “reasonably well” so as to still apply the statistical test and draw inferential conclusions.

There is a second concern, however. Not only are assumptions likely to be violated in practice, but it is also true that some assumptions are borderline unverifiable with real data, because the data occur in higher dimensions, and verifying higher-dimensional structures is extremely difficult and is an evolving field. Again, we return to normality. Verifying multivariate normality is very difficult, and hence many times researchers will verify lower dimensions in the hope that if these are satisfied, they can induce that higher-dimensional assumptions are thus satisfied. If univariate and bivariate normality are satisfied, then we can be more certain that multivariate normality is likely satisfied. However, there is no guarantee. Hence, pragmatically, much of assumption checking in statistical modeling involves looking at lower dimensions to make sure such data are reasonably behaved. As concerns sampling distributions, often if sample size is sufficient, the central limit theorem will assure us of sampling distribution normality, which crudely says that normality will be achieved as sample size increases. For a discussion of sampling distributions, see Denis (2016).

A second assumption that is important in data analysis is that of homogeneity or homoscedasticity of variances. This means different things depending on the model. In t-tests and ANOVA, for instance, the assumption implies that the population variances of the dependent variable in each level of the independent variable are the same. The way this assumption is verified is by looking at sample data and checking to make sure the sample variances are not so different from one another as to raise concern. In t-tests and ANOVA, Levene's test is sometimes used for this purpose, or one can also use a rough rule of thumb that says if one sample variance is no more than four times another, then the assumption can be at least tentatively justified. In regression models, the assumption of homoscedasticity is usually in reference to the distribution of Y given the conditional value of the predictor(s). Hence, for each value of X, we like to assume approximately equal dispersion of values of Y. This assumption can be verified in regression through scatterplots (in the bivariate case) and residual plots in the multivariable case.
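In SPSS, Levene's test is printed automatically with the independent-samples t-test, and for a one-way ANOVA it can be requested explicitly; a minimal sketch, with dv and group as hypothetical variable names, is:

ONEWAY dv BY group
  /STATISTICS=DESCRIPTIVES HOMOGENEITY.

The “Test of Homogeneity of Variances” table in the resulting output reports Levene's statistic and its p-value, which can be read alongside the rough four-to-one rule of thumb mentioned above.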

A third assumption, perhaps the most important, is that of independence. The essence of this assumption is that observations at the outset of the experiment are not probabilistically related. For example, when recruiting a sample for a given study, if observations appearing in one group “know each other” in some sense (e.g. friendships), then knowing something about one observation may tell us something about another in a probabilistic sense. This violates independence. In regression analysis, independence is violated when errors are related with one another, which occurs quite frequently in designs featuring time as an explanatory variable. Independence can be very difficult to verify in practice, though residual plots are again helpful in this regard. Oftentimes, however, it is the very structure of the study and the way data were collected that will help ensure this assumption is met. When you recruited your sample data, did you violate independence in your recruitment procedures?

The following is a final thought for now regarding assumptions, along with some recommendations. While verifying assumptions is important and a worthwhile activity, one can easily get caught up in spending too much time and effort seeking an ideal that will never be attainable. In consulting on statistics for many years now, more than once I have seen students and researchers obsess and ruminate over a distribution that was not perfectly normal and try data transformation after data transformation to try to “fix things.” I generally advise against such an approach, unless of course there are serious violations, in which case remedies are needed. But keep in mind as well that a violation of an assumption may not simply indicate a statistical issue; it may hint at a substantive one. A highly skewed distribution, for instance, one that goes contrary to what you expected to obtain, may signal a data collection issue, such as a bias in your data collection mechanism. Too often researchers will try to fix the distribution without asking why it came out as “odd ball” as it did. As a scientist, your job is not to appease statistical tests. Your job is to learn of natural phenomena and use statistics as a tool in that venture. Hence, if you suspect an assumption is violated and are not quite sure what to do about it, or whether it requires any remedy at all, my advice is to check with a statistical consultant to get some direction before you transform all your data and make a mess of things! The bottom line, too, is that if you are interpreting p-values so obsessively as to be concerned that a violation of an assumption might increase or decrease the p-value by minuscule amounts, you are probably overly focused on p-values and need to start looking at the science (e.g. effect size) of what you are doing. Yes, a violation of an assumption may alter your true type I error rate, but if you are that focused on the exact level of your p-value, then from a scientific perspective that is the problem, not the potential violation of the assumption. Having said all the above, I summarize with four pieces of advice regarding how to proceed, in general, with regard to assumptions:

1) If you suspect a light or minor violation of one of your assumptions, determine a potential source of the violation and whether your data are in error. Correct errors if necessary. If no errors in data collection were made, and if the assumption violation is generally light (after checking through plots and residuals), you are probably safe to proceed and interpret the results of inferential tests without any adjustments to your data.


2) If you suspect a heavy or major violation of one of your assumptions, and it is “repairable” (to the contrary, if independence is violated during the process of data collection, it is very difficult or impossible to repair), you may consider one of the many data transformations available, assuming the violation was not due to the true nature of your distributions. For example, learning that most of your subjects responded “zero” to the question of how many car accidents occurred to them last month is not a data issue: do not try to transform such data to ease the positive skew! Rather, the correct course of action is to choose a different statistical model and potentially reoperationalize your variable from a continuous one to a binary or polytomous one.

3) If your violation, either minor or major, is not due to a substantive issue, and you are not sure whether to transform or not transform the data, you may choose to analyze your data with and then without the transformation, and compare results. Did the transformation influence the decision on the null hypothesis? If so, then you may assume that performing the transformation was worthwhile and keep it as part of your data analyses. This does not imply that you should “fish” for statistical significance through transformations. All it means is that if you are unsure of the effect of a violation on your findings, there is nothing wrong with trying things out with the original data and then the transformed data to see how much influence the violation carries in your particular case.

4) A final option is to use a nonparametric test in place of a parametric one and, as in (3), compare results in both cases (see the sketch after this list). If normality is violated, for instance, there is nothing wrong with trying out a nonparametric test to supplement your parametric one to see if the decision on the null changes. Again, I am not recommending “fishing” for the test that will give you what you want to see (e.g. p < 0.05). What I am suggesting is that comparing results from parametric and nonparametric tests can sometimes give you an inexact, but still useful, measure of the severity (in a very crude way) of the assumption violation. Chapter 14 reviews select nonparametric tests.
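As an illustration of option (4), a parametric one-way ANOVA and its nonparametric Kruskal–Wallis counterpart can be run back to back and their decisions compared; dv and group are hypothetical variable names, and (1 3) assumes the grouping variable is coded 1 through 3:

ONEWAY dv BY group.
NPAR TESTS /K-W=dv BY group(1 3).

If both analyses lead to the same decision about the null hypothesis, the violation is probably not driving the conclusion; if they disagree, the assumption deserves a closer look.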

Throughout the book, we do not verify each assumption for each analysis we conduct, both to save space and because it detracts a bit from communicating how the given tests work. Further, many of our analyses are on very small samples for convenience, and so verifying parametric assumptions is unrealistic from the outset. However, for each test you conduct, you should be generally aware that it comes with a package of assumptions, and explore those assumptions as part of your data analyses; if in doubt about one or more assumptions, consult with someone with more expertise on the severity of any said violation and what kind of remedy may (or may not) be needed. In general, get to know your data before conducting inferential analyses, and keep a close eye out for moderate-to-severe assumption violations.

Many of the topics discussed in this brief introductory chapter are reviewed in textbooks such as Howell (2002) and Kirk (2008).


2 Introduction to SPSS

In this second chapter, we provide a brief introduction to the SPSS software, version 22.0. IBM SPSS provides a host of online manuals that contain the complete capabilities of the software, and, beyond brief introductions such as this one, they should be consulted for specifics about its programming options. These can be downloaded directly from IBM SPSS's website. Whether you are using version 22.0 or an earlier or later version, most of the features discussed in this book will be consistent from version to version, so there is no cause for alarm if the version you are using is not the one featured in this book. This is a book on using SPSS in general, not a specific version. Most software upgrades of SPSS versions are not that different from previous versions, though you are encouraged to keep up to date with SPSS bulletins regarding upgrades or corrections (i.e. bugs) to the software. We survey only select possibilities that SPSS has to offer in this chapter and the next, enough to get you started performing data analysis quickly on the host of models featured in this book. For further details on data management in SPSS not covered in this chapter or the next, you are encouraged to consult Kulas (2008).

2.1 How to Communicate with SPSS

There are basically two ways a user can communicate with SPSS: through syntax commands entered directly in the SPSS syntax window and through point-and-click commands via the graphical user interface (GUI). Conducting analyses via the GUI is sufficient for most essential tasks featured in this book. However, as you become more proficient with SPSS and may require advanced computing commands for your specific analyses, manually entering syntax code may become necessary, or even preferable once you become more experienced at programming. In this introduction, we feature analyses performed through both syntax commands and the GUI. In reality, the GUI is simply a reflection of the syntax operations that are taking place “behind the scenes,” which SPSS has automated through easy-to-access applications, similar to how selecting an app on your cell phone is a type of fast shortcut to get you where you want to go. The user should understand from the outset, however, that there are things one can do using syntax that cannot automatically be performed through the GUI (just like on your phone, there is not an app for everything!), so it behooves one to learn at least elementary programming skills at some point if one is going to work extensively in the field of data analysis. In this book, we show as much as possible the window commands for obtaining output and, in many places, feature the representative syntax should you ever need to adjust it to customize your analysis for the given problem you are confronting. One word of advice: do not be intimidated when you see syntax, since, as mentioned, for the majority of analyses presented in this book, you will not need to use it specifically. However, by seeing the syntax corresponding to the window commands you are running, it will help “demystify” what SPSS is actually doing, and then through trial and error (and SPSS's documentation and manuals), the day may come when you are adjusting syntax on your own for the purpose of customizing your analyses, such as one regularly does in software packages such as R or SAS, where typing in commands and running code is the habitual way of proceeding.
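For example, the GUI path ANALYZE → DESCRIPTIVE STATISTICS → DESCRIPTIVES, with the Paste button pressed instead of OK, generates syntax of roughly the following form (shown here for the IQ-data variables used later in this chapter):

DESCRIPTIVES VARIABLES=verbal quant analytic
  /STATISTICS=MEAN STDDEV MIN MAX.

Running this from the syntax window produces the same output as the dialog, which is a convenient way to start learning the language behind the menus.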

2.2 Data View vs. Variable View

When you open SPSS, you will find two choices for SPSS's primary window: Data View vs. Variable View (both contrasted in Figure 2.1). The Data View is where you will manually enter data into SPSS, whereas the Variable View is where you will do such things as enter the names of variables, adjust the numerical width of variables, and provide labels for variables.

The case numbers in SPSS are listed along the left-hand column. For instance, in Figure 2.1, in the Data View (left), approximately 28 cases are shown; in the Variable View, 30 cases are shown. Entering data into SPSS is very easy. As an example, consider the following small hypothetical data set on verbal, quantitative, and analytical scores for a group of students on a standardized “IQ test” (scores range from 0 to 100, where 0 indicates virtually no ability and 100 indicates very much ability). The “group” variable denotes whether students have studied “none” (0), “some” (1), or “much” (2). Entering data into SPSS is no more complicated than what we have done above, and barring a few adjustments, we could easily go ahead and start conducting analyses on our data immediately. Before we do so, let us have a quick look at a few of the features in the Variable View for these data and how to adjust them.

Figure 2.1 SPSS Data View (left) vs. Variable View (right).


Let us take a look at a few of the above column headers in the Variable View:

Name – this is the name of the variable we have entered.

Type – if you click on Type (in the cell), SPSS will open the following window:

Verify for yourself that you are able to read the data correctly. The first person (case 1) in the data set scored 56.00 on verbal, 56.00 on quant, and 59.00 on analytic and is in group 0, the group that studied “none.” The second person (case 2) scored 59.00 on verbal, 42.00 on quant, and 54.00 on analytic and is also in group 0. The 11th individual in the data set scored 66.00 on verbal, 55.00 on quant, and 69.00 on analytic and is in group 1, the group that studied “some” for the evaluation.
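Although the book enters these data through the Data View grid, the same data set could also be defined entirely in syntax; the sketch below is not from the book and shows only the first two cases described above, with the remaining rows entered in the same way:

DATA LIST FREE / verbal quant analytic group.
BEGIN DATA
56 56 59 0
59 42 54 0
END DATA.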

Notice that under Variable Type are many options. We can specify the variable as numeric (the default choice) or comma or dot, along with specifying the width of the variable and the number of decimal places we wish to carry for it (right-hand side of the window). We do not explore these options in this book, for the reason that for most analyses you conduct using quantitative variables, the numeric variable type will be appropriate, and specifying the width and number of decimal places is often a matter of taste or preference rather than one of necessity. Sometimes, instead of numbers, data come in the form of words, which makes the “string” option appropriate. For instance, suppose that instead of “0 vs. 1 vs. 2” we had actually entered “none,” “some,” or “much.” We would have selected “string” to represent our variable (which I am calling “group_name” to differentiate it from “group” [see below]).
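One way to create the string version group_name described above directly in syntax (a sketch of one possibility, not taken from the book) is to declare a string variable and recode the numeric codes into it:

STRING group_name (A8).
RECODE group (0='none') (1='some') (2='much') INTO group_name.
EXECUTE.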


Having entered our data, we could begin conducting analyses immediately. However, sometimes researchers wish to attach value labels to their data if they are using numbers to code categories. This can easily be accomplished by selecting the Values tab. For example, we will do this for our group variable:
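The Values dialog generates syntax equivalent to the following one-liner (a sketch of the pasted command, using the codes and labels given above):

VALUE LABELS group 0 'none' 1 'some' 2 'much'.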

There are a few other options available in Variable View, such as Missing, Columns, and Measure, but we leave them for now, as they are not vital to getting started. If you wish, you can access the Measure tab and record whether your variable is nominal, ordinal, or interval/ratio (known as scale in SPSS), but so long as you know how you are treating your variables, you need not record this in SPSS. For instance, if you have nominal data with categories 0 and 1, you do not need to tell SPSS the variable is nominal; you can simply select statistical routines that require this variable to be nominal and interpret it as such in your analyses.

Whether we use words or numbers to categorize this variable makes little difference, so long as we ourselves are aware of what the variable is and how we are using it. For instance, that we coded group from 0 to 2 is fine, so long as we know these numbers represent categories rather than true measured quantities. Had we incorrectly analyzed the data such that 0 to 2 is assumed to exist on a continuous scale rather than represent categories, we would risk ensuing analyses (e.g. an analysis of variance) being performed incorrectly.

2.3 Missing Data in SPSS: Think Twice Before Replacing Data!

Ideally, when you collect data for an experiment or study, you are able to collect measurements from every participant, and your data file will be complete. However, missing data often occur. For example, suppose our IQ data set, instead of appearing nice and complete, had a few missing observations:


We can see that for cases 8, 13, and 18, we have missing data. SPSS offers many capabilities for replacing missing data, but if they are to be used at all, they should be used with extreme caution. Any attempt to replace a missing data point, regardless of the approach used, is nonetheless an educated “guess” at what that data point may have been had the participant answered or had it not gone missing. Presumably, the purpose of your scientific investigation was to do science, which means making measurements on objects in nature. In conducting such a scientific investigation, the data are your only true link to what you are studying. Replacing a missing value means you are prepared to “guesstimate” what the observation is, which means it is no longer a direct reflection of your measurement process. In some cases, such as in repeated measures or longitudinal designs, avoiding missing data is difficult, because participants may drop out of longitudinal studies or simply stop showing up. However, that does not necessarily mean you should automatically replace their values. Get curious about your missing data. For our IQ data, though we may be able to attribute the missing observations for cases 8 and 13 as possibly “missing at random,” it may be harder to draw this conclusion regarding case 18, since for that case, two points are missing. Why are they missing? Did the participant misunderstand the task? Was the participant or object given the opportunity to respond? These are the types of questions you should ask before contemplating and carrying out a missing data routine in SPSS. Hence, before we survey methods for replacing missing data, you should heed the following principle:

Never, ever, replace missing data as an ordinary and usual process of data analysis. Ask yourself first WHY the data point might be missing and whether it is missing “at random” or was due to some systematic error or omission in your experiment. If it was due to some systematic pattern, or the participant misunderstood the instructions or was not given full opportunity to respond, that is a quite different scenario than if the observation is missing at random due to chance factors. If missing at random, replacing missing data is, generally speaking, more appropriate than if there is a systematic pattern to the missing data. Get curious about your missing data instead of simply seeking to replace it.

Let us survey a couple of approaches to replacing missing data. We will demonstrate these procedures for our quant variable. To access the feature:

TRANSFORM → REPLACE MISSING VALUES


In this first example, we will replace the missing observation with the series mean. Move quant over to New Variable(s). SPSS will automatically rename the variable “quant_1,” but underneath that, be sure Series mean is selected. The series mean is defined as the mean of all the other observations for that variable. The mean for quant is 66.89 (verify this yourself via Descriptives). Hence, if SPSS is replacing the missing data correctly, the new value imputed for cases 8 and 18 should be 66.89. Click on OK:

RMV /quant_1=SMEAN(quant).

[SPSS output: Replace Missing Values result table for quant_1, reporting the number of replaced missing values, the case numbers of the first and last non-missing values, the number of valid cases, and the creating function, SMEAN(quant).]

● SPSS provides us with a brief report revealing that two missing values were replaced (for cases 8 and 18, out of 30 total cases in our data set). The Creating Function is SMEAN for quant (meaning the “series mean” of the quant variable).

● In the Data View, SPSS shows us the new variable created with the missing values replaced (I circled them manually to show where they are).

Another option offered by SPSS is to replace with the mean of nearby points. For this option, under Method, select Mean of nearby points, and click on Change to activate it in the New Variable(s) window (you will notice that quant becomes MEAN[quant 2]). Finally, under Span of nearby points, we will use the number 2 (which is the default). This means SPSS will take the two valid observations above the given case and the two below it, and use that average as the replaced value. Had we chosen Span of nearby points = 4, it would have taken the mean of the four points above and four points below. This is what SPSS means by the mean of “nearby points.”

● We can see that SPSS, for case 8, took the mean of the two cases above and the two cases below the given missing observation and replaced it with that mean. That is, the number 47.25 was computed by averaging 50.00 + 54.00 + 46.00 + 39.00, which, when that sum is divided by 4, gives 47.25.

● For case 18, SPSS took the mean of observations 74, 76, 82, and 74 and averaged them to equal 76.50, which is the imputed missing value.


Replacing with the mean as we have done above is an easy way of doing it, though it is often not the most preferred approach (see Meyers et al. (2013) for a discussion). SPSS offers other alternatives, including replacing with the median instead of the mean, as well as linear interpolation and more sophisticated methods such as maximum likelihood estimation (see Little and Rubin (2002) for details). SPSS also offers some useful applications for evaluating missing data patterns through Missing Value Analysis and Multiple Imputation.

As an example of SPSS's ability to identify patterns in missing data and replace these values using imputation, we can perform the following (see Leech et al. (2015) for more details on this approach):

ANALYZE → MULTIPLE IMPUTATION → ANALYZE PATTERNS

[SPSS output: Missing Value Patterns chart, displaying each observed pattern of missingness, including a final row showing the pattern in which both variables are missing.]

The pattern analysis can help you identify whether there are any systematic features to the missingness or whether you can assume it is random. SPSS will allow us to replace the above missing values through the following:

MULTIPLE IMPUTATION → IMPUTE MISSING DATA VALUES

● Move the variables of interest over to the Variables in Model side.

● Adjust Imputations to 5 (you can experiment with greater values, but for demonstration, keep it at 5).


● SPSS requires us to name a new file that will contain the updated data (now including the filled-in values). We named our data set “missing.” This will create a new file in our session called “missing.”

● Under the Method tab, we select Custom and Fully Conditional Specification (MCMC) as the method of choice.

● We set the Maximum Iterations at 10 (which is the default).

● Select Linear Regression as the Model type for scale variables.

● Under Output, check off Imputation model and Descriptive statistics for variables with imputed values.

[SPSS output: Imputation Models table for the Fully Conditional Specification method, listing the imputed dependent variables quant and analytic with their model effects (verbal, analytic and verbal, quant, respectively), 2 missing values each, and 10 imputed values each; variables with no missing values are not imputed.]

The above summary is of limited use. What is more useful is to look at the accompanying file that was created, named “missing.” This file now contains six data sets, one being the original data and five containing imputed values. For example, we contrast the original data and the first imputation below:


We can see that the procedure replaced the missing data points for cases 8, 13, and 18. Recall, however, that the values above are only the first imputation. We asked SPSS to produce five imputations, so if you scroll down the file, you will see the remaining imputed data sets. SPSS also provides us with a summary of the imputations in its output:

[SPSS syntax and output: a DESCRIPTIVES command with /STATISTICS=MEAN STDDEV MIN MAX, followed by a Descriptive Statistics table broken down by Imputation Number, reporting N, minimum, maximum, mean, and standard deviation for the original data and for each of the five imputed data sets, together with the pooled results.]

SPSS first gives us the original data, on which there are 30 complete cases for verbal and 28 complete cases each for analytic and quant, before the imputation algorithm goes to work on replacing the missing data. SPSS then created, as per our request, five new data sets, each time imputing values for quant and analytic. We see that N has increased to 30 for each data set, and SPSS gives descriptive statistics for each data set. The pooled means across all imputed data sets for analytic and quant are now 71.34 and 66.69, respectively, which were computed by summing the means of the five new data sets and dividing by 5.


Let us try an ANOVA on the new file:

ONEWAY quant BY group

(The ANOVA output reports results for the original data and for each of the five imputations. For the original data, the between-groups df = 2 and within-groups df = 25, with F = 66.31 and p < .001 (Sums of Squares: 8368.807 between groups, 1866.609 within groups, 10235.416 total). For each of the five imputed data sets, the between-groups df = 2 and within-groups df = 27, with F ranging from approximately 51.7 to 76.1 and p < .001 in every case.)
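Because the imputed file stacks the original data and the five imputations together, one way to reproduce the per-imputation ANOVAs shown above is to split the file on the imputation marker variable before running ONEWAY. A rough sketch, assuming SPSS has named that marker variable Imputation_ (as it typically does for multiply imputed data sets):

SORT CASES BY Imputation_.
SPLIT FILE LAYERED BY Imputation_.
ONEWAY quant BY group.
SPLIT FILE OFF.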

This is as far as we go with our brief discussion of missing data. We close this section by reiterating the warning: be very cautious about replacing missing data. Statistically it may seem like a good thing to do in order to obtain a more complete data set, but scientifically it means you are guessing (albeit in a somewhat sophisticated, estimated fashion) at what the missing values are. If you do not replace missing data, then common methods of handling cases with missing data include listwise and pairwise deletion. Listwise deletion excludes cases with missing data on any variable in the variable list, whereas pairwise deletion excludes cases only on those variables for which the given analysis is being conducted. For instance, if a correlation is run on two variables that do not have missing data, the correlation will compute on all cases even though, for other variables, missing data may exist (try a few correlations on the IQ data set with missing data to see for yourself). For most of the procedures in this book, especially multivariate ones, listwise deletion is usually preferred over pairwise deletion (see Meyers et al. (2013) for further discussion).
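To try this yourself, a sketch of the correlation syntax under each missing-data treatment is shown below (variable names are taken from the IQ data set used in this chapter):

CORRELATIONS
  /VARIABLES=verbal quant analytic
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

CORRELATIONS
  /VARIABLES=verbal quant analytic
  /PRINT=TWOTAIL NOSIG
  /MISSING=LISTWISE.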

SPSS gives us the ANOVA results for each imputation, revealing that, regardless of the imputation, each analysis supports rejecting the null hypothesis. We have evidence that there are mean group differences on quant.

A one-way analysis of variance (ANOVA) was performed comparing students' quantitative performance, measured on a continuous scale, based on how much they studied (none, some, or much). Total sample size was 30, with each group having 10 observations. Two cases (8 and 18) were missing values on quant. SPSS's Fully Conditional Specification was used to impute values for this variable, requesting five imputations. Each imputation resulted in ANOVAs that rejected the null hypothesis of equal population means (p < 0.001). Hence, there is evidence to suggest that quant performance is a function of how much a student studies for the evaluation.


3
Exploratory Data Analysis, Basic Statistics, and Visual Displays

Due to SPSS's high-speed computing capabilities, a researcher can conduct a variety of exploratory analyses to immediately get an impression of their data, as well as compute a number of basic summary statistics. SPSS offers many options for graphing data and generating a variety of plots. In this chapter, we survey and demonstrate some of these exploratory analyses in SPSS. What we present here is merely a glimpse at the capabilities of the software, and we show only the most essential functions for helping you make quick and immediate sense of your data.

3.1 Frequencies and Descriptives

Before conducting formal inferential statistical analyses, it is always a good idea to get a feel for one's data by conducting so-called exploratory data analyses. We may also be interested in conducting exploratory analyses simply to confirm that our data have been entered correctly. Regardless of the purpose, it is always a good idea to get very familiar with one's data before analyzing it in any significant way. Never simply enter data and conduct formal analyses without first exploring all of your variables, ensuring assumptions of analyses are at least tentatively satisfied, and ensuring your data were entered correctly.



SPSS offers a number of options for conducting a variety of data summary tasks. For example, suppose we wanted to simply observe the frequencies of different scores on a given variable. We could accomplish this using the Frequencies function:

ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES

As a demonstration, we will obtain frequency information for the variable verbal, along with a

number of other summary statistics. Select Statistics and then the options on the right:


We have selected Quartiles under Percentile Values and Mean, Median, Mode, and Sum under Central Tendency. We have also requested the dispersion statistics Std. Deviation, Variance, Range, Minimum, and Maximum and the distribution statistics Skewness and Kurtosis. We click on Continue and OK to see our output (below is the corresponding syntax for generating the above; remember, you do not need to enter the syntax below, we are showing it only so you have it available should you ever wish to work with syntax instead of GUI commands):
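(The pasted syntax shown below is a reconstruction matching the options selected above; the exact subcommand order your SPSS version generates may differ slightly.)

FREQUENCIES VARIABLES=verbal
  /NTILES=4
  /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM MEAN MEDIAN MODE SUM SKEWNESS SESKEW KURTOSIS SEKURT
  /ORDER=ANALYSIS.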

(The Statistics table for verbal reports: N = 30 valid, 0 missing; Mean = 72.8667; Median = 73.5000; Mode = 56.00; Std. Deviation = 12.97407; Variance = 168.326; Skewness = −.048 with std. error .427; Kurtosis = −.693 with std. error .833; Range = 49.00; Minimum = 49.00; Maximum = 98.00; Sum = 2186.00; and percentiles of 62.7500 (25th), 73.5000 (50th), and 84.2500 (75th).)

A few comments on these descriptives:

● There are a total of 30 cases (N = 30), with no missing values (0).

● The Mean is equal to 72.87 and the Median to 73.50. The Mode (most frequently occurring score) is equal to 56.00 (though multiple modes exist for this variable).

● The Standard Deviation is the square root of the Variance and is equal to 12.97. This gives an idea of how much dispersion is present in the variable. For example, a standard deviation equal to 0 would mean all values for verbal are the same; the standard deviation cannot be negative, and values increasingly greater than 0 indicate increasingly more variability.

● The distribution is slightly negatively skewed, since the Skewness of −0.048 is less than zero. The fact that the mean is less than the median is also evidence of a slightly negatively skewed distribution. Skewness of 0 indicates no skew, and positive values indicate positive skew.

● Kurtosis is equal to −0.693, suggesting that observations cluster less around a central point and the distribution has relatively thin tails compared with what we would expect in a normal distribution (SPSS, 2017). Such distributions are often referred to as platykurtic.

● The range is equal to 49.00, computed as the highest score in the data minus the lowest score (98.00 – 49.00 = 49.00).

● The sum of all the data is equal to 2186.00.

● The scores at the 25th, 50th, and 75th percentiles are 62.75, 73.50, and 84.25, respectively. Notice that the 50th percentile corresponds to the same value as the median.


SPSS then provides us with the frequency information for verbal:

We can also obtain some basic descriptive statistics via Descriptives:

ANALYZE → DESCRIPTIVE STATISTICS → DESCRIPTIVES

(The frequency table for verbal lists each observed score with its Frequency, Percent, Valid Percent, and Cumulative Percent. Most scores occur once (3.3% of cases each) and a few occur twice (6.7%), with the Cumulative Percent column reaching 100.0 at the highest score.)

We can see from the output that the value of 49.00 occurs a single time in the data set (Frequency = 1) and consists of 3.3% of cases. The value of 51.00 occurs a single time as well and denotes 3.3% of cases. The cumulative percent for these two values is 6.7%, which consists of the value of 51.00 along with the value before it of 49.00. Notice that the total cumulative percent adds up to 100.0.

After moving verbal to the Variables window, select Options. As we did with the Frequencies function, we select a variety of summary statistics. Click on Continue and then OK.
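A sketch of the corresponding Descriptives syntax, reconstructed to match the options selected (the exact keywords SPSS pastes may differ slightly):

DESCRIPTIVES VARIABLES=verbal
  /STATISTICS=MEAN STDDEV VARIANCE RANGE MIN MAX KURTOSIS SKEWNESS.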


Our output follows:

Descriptive Statistics

                      N     Range   Minimum   Maximum     Mean     Std. Deviation   Variance   Skewness (Std. Error)   Kurtosis (Std. Error)
verbal               30     49.00     49.00     98.00    72.8667       12.97407      168.326        −.048 (.427)            −.693 (.833)
Valid N (listwise)   30

3.2 The Explore Function

A very useful function in SPSS for obtaining descriptives as well as a host of summary plots is the EXPLORE function:

ANALYZE → DESCRIPTIVE STATISTICS → EXPLORE

Move verbal over to the Dependent List and group to the Factor List. Since group is a categorical (factor) variable, this means that SPSS will provide us with summary statistics and plots for each level of the grouping variable.

Under Statistics, select Descriptives, Outliers, and Percentiles. Then under Plots, we will select, under Boxplots, Factor levels together, then under Descriptive, Stem-and-leaf and Histogram. We will also select Normality plots with tests:
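A sketch of the corresponding Explore syntax, reconstructed to match these selections (SPSS may paste additional subcommands, for example for the requested percentiles):

EXAMINE VARIABLES=verbal BY group
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.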


SPSS generates the following output:

(The Case Processing Summary shows that, for verbal, each level of group (.00, 1.00, 2.00) has N = 10 valid cases, 0 missing, and 10 total, i.e. 100% valid.)

The Case Processing Summary above simply reveals the variable we are subjecting to analysis (verbal) along with the numbers per level of group (0, 1, 2). We confirm that SPSS is reading our data file correctly, as there are N = 10 per group.

(The Descriptives table reports, for each level of group, the mean, the 95% confidence interval for the mean, the 5% trimmed mean, the median, the variance, the standard deviation, the minimum, maximum, range, interquartile range, skewness, and kurtosis for verbal. For example, group 1.00 has a mean of 73.10 with standard deviation 5.38, and group 2.00 has a mean of 86.30 with standard deviation 6.75.)

Notice that SPSS provides statistics for verbal at each group level (0, 1, 2). For group = 0.00, we note the following:

● The arithmetic Mean is equal to 59.2, with a standard error of 2.44 (we will discuss standard errors in later chapters).

● The 95% Confidence Interval for the Mean has limits of 53.67 and 64.73. That is, in 95% of samples drawn from this population, the true population mean is expected to lie between this lower and upper limit.

● The 5% Trimmed Mean is the mean adjusted by deleting the upper and lower 5% of cases in the tails of the distribution. If the trimmed mean is very different from the arithmetic mean, it could indicate the presence of outliers.

● The Median, which represents the score that is the middle point of the distribution, is equal to 57.5. This means that half of the distribution lies below this value, while half of the distribution lies above it.

● The Variance of 59.73 is the average sum of squared deviations from the arithmetic mean and provides a measure of how much dispersion (in squared units) exists for the variable. A variance of 0 (zero) indicates no dispersion.

● The Standard Deviation of 7.73 is the square root of the variance and is thus measured in the original units of the variable (rather than in squared units such as the variance).

● The Minimum and Maximum values of the data are also given, equal to 49.00 and 74.00, respectively.

● The Range of 25.00 is computed by subtracting the lowest score in the data from the highest (i.e. 74.00 – 49.00 = 25.00).


(The Extreme Values table lists the five highest and five lowest verbal scores, with their case numbers, at each level of group. For group .00 the highest values are 74.00, 68.00, 63.00, 62.00, and 59.00 and the lowest are 49.00, 51.00, 54.00, 56.00, and 56.00; for group 1.00 the highest are 84.00, 79.00, 75.00, 74.00, and 73.00 and the lowest are 66.00, 68.00, 69.00, 70.00, and 73.00; for group 2.00 the highest are 98.00, 94.00, 92.00, 86.00, and 86.00 and the lowest are 76.00, 79.00, 82.00, 85.00, and 85.00. Footnotes indicate that only a partial list of cases with the value 73.00 is shown in the tables of upper and lower extremes.)

(The Tests of Normality table reports, for each level of group, the Kolmogorov–Smirnov statistic with the Lilliefors significance correction and the Shapiro–Wilk statistic. The Kolmogorov–Smirnov statistics for groups .00, 1.00, and 2.00 are .161, .162, and .218, each with df = 10 and significance values of .200, .200, and .197, where the asterisk denotes a lower bound of the true significance. The Shapiro–Wilk statistics (e.g. .962 and .948) have significance values (e.g. .639 and .809) that are all well above .05.)

● The Interquartile Range is computed as the third quartile (Q3) minus the first quartile (Q1) and hence is a rough measure of how much variation exists in the inner part of the distribution (i.e. between Q1 and Q3).

● The Skewness index of 0.656 suggests a slight positive skew (skewness of 0 means no skew, and negative numbers indicate a negative skew). The Kurtosis index of −0.025 indicates a slight "platykurtic" tendency (crudely, a bit flatter and with thinner tails than a normal or "mesokurtic" distribution).

SPSS also reports Extreme Values, giving the five lowest and five highest values in the data at each level of the group variable. In addition, SPSS provides the Kolmogorov–Smirnov and Shapiro–Wilk tests of normality. Crudely, these both test the null hypothesis that the sample data arose from a normal population. We wish to not reject the null hypothesis and hence desire a p-value greater than the typical 0.05.


Below are histograms of verbal for each level of the group variable. Along with each plot is given the mean, standard deviation, and N per group. Since our sample size per group is very small, it is rather difficult to assess normality per cell (group), but at minimum, we do not notice any gross violation of normality. We can also see from the histograms that each level contains at least some variability, which is important to have for statistical analyses (if you have a distribution with virtually no variability, then it restricts the kinds of statistical analyses you can do or whether analyses can be done at all).

(Histograms of verbal are shown for each level of group, each based on N = 10, with the group mean and standard deviation printed beside each plot; for example, the plot labeled Mean = 73.10, Std. Dev. = 5.384 corresponds to group 1.00.)

The following are what are known as Stem-and-leaf Plots. These are plots that depict the distribution of scores similarly to a histogram (turned sideways), but where one can see each number in each distribution. They are a kind of "naked histogram" on its side. For these data, SPSS again plots them for each level of the group variable:

verbal Stem-and-Leaf Plot for group = .00

 Frequency    Stem &  Leaf
     1.00        4 .  9
     5.00        5 .  14669
     3.00        6 .  238
     1.00        7 .  4

 Stem width:     10.00
 Each leaf:      1 case(s)

verbal Stem-and-Leaf Plot for group = 1.00

 Frequency    Stem &  Leaf
     3.00        6 .  689
     4.00        7 .  0334
     2.00        7 .  59
     1.00        8 .  4

 Stem width:     10.00
 Each leaf:      1 case(s)

verbal Stem-and-Leaf Plot for group = 2.00

 Frequency    Stem &  Leaf
     2.00        7 .  69
     1.00        8 .  2
     4.00        8 .  5566
     2.00        9 .  24
     1.00        9 .  8

 Stem width:     10.00
 Each leaf:      1 case(s)

Let us inspect the first plot (group = 0) to explain how it is constructed. The first value in the data for group = 0 has a frequency of 1.00. The score is that of 49. How do we know it is 49? Because "4" is the stem and "9" is the leaf. Notice that below the plot is given the stem width, which is 10.00. What this means is that the stems correspond to "tens" in the digit placement. Recall that, from


right to left before the decimal point, the digit positions are ones, tens, hundreds, thousands, etc. SPSS also tells us that each leaf consists of a single case (1 case(s)), which means the "9" represents a single case. Look down now at the next row; we see there are five values with stems of 5. What are the values? They are 51, 54, 56, 56, and 59. The rest of the plots are read in a similar manner.

To confirm that you are reading the stem-and-leaf plots correctly, it is always a good idea to match up some of the values with your raw data simply to make sure what you are reading is correct. With more complicated plots, sometimes discerning what is the stem vs. what is the leaf can be a bit tricky!

Below are what are known as Q–Q Plots. As requested, SPSS also prints these out for each level of the group variable. These plots essentially compare observed values of the variable with the values expected under the condition of normality. That is, if the distribution follows a normal distribution, then observed values should line up nicely with expected values; points should fall approximately on the line, otherwise the distribution is not perfectly normal. All of our distributions below look at least relatively normal (they are not perfect, but not too bad).


To the left are what are called Box-and-whisker Plots. For our data, they represent a summary of verbal at each level of the grouping variable. If you are not already familiar with boxplots, a detailed explanation is given in the box below, "How to Read a Box-and-whisker Plot." As we move from group = 0 to group = 2, the medians increase. That is, it would appear that those who receive much training do better (median-wise) than those who receive some vs. those who receive none.


3.3 What Should I Do with Outliers? Delete or Keep Them?

In our review of boxplots, we mentioned that any point that falls below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR may be considered an outlier. Criteria such as these are often used to identify extreme observations, but you should know that what constitutes an outlier is rather subjective, and not quite as simple as a boxplot (or other criterion) makes it sound. There are many competing criteria for defining outliers, the boxplot definition being only one of them. What you need to know is that it is a mistake to compute an outlier by any statistical criterion, whatever the kind, and simply delete it from your data. This would be dishonest data analysis and, even worse, dishonest science. What you should do is consider the data point carefully and determine, based on your substantive knowledge of the area under study, whether the data point could reasonably have been expected to arise from the population you are studying. If the answer to this question is yes, then you would be wise to keep the data point in your distribution. However, since it is an extreme observation, you may also choose to perform the analysis with and without the outlier to compare its impact on your final model results. On the other hand, if the extreme observation is the result of a miscalculation or a data error,

How to Read a Box‐and‐whisker Plot

Consider the plot below, with normal densities given below the plot; the plot labels Q1, Q3, the IQR, and the fences at Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.

● Q1 and Q3 represent the 25th and 75th percentiles, respectively. Note that the median is often referred to as Q2 and corresponds to the 50th percentile.

● IQR corresponds to "Interquartile Range" and is computed by Q3 − Q1. The semi-interquartile range (not shown) is computed by dividing this difference in half (i.e. [Q3 − Q1]/2).

● On the leftmost side of the plot is Q1 − 1.5 × IQR. This corresponds to the lowermost "inner fence." Observations that are smaller than this fence (i.e. beyond the fence, toward greater negative values) may be considered candidates for outliers. The area beyond the fence to the left corresponds to a very small proportion of cases in a normal distribution.

● On the rightmost side of the plot is Q3 + 1.5 × IQR. This corresponds to the uppermost "inner fence." Observations that are larger than this fence (i.e. beyond the fence) may be considered candidates for outliers. The area beyond the fence to the right corresponds to a very small proportion of cases in a normal distribution.

● The "whiskers" in the plot (i.e. the vertical lines from the quartiles to the fences) will not typically extend as far as they do in this plot. Rather, they will extend only as far as there is a score in our data set on the inside of the inner fence (which explains why some whiskers can be very short). This helps give an idea of how compact the distribution is on each side.
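As a quick worked example using the quartiles reported earlier for verbal as a whole (Q1 = 62.75, Q3 = 84.25): IQR = 84.25 − 62.75 = 21.50, so the lower inner fence is 62.75 − 1.5 × 21.50 = 30.50 and the upper inner fence is 84.25 + 1.5 × 21.50 = 116.50. Since the verbal scores range from 49.00 to 98.00, no observation falls outside these fences, and the boxplot criterion flags no outliers for the variable taken as a whole.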


then yes, by all means, delete it forever from your data, as in this case it is a "mistake" in your data and not an actual real data point. SPSS will thankfully not automatically delete outliers from any statistical analyses, so it is up to you to run boxplots, histograms, and residual analyses (we will discuss these later) so as to attempt to spot unusual observations that depart from the rest. But again, do not be reckless with them and simply wish them away. Get curious about your extreme scores, as sometimes they contain clues to furthering the science you are conducting. For example, if I gave a group of 25 individuals sleeping pills to study their effect on sleep time, and one participant slept well below the average of the rest, such that their sleep time could be considered an outlier, it may suggest that for that person the sleeping pill had an effect opposite to what was expected, in that it kept the person awake rather than induced sleep. Why was this person kept awake? Perhaps the drug was interacting with something unique to that particular individual? If we looked at our data file further, we might see that the subject was much older than the rest of the subjects. Is there something about age that interacts with the drug to create an opposite effect? As you see, outliers, if studied, may lead to new hypotheses, which is why they may be very valuable at times to you as a scientist.

3.4 Data Transformations

Most statistical models make assumptions about the structure of data. For example, linear least-squares makes many assumptions, among which, for instance, are linearity, normality, and independence of errors (see Chapter 9). However, in practice, assumptions often fail to be met, and one may choose to perform a mathematical transformation on one's data so that it better conforms to the required assumptions. For instance, when sample data depart to a large extent from a normal distribution, one option is to perform a transformation on the variable so that it better approximates normality. Such transformations often help "normalize" the distribution, so that the assumptions of such tests as t-tests and ANOVA are more easily satisfied. There are no hard and fast rules regarding when and how to transform data in every case or situation, and often it is a matter of exploring the data and trying out a variety of transformations to see if they help. We only scratch the surface with regard to transformations here and demonstrate how one can obtain some transformed values in SPSS and their effect on distributions. For a thorough discussion, see Fox (2016).

The Logarithmic Transformation

The log of a number is the exponent to which we must raise a given base to obtain that number. For example, the natural log of the number 10 is equal to

log_e(10) = 2.302585093

Why? Because e^2.302585093 = 10, where e is a constant equal to approximately 2.7183. Notice that the "base" of these logarithms is equal to e. This is why these logs are referred to as "natural" logarithms. We can also compute common logarithms, those to base 10:

log_10(10) = 1

But why does taking logarithms of a distribution help "normalize" it? A simple example will help illustrate. Consider the following hypothetical data on a given variable:

2 4 10 15 20 30 100 1000


Though the data set is extremely small, we nonetheless notice that lower scores are closer in proximity than are larger scores. The ratio of 4 to 2 is equal to 2. The distance between 100 and 1000 is equal to 900 (and the ratio is equal to 10). How would taking the natural log of these data influence these distances? Let us compute the natural logs of each score:

0.69  1.39  2.30  2.71  2.99  3.40  4.61  6.91

Notice that the ratio of 1.39 to 0.69 is equal to 2.01, which closely mirrors that of the original data. However, look now at the ratio of 6.91 to 4.61: it is equal to about 1.5, whereas in the original data the corresponding ratio was equal to 10. In other words, the log transformation made the extreme scores more "alike" the other scores in the distribution. It pulled in extreme scores. We can also appreciate this idea by simply looking at the distances between these points. Notice that the distance between 100 and 1000 in the original data is equal to 900, whereas the distance between 4.61 and 6.91 is equal to 2.3, very much less than in the original data. This is why logarithms are potentially useful for skewed distributions. Larger numbers get "pulled in" such that they become closer together. After a log transformation, the resulting distribution will often resemble more closely that of a normal distribution, which makes the data suitable for such tests as t-tests and ANOVA.

The following is an example of data that was subjected to a log transformation. Notice how, after the transformation, the distribution is approximately normalized:

We can perform other transformations on data as well, including taking square roots and reciprocals (i.e. 1 divided by the value of the variable). Below we show how our small data set behaves under each of these transformations:

TRANSFORM → COMPUTE VARIABLE


Notice above we have named our Target Variable by the name of LOG_Y. For our example, we will compute the natural log (LN), so under Functions and Special Variables, we select LN (be sure to select Function Group = Arithmetic first). We then move Y, our original variable, under Numeric Expression so it reads LN(Y).

● The output for the log transformation appears to the right of the window, along with the other transformations that we tried (square root (SQRT_Y) and reciprocal (RECIP_Y)).

● To get the square root transformation, simply scroll down in the Functions and Special Variables list and select SQRT.
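The same three transformations can also be produced with syntax. A minimal sketch, assuming the original variable is named Y as in the example above (LN and SQRT are built-in SPSS arithmetic functions):

COMPUTE LOG_Y = LN(Y).
COMPUTE SQRT_Y = SQRT(Y).
COMPUTE RECIP_Y = 1/Y.
EXECUTE.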

But when should we use which transformation? Generally speaking, to correct negative skew in a distribution, one can try ascending the ladder of powers by first trying a square transformation. To reduce positive skew, descending the ladder of powers is advised (e.g. start with a square root or a common log transform). And as mentioned, transformations that correct one feature of data (e.g. abnormality or skewness) can often simultaneously help adjust other features (e.g. nonlinearity). The trick is to try out several transformations to see which best suits the data you have at hand. You are allowed to try out several transformations.

The following is a final word about transformations. While some data analysts take great care in transforming data at the first sign of abnormality or skewed distributions, generally, most parametric statistical analyses can be conducted without transforming data at all. Data will never be perfectly normal or linear anyway, so slight deviations from normality, etc., are usually not a problem. A safeguard with this approach is to try the given analysis with the original variable, then again with the transformed variable, and observe whether the transformation had any effect on significance tests and model results overall. If it did not, then you are probably safe not performing any transformation. If, however, a response variable is heavily skewed, it could be an indicator that a different model is required than one that assumes normality. For some situations, a heavily skewed distribution, coupled with the nature of your data, might hint that a Poisson regression is more appropriate than an ordinary least-squares regression, but these issues are beyond the scope of the current book, as for most of the procedures surveyed in this book we assume well-behaved distributions. For analyses in which distributions are very abnormal or "surprising," it may indicate something very special about the nature of your data, and you are best to consult with someone on how to treat the distribution, that is, whether to merely transform it or to conduct an alternative statistical model altogether to the one you started out with. Do not get in the habit of transforming every data set you see to appease statistical models.


4
Data Management in SPSS

Before we push forward with a variety of statistical analyses in the remainder of the book, it would do well at this point to briefly demonstrate a few of the more common data management capacities in SPSS. SPSS is excellent for performing simple to complex data management tasks, and the need for such data management skill often pops up over the course of your analyses. We survey only a few of these tasks in what follows. For details on more data tasks, either consult the SPSS manuals or simply explore the GUI on your own to learn what is possible. Trial and error with data tasks is a great way to learn what the software can do! You will not break the software! Give things a shot, see how it turns out, then try again! Getting any software to do what you want takes patience and trial and error, and when it comes to data management, often you have to try something, see if it works, and if it does not, try something else.

4.1 Computing a New Variable

Recall our data set on verbal, quantitative, and analytical scores. Suppose we wished to create a new variable called IQ (i.e. intelligence) and defined it by summing these scores. That is, we wish to define IQ = verbal + quantitative + analytical. We could do so directly in SPSS syntax or via the GUI:



We compute as follows:

● Under Target Variable, type in the name of the new variable you wish to create. For our data, that name is "IQ."

● Under Numeric Expression, move over the variables you wish to sum. For our data, the expression we want is verbal + quant + analytic.

● We could also select Type & Label under IQ to make sure it is designated as a numeric variable, as well as provide it with a label if we wanted. We will call it "Intelligence Quotient":

Once we are done with the creation of the variable, we verify that it has been computed in the Data View:

We confirm that a new variable has been created by the name of IQ. The IQ for the first case, for example, is computed just as we requested, by adding verbal + quant + analytic, which for the first case is 56.00 + 56.00 + 59.00 = 171.00.
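The same computation can be carried out directly with syntax. A minimal sketch using the variable names from this data set (COMPUTE, VARIABLE LABELS, and EXECUTE are standard SPSS commands):

COMPUTE IQ = verbal + quant + analytic.
VARIABLE LABELS IQ 'Intelligence Quotient'.
EXECUTE.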

4.2 Selecting Cases

In this data management task, we wish to select particular cases of our data set, while excluding

others. Reasons for doing this include perhaps only wanting to analyze a subset of one's data. Once we select cases, ensuing data analyses will only take place on those particular cases. For example, suppose you wished to conduct analyses only on females in your data and not males. If females are coded "1" and males "0," SPSS can select only cases for which the variable Gender = 1.
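A sketch of the filtering syntax that Select Cases typically generates for this kind of condition (the variable name Gender here is hypothetical, taken from the example above, and is not part of the IQ data set):

USE ALL.
COMPUTE filter_$ = (Gender = 1).
FILTER BY filter_$.
EXECUTE.

To return to analyzing all cases afterward, FILTER OFF. and USE ALL. restore the full data set.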
