September 28, 2010
1 Introduction: What is R?
The software package known as R is an interactive computing language and environment for statistical analysis, computing, and graphics. R is an open source software package: the source code behind the software is free for anyone to inspect and modify, and R grows rapidly as people from all fields develop new functions for use within its computing environment. This is part of what makes R extremely useful! Several complex statistical routines that are not available in other software packages have been programmed in R, and these routines are freely available for use by anyone.
R is completely interactive; users type commands and program functions as they go. The software is very similar in many ways to the commercial software package S-Plus, and offers many of the same features. R, however, can be downloaded for free, while S-Plus is a commercial package that costs money. S-Plus may be slightly easier to use than R, but after this workshop, you should be familiar enough with R and how it functions to do pretty much anything that you would like to do without a hassle!
The software provides users with a wide array of powerful graphical techniques, and this is why many researchers love using R; the graphical capabilities are tremendous and easy to implement. Once you grasp how to work with R’s graphical facilities, you will have a large supply of graphical tools at your fingertips that will enhance the appearance of your research presentations.
We strongly encourage you to visit the central web site behind the R project, which I will frequently refer to throughout this workshop:
http://www.r-project.org/
Here you will find links for downloading R, downloading additional packages for R, and everything else that you would like to know about the software or the people behind it.
2 How to Obtain R
The R Project Web Page
At the R Project Web Page, you will find a variety of information about the R Project, which you can peruse at your leisure. The most important link appears at the left-hand side of the screen, under the “Download” heading. Click on the CRAN link (Comprehensive R Archive Network), and after you choose one of the U.S. mirrors (http://cran.stat.ucla.edu/ is recommended), you will be taken to the page that you will use to download everything R-related.
Once you find the CRAN web page, take the following steps to obtain R:
1. Click on the “R Binaries” link on the left-hand side of the page under the “Software” heading.
2. Click on the folder that best describes your operating system.
3. When using Windows, click on the “base” subdirectory. This will allow you to download the base R package.
4. Click the “Download R 2.X.X for Windows” link. R is updated quite frequently, and the version number is always changing (at the time of this writing, Version 2.11.1 is available). Save the .exe file somewhere on your computer.
5. Double-click on the .exe file once it has been downloaded. A wizard will appear that will guide you through the setup of the R software on your machine.
6. Once you are finished, you should have an R icon on your desktop that gives you a shortcut to the R system. Double-click on this icon, and you are ready to go!
Adding Packages to R
At step 3 above, you also have the option of clicking on the “contrib” subdirectory. Doing this will allow you to download additional contributed packages for R. So what exactly are “additional contributed packages”? As mentioned in the introduction, R is an open source software package, meaning users of R are free to explore the code behind the software and write their own new code. Many statisticians and researchers have written additional packages for R that perform complex, less common analyses, and in order to use these packages and the functions within them, you first need to download them. The base R package comes with several additional packages, but odds are that you will discover an uncommon analysis technique in your research that requires you to install a package that is not included with the base package. There are many additional contributed packages. Don’t hesitate to explore them to see if someone has developed a package that will allow you to implement a technique that you are interested in!
To download contributed packages, follow steps 1 and 2 above, and then click on the “contrib” link. Then, follow these steps:
1. Select the version of R that you are using (the newest version for Windows at the time of this workshop is Version 2.11.1).
2. Scroll through the list of contributed packages (in zip format), and click on the package that you would like to download. You can find descriptions of all of these contributed packages and the techniques implemented within them by clicking on the “Packages” link under the “Software” heading on the CRAN web page. This page will also have links to help manuals for the packages.
3. Save the zip file in a directory on your machine that you can remember.
4. When using R, select Install package(s) from local zip files… from the Packages menu. Locate the zip file for the package that you downloaded onto your machine, click on Open, and R will install that package so that it is ready for use.
5. The package will now be ready to use when you start R!
FAQs on the CRAN Web Page
Under the “Documentation” heading on the left-hand side of the CRAN web page, click on the “FAQs” link. This will take you to an FAQ page that answers many of the most commonly asked questions about R, whether they are simple or difficult.
Searching on the CRAN Web Page
Under the “CRAN” heading on the left-hand side of the CRAN web page, you can click on the “Search” link. Although there is no formal search engine on the CRAN web page, this will take you to a set of links allowing you to search the R archives (manuals, mail, help files, etc.) for anything that you would like. This is often useful if you are faced with a tough analysis question and you want to see if another R user has addressed the question before.
Starting R / Loading Contributed Packages
At this point (if you haven’t already), you should be able to start R! If you asked for a shortcut to R to be created on your desktop, simply double-click on the R icon to start R. This will open the RGui (Graphical User Interface). You should see a window inside the RGui containing the R Console. This is where you will specify all of your commands and programs interactively, at the red command prompt.
For an example command, we’ll load a contributed package into R for use. Let’s download the “quantreg” package from the CRAN mirror and save it to the desktop, and then install the package as described above. After the package has been installed, simply type library(quantreg) at the command prompt:
> library(quantreg)
Press Enter after you type this command to submit it to R. If you don’t see anything aside from another command prompt, the package was loaded successfully, and you can use all of the functions associated with it!
If you see the error message
Error in library(quantreg) : There is no package called 'quantreg'
you did not extract the quantreg package correctly (see “Adding Packages to R” above). A contributed package must be downloaded and extracted into the R library folder correctly in order for you to use it.
This is how you load contributed packages into R for your personal use. When you submit a command to R, you will see either nothing but another command prompt (good), a result (good), or an error message (bad).
An even quicker way to install packages is to select “Install package(s)…” from the Packages menu. You can pick a CRAN mirror, and then directly install a package and all of its related components. This is probably the quickest way to install a package. You would still need to load the package in order to use its functions.
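The same installation can also be done from the command prompt. The following is a minimal sketch, assuming an internet connection and a CRAN mirror: the helper load_or_install() is a hypothetical name (it is not part of R), while requireNamespace(), install.packages(), and library() are standard R functions.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then load it so that its functions can be used.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads the package and its dependencies from CRAN
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call loads it without downloading anything
load_or_install("stats")
```

Calling load_or_install("quantreg") would attempt the download first if quantreg is not yet installed.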
At any point in a given R session, you can submit the command
> installed.packages()
to view the packages that have been installed.
You are now ready to use R!
3 Help Tools
In most well-written statistical software packages, help is never far away. This holds true for R. Although the help is somewhat technical in nature and requires a good understanding of the R language, it is very easy to access. Once you’ve gained some experience in working with the R language, 90% of your help questions will be directed at how particular functions in R work, what arguments they take, etc. For help on ANY function in R that is a part of a package that has already been loaded, simply type and submit
> help(function.name)
If you would like to see a list of all of the functions that come with the base R package, including brief descriptions of each, you can simply type
> library(help = "base")
to generate the list.
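Two other standard functions are handy when you only need a quick reminder rather than a full help page. As a small sketch: args() shows the arguments of a function, and apropos() finds functions whose names match a pattern. The object name read.funs is chosen here purely for illustration.

```r
# args() shows a function's argument list without opening the help page
args(read.table)

# apropos() returns the names of all visible objects matching a pattern;
# here, everything whose name starts with "read."
read.funs <- apropos("^read\\.")
print(read.funs)

# help(read.table) or ?read.table would open the full help page
```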
Hint: Don’t forget, R is an open source language! If you want to see exactly how a given function has been written, simply type
> fix(function.name)
to see the code in a program editor. You can copy it, update it, and do whatever else you would like with it. Just make sure not to save any changes to the code behind a function unless you know that they will work!
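If you only want to read a function’s code without any risk of changing it, you can print it instead of opening an editor. A brief sketch using the standard sd() function as the example:

```r
# Typing a function's name without parentheses prints its source code
sd

# formals() lists the arguments and their default values
formals(sd)

# body() returns just the code inside the function
body(sd)
```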
Another easy way of obtaining help through a web browser is to type and submit
> help.start()
Doing so will open up a web-based help system that is very easy to navigate.
A third and obvious way to obtain help is via the Help menu when you are working in the R Console. Here you will find FAQs, help with navigating the console, and most importantly the official R manuals from the authors of R themselves. Again, these are somewhat technical in nature, but very useful once you have been working with R a lot. I would recommend the “An Introduction to R” manual very highly.
Finally, don’t hesitate to contact the Center for Statistical Consultation and Research, or CSCAR (734.764.7828; cscar@umich.edu, online.stat@umich.edu) if you need further assistance with performing your analyses in R!
4 Importing / Exporting Data Sets
The “bank2” Data Set
All of the following examples of using R will be demonstrated using a data set that appears in a variety of formats in the archive at www.umich.edu/~bwest. The “bank2” data set contains a variety of information on each of the 474 employees who work for a large bank. The most important first step in using R for statistical analysis is of course to import a data set!
Objects in R
Before you can successfully import a data set, you need to know about objects in R. The entire R computing environment is based on objects. What exactly is an object? Objects take numerous forms. For example, the command
> nine <- 9
will create an object called nine. In this case, we have an object that is nothing more than a number (9). You can “look” at an object by simply typing the name of the object at the command prompt:
> nine
[1] 9
R, in this case, returns the value of the object (a number). Many objects (such as results of analyses) are much more complex, and there are ways to look at specific aspects of objects. Fields within objects (e.g., variables within data set objects, or parts of result objects, such as the estimated regression coefficients that come from a regression analysis) can be accessed using the object$field command. Suppose we run a regression analysis, and then want to investigate the resulting coefficients. Submitting the command
> fit <- lm(mo.fail ~ lc, data)
tells R to fit a simple linear regression model to two variables in a data set object named data, using the lm() function for general linear modeling. We are trying to predict the mo.fail variable with the lc variable, and storing the results of the regression analysis in an object called fit. This object contains the results of a regression analysis. So what does it look like?
> fit$coef
Notice how the object$field command was submitted. The “coef” result is a field located in the “fit” object. If you want to see a list of all possible fields located within an object (for example, object.name), you can ask R to list them. Objects containing results actually contain several pieces of information: you just need to know how to access them. We will go over objects in great detail.
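To make the object$field idea concrete, here is a minimal sketch that can be run without any external file. The data values are hypothetical toy numbers (only the variable names mo.fail and lc follow the text), and names() is one standard way of listing the fields of an object; it is not necessarily the command used in the original workshop.

```r
# Hypothetical toy data; the variable names mo.fail and lc follow the text
data <- data.frame(lc      = c(1, 2, 3, 4, 5),
                   mo.fail = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(mo.fail ~ lc, data)   # store the regression results in an object

class(fit)    # the kind of object that lm() returns
names(fit)    # the fields stored inside the object
fit$coef      # one field: the estimated intercept and slope
```

Note that fit$coef works because R matches the partial name coef to the full field name coefficients.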
Hint: Previously submitted commands can be recovered by using the up and down arrow keys. This is helpful for editing old commands that may have resulted in errors.
Hint: To delete old objects in R that you no longer wish to work with, simply type
> rm(object.name)
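As a short sketch of this housekeeping, the standard ls() function lists the objects currently in your workspace, so you can check what exists before and after removing something. The object name scratch is hypothetical.

```r
scratch <- c(4, 8, 15)   # hypothetical object created for illustration
ls()                     # list the objects in the workspace
rm(scratch)              # delete the object
exists("scratch")        # FALSE once the object is gone
```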
Importing Data
Now that you have seen objects in R for the first time, you can import external data sets into objects. In the simple regression example above, data was an object containing a data set.
THE GOAL OF IMPORTING DATA: TO CREATE A DATA SET SUITABLE FOR PERFORMING STATISTICAL AND GRAPHICAL ANALYSIS IN R.
The easiest way to import data into R from some other software package is to save the data set in a text format (preferably delimited or space-separated) in the other package and then import the text data into R. On the workshop flash drive is a data set named “bank2.dat”, which was exported from SPSS as a tab-delimited text file (any kind of white space separation between variables is fine for importing data into R). Notice that variable names were included in the first row; this is also recommended for importing data sets into R.
read.table()
The read.table() function is used in R for importing text data into data set objects. This function requires that you have a valid data table in a text format (where rows are observations, and columns are variables) with every cell containing a data point. If there are any blanks, the function will not work properly. Missing values should be coded as NA before attempting to import text data. Columns should be separated by white space.
To import the “bank2.dat” data set into R using the read.table() function, start R, and then submit the following command:
> bank2 <- read.table("e:\\bank2.dat", h=T)
When using the read.table() function, the first argument should be the file name in double quotes, with double backslashes (or single forward slashes) instead of single backslashes separating the directories in the path to the file. The second argument, h=T, is short for header = TRUE and tells R that there is a header in the text data file containing variable names. This is recommended when reading in data! It assigns variable names to the fields in the data set object that will be created, and makes data analysis easier.
Unfortunately, you’ll immediately see the following error message after submitting the command:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 434 did not have 10 elements
Remember, seeing nothing after submitting a command is a good thing. In this case, we had a problem: there were not 10 data points in line 434 of the data file (10 variables require 10 data points). Edit the raw data using Notepad and type “NA” in the third column for ID #434, telling R that this value is missing. Then resubmit the command! Now you won’t see an error, and you’ve created a data set object called bank2. You can take a look at the data (the object) by simply typing the object name at the command prompt!
> bank2
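Returning to the import error above: one way to find incomplete lines before editing the raw file is the standard count.fields() function. The sketch below builds a tiny hypothetical tab-delimited file (three variables, one blank cell) in a temporary location, so it does not depend on bank2.dat; all file contents and object names are invented for illustration.

```r
# Hypothetical raw file: the third line of the file is missing its SALARY value
raw <- tempfile(fileext = ".dat")
writeLines(c("ID\tGENDER\tSALARY",
             "1\tm\t57000",
             "2\tf\t",
             "3\tf\t21450"), raw)

# count.fields() reports how many white-space separated fields each line has
n.fields  <- count.fields(raw)
bad.lines <- which(n.fields != n.fields[1])
bad.lines   # line numbers (counting the header) that need attention

# After typing NA into the gap, read.table() imports the file without error
fixed <- tempfile(fileext = ".dat")
writeLines(c("ID\tGENDER\tSALARY",
             "1\tm\t57000",
             "2\tf\tNA",
             "3\tf\t21450"), fixed)
toy <- read.table(fixed, header = TRUE)
toy
```

The blank cell goes unnoticed by a white-space reader because the trailing tab is treated as nothing at all, which is why the line appears one field short.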
If you are working with a large data set, it is sometimes useful to just take a look at the first five or so rows of the data set to see if the import appeared to work. To do this, issue the command
> bank2[1:5,]
This tells R to display all variables in the first five rows of the bank2 data set object. To look at all of the values for a specific variable in the data set object, issue the command
> bank2$SALARY
Remember, a data set object is an example of an object that contains several fields (in this case, the fields are variables). To look at the fields, you can use the object$field command. Field names are case-sensitive.
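The row and field selections described above can be tried on any data set object. Here is a small sketch using a hypothetical data frame named toy with made-up values; head() is a standard function that does the same job as the [1:5,] selection.

```r
# Hypothetical data set object with made-up values
toy <- data.frame(ID     = 1:8,
                  SALARY = c(57000, 40200, 21450, 21900,
                             45000, 32100, 36000, 21900))

toy[1:5, ]      # first five rows, all variables
head(toy, 5)    # the same rows via head()
toy$SALARY      # all values of one variable
toy$salary      # NULL: field names are case-sensitive
```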
The read.table() function will read in either numeric or character data, depending on the format of the variables. As a result, it’s probably a better idea to store dates in a numeric format before importing them (e.g., month, day, and year variables in numeric format), because they will just be imported as characters otherwise, and will not be that useful. For additional options associated with the read.table() function, simply type
> help(read.table)
Other Useful Importing Functions
Some analysts prefer working with comma-delimited or comma-separated data files, and it is also easy to read these files into R using the read.csv() function. For example, open the “bank2.xls” file in Excel, and then save the file as a comma-separated .csv file. Then, issue this command:
> bank2 <- read.csv("e:\\bank2.csv",h=T)
This will read the comma-separated data file (with a header containing variable names!) into a data set object called bank2. This function will automatically convert missing data points to NA values for you, which is a nice feature. The read.delim() function works in a similar way for tab-delimited data files, and could have been used to read in the earlier tab-delimited data file exported from SPSS.
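The automatic handling of blanks can be seen in a small self-contained sketch. The file below is a hypothetical three-row csv file written to a temporary location; the second row has an empty SALARY cell.

```r
# Hypothetical csv file with one empty numeric cell
csv.file <- tempfile(fileext = ".csv")
writeLines(c("ID,SALARY,EDUC",
             "1,57000,15",
             "2,,16",
             "3,21450,12"), csv.file)

toy <- read.csv(csv.file, header = TRUE)
toy
is.na(toy$SALARY)   # flags the cell that was blank in the file
```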
If you are working with raw data in a fixed-width format, where the variables are not separated by delimiters (like spaces or commas) but rather stored in pre-specified columns, you can use the read.fwf() function. For example, suppose you have a text data file containing ages and weights for four boys. You can use read.fwf() to read the data into a data set object named fwf. The c(2,3) argument tells R that the first variable is in the first two columns, and the next variable is in the next three columns. The col.names argument supplies a vector of names for the variables that will be in the new data set object. You use the c() function in R to define vectors, as we shall see.
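As a minimal sketch of such a call: the four lines of data below are hypothetical values invented for illustration (they are not the values from the original handout), laid out with age in columns 1 to 2 and weight in columns 3 to 5.

```r
# Hypothetical fixed-width file: age in columns 1-2, weight in columns 3-5
fwf.file <- tempfile()
writeLines(c("12 85",
             "13 92",
             "11 78",
             "14101"), fwf.file)

# widths gives the number of columns occupied by each variable
fwf <- read.fwf(fwf.file, widths = c(2, 3), col.names = c("age", "weight"))
fwf
```

The last line shows why fixed widths matter: "14101" contains no delimiter at all, yet it is split into an age of 14 and a weight of 101.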
When you download R, one of the packages that should come with the base package is the foreign package (see “Adding Packages to R” above). The foreign package contains a suite of functions that allows you to automatically read in data sets from SAS, SPSS, STATA, EPIINFO, MINITAB, and S+. To load the foreign package into R, simply issue the command
> library(foreign)
The following functions are now available for reading in data from these other packages:
Package    Function         Required Argument
SAS        read.xport()     SAS Transport (.xpt) file
SPSS       read.spss()      SPSS .sav file
STATA      read.dta()       STATA .dta file
EPIINFO    read.epiinfo()   EPIINFO .rec file
MINITAB    read.mtp()       MINITAB portable worksheet
S+         read.S()         S+ binary object
You can type help(read.xxxx) for more information on any of these importing functions and how they work. They are extremely easy to use! For example, to import the “bank2.dta” file into a data set object named stata, simply issue the command
> stata <- read.dta("e:\\bank2.dta")
You should definitely investigate the help files to see additional options that you can specify when reading in data sets of these types. Each function was written by a different user, and they all work differently and have different defaults.
Many packages in R come with pre-loaded data sets that you can look at for secondary analysis. To get an idea of the data set objects that are currently available in the packages loaded for your R session, you can type
> data()
at the prompt. The window that appears will identify data sets that you can use, and what packages they belong to. To load a data set saved in an R package, you can type
> data(object.name)
to load the data set object into R’s memory. For more information on these data sets, you can type
> help(object.name)
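For instance, the mtcars data set ships with R in the datasets package, so the following short sketch should work in any R session:

```r
data(mtcars)       # load the mtcars data set from the datasets package
dim(mtcars)        # number of rows and columns
head(mtcars, 3)    # first three rows
# help(mtcars) would open the description of the data set
```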
However you decide to import data, once you have a data set object in R, you have the world at your fingertips! All of the powerful functions that R offers which can be applied to data sets are now available to you. Importing data is definitely the hard part.
Exporting Data
R is not commonly used for large data management and manipulation projects, but occasionally the need will arise to export data sets from R so that other analysts can use them.
In R, the write.table() function is used to export “rectangular” text data, or data sets defined by rows and columns. One of the options available when using the write.table() function is the sep option, which lets you define a delimiter in the exported data set. So, to export the bank2 data set object that we have created to a comma-separated rectangular text file, you can submit the command
> write.table(bank2,"e:\\bank2.txt",sep=",")
There are several other options available when using this function, all of which can be viewed by typing
> help(write.table)
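One option worth knowing about is row.names: by default write.table() also writes the row names as an extra first column, which other programs may not expect. The sketch below, using a hypothetical data frame and a temporary file, turns that off and then reads the file back in to confirm that the export worked.

```r
# Hypothetical data set object with made-up values
toy <- data.frame(ID = 1:3, SALARY = c(57000, 40200, 21450))

out.file <- tempfile(fileext = ".txt")
# row.names = FALSE keeps the row labels out of the exported file
write.table(toy, out.file, sep = ",", row.names = FALSE)

readLines(out.file)          # inspect the raw text that was written
back <- read.csv(out.file)   # read the exported file back in
back
```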
If you need to export large numeric matrices to text files, you can use the write.matrix() function. This function is located in the MASS package, so you need to load that package before using it. The MASS package contains several useful statistical functions, and is used quite frequently in R. Many users of R in fact begin an R session by loading the MASS package, because of all of the functions that it offers.
5 Graphical Techniques
Once you have sufficiently prepared a data set for analysis, initial exploratory analyses of the data should always involve the production of some graphs. R is very well known for its ability to produce high-quality, publication-ready graphs with a few very simple commands. There are an amazing number of graphical techniques at your fingertips when using R, many of which are fully interactive!
Some of the simpler graphing tools for descriptive statistical analyses will be displayed in the next section; it would take weeks to cover all of the potential graphing functions available to you in R. The purpose of this section is to introduce you to working with R’s graphing functions.
First, there are three main types of graphing functions in the R language. High-level functions create a brand new plot or figure. Low-level functions add information to an existing plot. Finally, interactive graphics functions allow you to interactively add information to a plot, which is both useful and fun. There are many useful functions, and each function has a great deal of options! Don’t forget to use the help features to find out what these options are.
R’s interactive graphical techniques are a strong reason for its growing popularity among applied researchers. For a demonstration of some of these popular techniques, we’ll start by creating a scatter plot of SALARY and SALBEGIN, and including both the fitted linear regression line and a fitted smooth regression curve, with different line types:
> plot(bank2$SALBEGIN,bank2$SALARY,main="SALARY Regression Example",xlab="Beginning",ylab="Current") # High Level
> abline(lm(bank2$SALARY~bank2$SALBEGIN),col=2,lty=1) # Low Level
> lines(lowess(bank2$SALARY ~ bank2$SALBEGIN),col=2,lty=2) # Low Level
Next, we want to interactively add a legend to the plot using the legend() function, identifying the dashed line as the Lowess fit and the straight line as the Linear fit:
> legend(locator(1),c("Linear","Lowess"),lty=c(1,2),col=2)
Notice that nothing happens after you submit this function, but you see an hourglass for the mouse pointer. R is waiting for you to place the legend somewhere in the plot, by pointing in the graph window and clicking where you want the upper left-hand corner of the legend to be. Go ahead and do that to place the legend. This is an example of you being able to interact with the graphing window. When you have placed the legend, you will be able to continue submitting commands.
Let’s look at the legend function in a little more detail. The order of the labels in the character vector corresponds to the vector of line types declared in the lty option. You can also specify a vector of colors that matches with the labels in the col option. For now, we want everything in red.
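If you want to reproduce a figure like this in a script, where nobody is available to click, legend() also accepts a position keyword such as "topleft" in place of locator(1). The sketch below assumes simulated, hypothetical salary values (not the bank2 data) so that it runs on its own; the object names begin, current, lin.fit, and low.fit are chosen for illustration.

```r
set.seed(1)  # simulated, hypothetical salaries; not the bank2 data
begin   <- seq(10000, 40000, length.out = 50)
current <- 5000 + 1.9 * begin + rnorm(50, sd = 4000)

plot(begin, current, main = "SALARY Regression Example",
     xlab = "Beginning", ylab = "Current")            # high level
lin.fit <- lm(current ~ begin)
abline(lin.fit, col = 2, lty = 1)                     # low level
low.fit <- lowess(begin, current)
lines(low.fit, col = 2, lty = 2)                      # low level

# A position keyword replaces locator(1), so no mouse click is needed
legend("topleft", c("Linear", "Lowess"), lty = c(1, 2), col = 2)
```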
Now that you have a legend, we want to identify unusual data points, or points that might be significantly affecting the fits of these lines. To do this, we can use the interactive identify() function:
> identify(bank2$SALBEGIN,bank2$SALARY,bank2$ID)
Again, note how R is waiting for you to click in the graphing window. The first two variables in this function identify the scatterplot, and the third variable contains the labels or names for the data points in the scatterplot. R is waiting for you to click on the data points, and when you do so, R will label the data points that you have clicked on according to the third variable. Try clicking on the point corresponding to a beginning salary of ~$30,000 and a current salary of over $100,000. This case definitely does not follow the pattern of the rest of the data, and may have a significant influence on the fit of the lines. When you have clicked on this point (which should be ID #18), click on the STOP sign near the R menu bar to turn off the interactive function. You’ll get a report in the R console of all the points that you clicked on! Perhaps we need to look at this case in a little bit more detail.
Suppose we want to compare, in an interactive manner, what the predicted values would be on the two fitted curves for the same value of beginning salary. Another handy interactive tool that R provides is the locator() function. After typing this function at the command prompt, you can click multiple times anywhere on a plot, and R will return the coordinates of the points where you clicked after you either click on the STOP sign or click on a pre-specified number of points. So try typing the following:
> abline(v=40000, lty=3) #draw a dotted vertical line
> locator(2) #invoke the locator() function, for 2 points
We want to compare predicted values of current salary when beginning salary is equal to $40,000. Once you’ve submitted the locator function, click where the vertical line intersects each of the fitted lines. R will then report the x and y coordinates of the points that you clicked in the console.
For a final example of interactivity, we’ll add some text and labels to our plot. We want to add some text that will remind us to “Investigate This Case,” referring to case 18. To add text to a plot, you can use the text() function. The first two arguments define the x and y coordinates where you want the text to be placed (by default, the text is centered on that point), and the third argument is the text that you want. An optional argument to the text() function is the cex option, which tells R either to decrease (< 1) or increase (> 1) the text size. Here is a command that will write this text on our plot:
> text(15000,120000,"Investigate This Case",cex=0.7)
There is also a function in R that will let you draw arrows on a plot, which is a nice feature for graphs that you are planning to publish. If we want to draw an arrow from the text that we just added to the plot to the data point in question, we can use the arrows() function, where the first two arguments are the x and y coordinates where the arrow will begin, and the next two arguments are the x and y coordinates where the arrow will end. The locator() function is a convenient way to determine the exact points for these coordinates first.
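As a rough sketch of how text() and arrows() fit together, the example below draws a tiny stand-alone plot. All coordinates are hypothetical values picked by hand rather than with locator(), and the two plotted points merely stand in for the real scatter plot.

```r
# Hypothetical coordinates, chosen by hand rather than with locator()
plot(c(12000, 30000), c(35000, 103750),
     xlab = "Beginning", ylab = "Current")

# Place the reminder text, centered at x = 15000, y = 95000
text(15000, 95000, "Investigate This Case", cex = 0.7)

# Draw an arrow from (19500, 95000) to (29500, 103000), which lies just
# short of the plotted point; length sets the arrow head size in inches
arrows(19500, 95000, 29500, 103000, length = 0.1)
```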
This is just the beginning when it comes to graphing in R! But these examples should at least give you an idea of the basic steps in creating publication-quality graphs and figures in R. I urge you to consult the references and help manuals for further examples and additional graphical functions that you can work with in R.
To export any of the graphs that you create in R, click on the window that the graph is in (the active graphics window), select Save As… from the File menu, and choose the format that you want to save the graph in.
6 Descriptive Statistical Analyses
An important first step in any thorough analysis of a set of data is descriptive statistical analysis, and this still holds true in R. Once you have successfully prepared a data set object for analysis, you should summarize the data and see exactly what it is that you will be working with. These simple initial analyses are often quite useful in finding outliers, faulty values, and unusual distributions. Being careful and eliminating errors in a data set is essential, and often takes much more time than the formal statistical analysis itself.
A good first step in summarizing a data set object is to use the dim() function to check the dimensions of the data set. This will tell you whether or not you will be working with as many records and variables (rows and columns) as you think you should be! We’ll start a simple descriptive analysis of the bank2 data set by reading in the (corrected) raw data and checking the dimensions with dim(bank2).
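As a minimal sketch of this first check, using a small hypothetical data frame rather than the bank2 file: dim() returns the number of rows followed by the number of columns, and the standard str() function adds a compact overview of each variable’s type.

```r
# Hypothetical data set object with made-up values
toy <- data.frame(ID     = 1:4,
                  EDUC   = c(15, 16, 12, 8),
                  SALARY = c(57000, 40200, 21450, 21900))

dim(toy)    # rows (records) first, then columns (variables)
nrow(toy)   # number of records only
ncol(toy)   # number of variables only
str(toy)    # type and first few values of every variable
```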
Obtaining Univariate Statistics
The next step should be to obtain simple univariate numerical summaries of the variables in their original format. This can be accomplished in R using the summary() function on a data set object:
> summary(bank2)
       ID         GENDER         BDATE          EDUC           JOBCAT
 Min.   :  1.0   f:216   10/20/1959:  2   Min.   : 8.00   Min.   :1.000
 1st Qu.:119.3   m:258   11/10/1965:  2   1st Qu.:12.00   1st Qu.:1.000
 Median :237.5           2/12/1964 :  2   Median :12.00   Median :1.000
 Mean   :237.5           2/4/1934  :  2   Mean   :13.49   Mean   :1.411
 3rd Qu.:355.8           2/8/1962  :  2   3rd Qu.:15.00   3rd Qu.:1.000
 Max.   :474.0           (Other)   :463   Max.   :21.00   Max.   :3.000
                         NA's      :  1
     SALARY          SALBEGIN        JOBTIME         PREVEXP
 Min.   : 15750   Min.   : 9000   Min.   :63.00   Min.   :  0.00
 1st Qu.: 24000   1st Qu.:12488   1st Qu.:72.00   1st Qu.: 19.25
 Median : 28875   Median :15000   Median :81.00   Median : 55.00