Statistical Application Development with R and Python - Second Edition
What this book covers
What you need for this book
Who this book is for
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
Installing and setting up R
Using R packages
RSADBE – the book's R package
Python installation and setup
Using pip for packages
IDEs for R and Python
The companion code bundle
Discrete distributions
Discrete uniform distribution
Packages and settings – R and Python
Understanding data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
What just happened?
Doing it in Python
Time for action – matrix computations
What just happened?
Doing it in Python
The list object
Time for action – creating a list object
What just happened?
The data.frame object
Time for action – creating a data.frame object
What just happened?
Have a go hero
The table object
Time for action – creating the Titanic dataset as a table object
What just happened?
Have a go hero
Using utils and the foreign packages
Time for action – importing data from external files
What just happened?
Doing it in Python
Importing data from MySQL
Doing it in Python
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
What just happened?
Managing R sessions
Time for action – session management
What just happened?
Doing it in Python
Pop quiz
Summary
3 Data Visualization
Packages and settings – R and Python
Visualization techniques for categorical data
Bar chart
Going through the built-in examples of R
Time for action – bar charts in R
What just happened?
Doing it in Python
Have a go hero
Dot chart
Time for action – dot charts in R
What just happened?
Doing it in Python
Spine and mosaic plots
Time for action – spine plot for the shift and operator data
What just happened?
Time for action – mosaic plot for the Titanic dataset
What just happened?
Pie chart and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
What just happened?
Time for action – plot and pairs R functions
What just happened?
Doing it in Python
Have a go hero
Pareto chart
A brief peek at ggplot2
Time for action – qplot
What just happened?
Time for action – ggplot
What just happened?
Pop quiz
Summary
4 Exploratory Analysis
Packages and settings – R and Python
Essential summary statistics
Percentiles, quantiles, and median
Hinges
Interquartile range
Time for action – the essential summary statistics for The Wall dataset
What just happened?
Techniques for exploratory analysis
The stem-and-leaf plot
Time for action – the stem function in play
What just happened?
Time for action – the bagplot display for multivariate datasets
What just happened?
Resistant line
Time for action – resistant line as a first regression model
What just happened?
Smoothing data
Time for action – smoothening the cow temperature data
What just happened?
Median polish
Time for action – the median polish algorithm
What just happened?
Have a go hero
Summary
5 Statistical Inference
Packages and settings – R and Python
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
What just happened?
Doing it in Python
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
What just happened?
Confidence intervals
Time for action – confidence intervals
What just happened?
Doing it in Python
Hypothesis testing
Binomial test
Time for action – testing probability of success
What just happened?
Tests of proportions and the chi-square test
Time for action – testing proportions
What just happened?
Tests based on normal distribution – one sample
Time for action – testing one-sample hypotheses
What just happened?
Have a go hero
Tests based on normal distribution – two sample
Time for action – testing two-sample hypotheses
What just happened?
Have a go hero
Doing it in Python
Summary
6 Linear Regression Analysis
Packages and settings - R and Python
The essence of regression
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action - the arbitrary choice of parameters
What just happened?
Building a simple linear regression model
Time for action - building a simple linear regression model
What just happened?
Have a go hero
ANOVA and the confidence intervals
Time for action - ANOVA and the confidence intervals
What just happened?
Model validation
Time for action - residual plots for model validation
What just happened?
Doing it in Python
Have a go hero
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action - averaging k simple linear regression models
What just happened?
Building a multiple linear regression model
Time for action - building a multiple linear regression model
What just happened?
The ANOVA and confidence intervals for the multiple linear regression model
Time for action - the ANOVA and confidence intervals for the multiple linear regression model
What just happened?
Have a go hero
Useful residual plots
Time for action - residual plots for the multiple linear regression model
What just happened?
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action - addressing the multicollinearity problem for the gasoline data
What just happened?
Doing it in Python
Model selection
Stepwise procedures
The backward elimination
The forward selection
The stepwise regression
7 Logistic Regression Model
Packages and settings – R and Python
The binary regression problem
Time for action – limitation of linear regression model
What just happened?
Probit regression model
Time for action – understanding the constants
What just happened?
Doing it in Python
Logistic regression model
Time for action – fitting the logistic regression model
What just happened?
Doing it in Python
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – Hosmer-Lemeshow goodness-of-fit statistic
What just happened?
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for logistic regression model
What just happened?
Doing it in Python
Have a go hero
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
What just happened?
Have a go hero
Receiving operator curves
Time for action – ROC construction
What just happened?
Doing it in Python
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
What just happened?
Doing it in Python
Have a go hero
Summary
8 Regression Models with Regularization
Packages and settings – R and Python
The overfitting problem
Time for action – understanding overfitting
What just happened?
Doing it in Python
Have a go hero
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
What just happened?
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
What just happened?
Ridge regression for linear models
Protecting against overfitting
Time for action – ridge regression for the linear regression model
What just happened?
Doing it in Python
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
What just happened?
Another look at model assessment
Time for action – selecting iteratively and other topics
What just happened?
Pop quiz
Summary
9 Classification and Regression Trees
Packages and settings – R and Python
Understanding recursive partitions
Time for action – partitioning the display plot
What just happened?
Splitting the data
The first tree
Time for action – building our first tree
What just happened?
Constructing a regression tree
Time for action – the construction of a regression tree
What just happened?
Constructing a classification tree
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Classification tree for the German credit data
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Have a go hero
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
What just happened?
Pop quiz
Summary
10 CART and Beyond
Packages and settings – R and Python
Improving the CART
Time for action – cross-validation predictions
What just happened?
Understanding bagging
The bootstrap
Time for action – understanding the bootstrap technique
What just happened?
How the bagging algorithm works
Time for action – the bagging algorithm
What just happened?
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Second edition: August 2017
ISBN 978-1-78862-119-9
www.packtpub.com
About the Author
Prabhanjan Narayanachar Tattar has a combined twelve years of experience with R and Python software. He has also authored the books A Course in Statistics with R, Wiley, and Practical Data Science Cookbook, Packt. The author has built three packages in R titled gpk, RSADBE, and ACSWR. He obtained a PhD (statistics) from Bangalore University in the broad area of survival analysis and has published several articles in peer-reviewed journals. During the PhD program, the author received young statistician honors in the form of the IBS(IR)-GK Shukla Young Biometrician Award (2005) and the Dr. U.S. Nair Award for Young Statistician (2007), and also held a Junior and Senior Research Fellowship at CSIR-UGC.
Prabhanjan has worked in various positions in the analytics industry and has nearly 10 years of experience in using statistical and machine learning techniques.
I would like to thank the readers and reviewers of the first edition; it is thanks to their constructive criticism that a second edition has been possible. The R and Python open source community deserves huge applause for making the software so complete that it is almost akin to rubbing a magical lamp.
I continue to express my gratitude to all the people mentioned in the previous edition. My family has been at the forefront, as always, in extending their cooperation, and whenever I am working on a book, they understand that the weekends will have to be spent on the idiot box.
Profs. D. D. Pawar and V. A. Jadhav were my first two Statistics teachers, and I learnt my first craft from them during 1996-99 at the Department of Statistics, Science College, Nanded. Prof. Pawar has been very kind and generous towards me and invited me in March 2015 to deliver some R talks from the first edition. Even 20 years later, they are the flag-bearers of the subject in the Marathawada region, and it is with profound love and affection that I express my gratitude to both of them. Thank you a lot, sirs.
It was a mere formal dinner meeting with Tushar Gupta in Chennai a month ago, and we thought of getting the second edition out. We both were convinced that if we worked in sync and did parallel publication processing, we would finish this task within a month. And it has been a roller-coaster ride with Menka Bohra, Snehal Kolte, and Dharmendra Yadav, thanks to whom the book is a finished product in record time. My special thanks to this wonderful Packt team.
About the Reviewers
Dr. Ratnadip Adhikari received his B.Sc. degree with Mathematics Honors from Assam University, India, in 2004 and his M.Sc. in applied mathematics from the Indian Institute of Technology, Roorkee, in 2006. After that, he obtained an M.Tech. in Computer Science and Technology and a Ph.D. in Computer Science, both from Jawaharlal Nehru University, New Delhi, India, in 2009 and 2014, respectively.
He worked as an Assistant Professor in the Computer Science & Engineering (CSE) Dept. of the LNM Institute of Information Technology (LNMIIT), Jaipur, Rajasthan, India. At present, he works as a Senior Data Scientist at Fractal Analytics, Bangalore, India. His primary research interests include pattern recognition, time series forecasting, data stream classification, and hybrid modeling. The research works of Dr. Adhikari have been published in various reputed international journals and at conferences. He has attended a number of conferences and workshops throughout his academic career.
Ajay Ohri is the founder of Decisionstats.com and has 14 years of work experience as a data scientist. He advises multiple startups in analytics off-shoring, analytics services, and analytics education, as well as on using social media to enhance buzz for analytics products. Mr. Ohri's research interests include spreading open source analytics, analyzing social media manipulation with mechanism design, simpler interfaces for cloud computing, and investigating climate change and knowledge flows.
He founded Decisionstats.com in 2007, a blog which has gathered more than 100,000 views annually for the past 7 years.
His other books include R for Business Analytics (Springer 2012), R for Cloud Computing (Springer 2014), and Python for R Users (Wiley 2017).
Abhinav Rai has been working as a Data Scientist for nearly a decade and is currently working at Microsoft. He has experience working in telecom, retail marketing, and online advertisement. His areas of interest include the evolving techniques of machine learning and the associated technologies. He is especially interested in analyzing large and humongous datasets and likes to generate deep insights in such scenarios. Academically, he holds a double master's degree: in Mathematics from Deendayal Upadhyay Gorakhpur University, with an NBHM scholarship, and in Computer Science from the Indian Statistical Institute. Rigor and sophistication are a surety with his analytical deliveries.
www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <customercare@packtpub.com> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788621190.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
R and Python are interchangeably required languages these days for anybody engaged with data analysis. The growth of these two languages and their inter-dependency creates a natural requirement to learn them both. Thus, it was natural where the second edition of my previous title, R Statistical Application Development by Example, was headed. I thus took this opportunity to add Python as an important layer, and hence you will find Doing it in Python sections spread throughout the book. Now, the book is useful on many fronts: for those who need to learn both languages, and for those who use R and need to switch to Python, and vice versa. While the abstract development of ideas and algorithms has been retained in R only, the standard and more commonly required data analysis techniques are available in both languages now. The only reason for not providing the Python parallel in every case is to keep the book from becoming too bulky.
The open source language R is fast becoming one of the preferred companions for statistics, even as the subject continues to add many friends in machine learning, data mining, and so on among its already rich scientific network. The era of mathematical theory and statistical application embeddedness is truly a remarkable one for society, and R and Python have played a pivotal role in it. This book is a humble attempt at presenting statistical models through R for any reader who has a bit of familiarity with the subject. In my experience of practicing the subject with colleagues and friends from different backgrounds, I realized that many are interested in learning the subject and applying it in their domain, which enables them to take appropriate decisions in analyses that involve uncertainty. A decade earlier, my friends would have been content with being pointed to a useful reference book. Not so anymore! The work in almost every domain is done through computers, and naturally they do have their data available in spreadsheets, databases, and sometimes in plain text format. The request for an appropriate statistical model is invariably followed by a one-word question: software? My answer to them has always been a single-letter reply: R! Why? It is really a very simple decision, and it has been my companion over the last seven years. In this book, this experience has been converted into detailed chapters and a cleaner breakup of model building in R.
A by-product of my interactions with colleagues and friends who are all aspiring statistical model builders has been that I have been able to pick up the trough of their learning curve of the subject. The first attempt towards fixing the hurdle has been to introduce the fundamental concepts that beginners are most familiar with, which is data. The difference is simply in the subtleties, and as such I firmly believe that introducing the subject on their turf motivates the reader for a long way in their journey. As with most statistical software, R provides modules and packages which cover many of the recently invented statistical methodologies. The first five chapters of the book focus on the fundamental aspects of the subject and the R language, and hence cover R basics, data visualization, exploratory data analysis, and statistical inference.
The foundational aspects are illustrated using interesting examples and set up the framework for the next five chapters. Linear and logistic regression models, being at the forefront, are of paramount importance in applications. The discussion is more generic in nature and the techniques can be easily adapted across different domains. The last two chapters have been inspired by the Breiman school, and hence the modern method of using classification and regression trees has been developed in detail and illustrated through a practical dataset.
What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of the installation of R and Python and their related packages. Discrete and continuous random variables are discussed through introductory programs. The programs are available in both languages, and although they do not need to be followed, they are expository in nature.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in CSV, XLS, and other formats is elaborated next. Writing data/objects from R for other languages is considered, and the chapter concludes with a dialogue on R session management. Python basics, mathematical operations, and other essential operations are explained. Reading data from external files in different formats is also illustrated, along with the session management required.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques for the bar chart, dot chart, spine and mosaic plots, and fourfold plot for categorical data, and the histogram, box plot, and scatter plot for continuous/numeric data. A very brief introduction to ggplot2 is also provided here. Generating similar plots using both R and Python is also covered here.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for the preliminary analysis of data. The visualizing techniques of EDA, such as stem-and-leaf and letter values, and the modeling techniques of resistant line, smoothing data, and median polish provide rich insight as a preliminary analysis step. This chapter is driven mainly in R.
Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for parameters of interest are developed using functions defined for specific problems. The chapter also considers the important statistical tests of z-test and t-test for comparison of means, and chi-square tests and F-test for comparison of variances. The reader will learn how to create new R and Python functions.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and such details are verified using validation techniques. A model may be affected by a single observation, or a single output value, or an explanatory variable. Statistical metrics that help remove one or more types of anomalies are discussed in depth. Given a large number of covariates, an efficient model is developed using model selection techniques. While the stats core R package suffices, the statsmodels package in Python is very useful.
Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, which lead to an improved model. ROC curves are discussed next, which help in identifying a better classification model. The R packages pscl and ROCR are useful, while pysal and sklearn are useful in Python.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting, which arises from the use of models developed in the previous two chapters. Ridge regression significantly reduces the probability of an overfit model, and the development of natural spline models also lays the basis for the models considered in the next chapter. Regularization in R is achieved using the packages ridge and MASS, while sklearn and statsmodels help in Python.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using raw R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism. The pruning procedure is illustrated in one of the languages, and the reader should explore how to do the same in the other.
Chapter 10, CART and Beyond, considers two enhancements to CART, using bagging and random forests. A consolidation of all the models from Chapter 6, Linear Regression Analysis, to Chapter 10, CART and Beyond, is also provided through a dataset. Ensemble methods are fast emerging as a very effective and popular machine learning technique, and doing it in both languages will improve users' confidence.
What you need for this book
You will need the following to work with the examples in this book:
R
Python
RStudio
Who this book is for
If you want to have a brief understanding of the nature of data and perform advanced statistical analysis using both R and Python, then this book is what you need. No prior knowledge is required. It is for aspiring data scientists, R users trying to learn Python, and Python users trying to learn R.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “We can include other contexts through the use of the
Any command-line input or output is written as follows:
sudo apt-get update
sudo apt-get install python3.6
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Clicking the Next button moves you to the next screen.”
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book’s title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you’re looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book’s webpage at the Packt Publishing website. This page can be accessed by entering the book’s name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Statistical-Application-Development-with-R-and-Python-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Chapter 1. Data Characteristics
Data consists of observations across different types of variables, and it is vital that any data analyst understands these intricacies at the earliest stage of exposure to statistical analysis. This chapter recognizes the importance of data and begins with a template of a dummy questionnaire and then proceeds with the nitty-gritty of the subject. We will then explain how uncertainty creeps into the domain of computer science. The chapter closes with coverage of important families of discrete and continuous random variables.
We will cover the following topics:
Identification of the main variable types as nominal, categorical, and continuous variables
The uncertainty arising in many real experiments
R installation and packages
The mathematical form of discrete and continuous random variables and their applications
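As a small preview of the last topic, here is a minimal Python sketch that contrasts a discrete random variable with a continuous one. It is not taken from the book's code bundle: the function name discrete_uniform_pmf and the six-sided die example are illustrative assumptions, and only the Python standard library is used.

```python
import random
from fractions import Fraction


def discrete_uniform_pmf(k, n):
    """Return P(X = k) when X is uniform on the integers 1, ..., n."""
    return Fraction(1, n) if 1 <= k <= n else Fraction(0)


# A discrete random variable puts probability mass on isolated points.
# For a fair six-sided die the masses sum to one and the mean is 7/2.
total_mass = sum(discrete_uniform_pmf(k, 6) for k in range(1, 7))
die_mean = sum(k * discrete_uniform_pmf(k, 6) for k in range(1, 7))

# Simulated observations: integers for the discrete case,
# real numbers in [0, 1] for the continuous uniform case.
random.seed(0)  # fixed seed so that reruns give the same draws
discrete_draws = [random.randint(1, 6) for _ in range(5)]
continuous_draws = [random.uniform(0.0, 1.0) for _ in range(5)]

print(total_mass, die_mean)
print(discrete_draws)
print(continuous_draws)
```

Exact fractions are used for the probabilities so that the total mass and the mean come out without floating-point rounding; a continuous variable has no such point masses, so it is only illustrated here through simulated draws.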