Statistical Application Development with R and Python - Second Edition

What this book covers

What you need for this book

Who this book is for

Questionnaire and its components

Understanding the data characteristics in an R environment

Experiments with uncertainty in computer science

Installing and setting up R

Using R packages

RSADBE – the book's R package

Python installation and setup

Using pip for packages

IDEs for R and Python

The companion code bundle

Discrete distributions

Discrete uniform distribution

Packages and settings – R and Python

Understanding data.frame and other formats

Constants, vectors, and matrices

Time for action – understanding constants, vectors, and basic arithmetic

What just happened?

Doing it in Python

Time for action – matrix computations

What just happened?

Doing it in Python

The list object

Time for action – creating a list object

What just happened?

The data.frame object

Time for action – creating a data.frame object

What just happened?

Have a go hero

The table object

Time for action – creating the Titanic dataset as a table object

What just happened?

Have a go hero

Using utils and the foreign packages

Time for action – importing data from external files

What just happened?

Doing it in Python

Importing data from MySQL

Doing it in Python

Exporting data/graphs

Exporting R objects

Exporting graphs

Time for action – exporting a graph

What just happened?

Managing R sessions

Time for action – session management

What just happened?

Doing it in Python

Pop quiz

Summary

3 Data Visualization

Visualization techniques for categorical data

Bar chart

Going through the built-in examples of R

Time for action – bar charts in R

What just happened?

Doing it in Python

Have a go hero

Dot chart

Time for action – dot charts in R

What just happened?

Doing it in Python

Spine and mosaic plots

Time for action – spine plot for the shift and operator data

What just happened?

Time for action – mosaic plot for the Titanic dataset

What just happened?

Pie chart and the fourfold plot

Visualization techniques for continuous variable data

Boxplot

Time for action – using the boxplot

What just happened?

Time for action – plot and pairs R functions

What just happened?

Doing it in Python

Have a go hero

Pareto chart

A brief peek at ggplot2

Time for action – qplot

What just happened?

Time for action – ggplot

What just happened?

Pop quiz

Summary

4 Exploratory Analysis

Essential summary statistics

Percentiles, quantiles, and median

Hinges

Interquartile range

Time for action – the essential summary statistics for The Wall dataset

What just happened?

Techniques for exploratory analysis

The stem-and-leaf plot

Time for action – the stem function in play

What just happened?

Time for action – the bagplot display for multivariate datasets

What just happened?

Resistant line

Time for action – resistant line as a first regression model

What just happened?

Smoothing data

Time for action – smoothening the cow temperature data

What just happened?

Median polish

Time for action – the median polish algorithm

What just happened?

Have a go hero

Summary

5 Statistical Inference

Maximum likelihood estimator

Visualizing the likelihood function

Time for action – visualizing the likelihood function

What just happened?

Doing it in Python

Finding the maximum likelihood estimator

Using the fitdistr function

Time for action – finding the MLE using mle and fitdistr functions

What just happened?

Confidence intervals

Time for action – confidence intervals

What just happened?

Doing it in Python

Hypothesis testing

Binomial test

Time for action – testing probability of success

What just happened?

Tests of proportions and the chi-square test

Time for action – testing proportions

What just happened?

Tests based on normal distribution – one sample

Time for action – testing one-sample hypotheses

What just happened?

Have a go hero

Tests based on normal distribution – two sample

Time for action – testing two-sample hypotheses

What just happened?

Have a go hero

Doing it in Python

Summary

6 Linear Regression Analysis

Packages and settings - R and Python

The essence of regression

The simple linear regression model

What happens to the arbitrary choice of parameters?

Time for action - the arbitrary choice of parameters

What just happened?

Building a simple linear regression model

Time for action - building a simple linear regression model

What just happened?

Have a go hero

ANOVA and the confidence intervals

Time for action - ANOVA and the confidence intervals

What just happened?

Model validation

Time for action - residual plots for model validation

What just happened?

Doing it in Python

Have a go hero

Multiple linear regression model

Averaging k simple linear regression models or a multiple linear regression model

Time for action - averaging k simple linear regression models

What just happened?

Building a multiple linear regression model

Time for action - building a multiple linear regression model

What just happened?

The ANOVA and confidence intervals for the multiple linear regression model

Time for action - the ANOVA and confidence intervals for the multiple linear regression model

What just happened?

Have a go hero

Useful residual plots

Time for action - residual plots for the multiple linear regression model

What just happened?

Regression diagnostics

Leverage points

Influential points

DFFITS and DFBETAS

The multicollinearity problem

Time for action - addressing the multicollinearity problem for the gasoline data

What just happened?

Doing it in Python

Model selection

Stepwise procedures

The backward elimination

The forward selection

The stepwise regression

7 Logistic Regression Model

The binary regression problem

Time for action – limitation of linear regression model

What just happened?

Probit regression model

Time for action – understanding the constants

What just happened?

Doing it in Python

Logistic regression model

Time for action – fitting the logistic regression model

What just happened?

Doing it in Python

Hosmer-Lemeshow goodness-of-fit test statistic

Time for action – Hosmer-Lemeshow goodness-of-fit statistic

What just happened?

Model validation and diagnostics

Residual plots for the GLM

Time for action – residual plots for logistic regression model

What just happened?

Doing it in Python

Have a go hero

Influence and leverage for the GLM

Time for action – diagnostics for the logistic regression

What just happened?

Have a go hero

Receiving operator curves

Time for action – ROC construction

What just happened?

Doing it in Python

Logistic regression for the German credit screening dataset

Time for action – logistic regression for the German credit dataset

What just happened?

Doing it in Python

Have a go hero

Summary

8 Regression Models with Regularization

The overfitting problem

Time for action – understanding overfitting

What just happened?

Doing it in Python

Have a go hero

Regression spline

Basis functions

Piecewise linear regression model

Time for action – fitting piecewise linear regression models

What just happened?

Natural cubic splines and the general B-splines

Time for action – fitting the spline regression models

What just happened?

Ridge regression for linear models

Protecting against overfitting

Time for action – ridge regression for the linear regression model

What just happened?

Doing it in Python

Ridge regression for logistic regression models

Time for action – ridge regression for the logistic regression model

What just happened?

Another look at model assessment

Time for action – selecting iteratively and other topics

What just happened?

Pop quiz

Summary

9 Classification and Regression Trees

Understanding recursive partitions

Time for action – partitioning the display plot

What just happened?

Splitting the data

The first tree

Time for action – building our first tree

What just happened?

Constructing a regression tree

Time for action – the construction of a regression tree

What just happened?

Constructing a classification tree

Time for action – the construction of a classification tree

What just happened?

Doing it in Python

Classification tree for the German credit data

Time for action – the construction of a classification tree

What just happened?

Doing it in Python

Have a go hero

Pruning and other finer aspects of a tree

Time for action – pruning a classification tree

What just happened?

Pop quiz

Summary

10 CART and Beyond

Improving the CART

Time for action – cross-validation predictions

What just happened?

Understanding bagging

The bootstrap

Time for action – understanding the bootstrap technique

What just happened?

How the bagging algorithm works

Time for action – the bagging algorithm

What just happened?

retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: August 2017

ISBN 978-1-78862-119-9

www.packtpub.com

About the Author

Prabhanjan Narayanachar Tattar has a combined twelve years of experience with R and Python software. He has also authored the books A Course in Statistics with R, Wiley, and Practical Data Science Cookbook, Packt. The author has built three packages in R titled gpk, RSADBE, and ACSWR. He has obtained a PhD (statistics) from Bangalore University under the broad area of survival analysis and published several articles in peer-reviewed journals. During the PhD program, the author received the young Statistician honors for the IBS(IR)-GK Shukla Young Biometrician Award (2005) and the Dr. U.S. Nair Award for Young Statistician (2007) and also held a Junior and Senior Research Fellowship at CSIR-UGC.

Prabhanjan has worked in various positions in the analytical industry and has nearly 10 years of experience in using statistical and machine learning techniques.

I would like to thank the readers and reviewers of the first edition; it is thanks to their constructive criticism that a second edition has been possible. The R and Python open source community deserves huge applause for making the software so complete that it is almost akin to rubbing a magical lamp.

I continue to express my gratitude to all the people mentioned in the previous edition. My family has been at the forefront as always in extending their cooperation, and whenever I am working on a book, they understand the weekends would have to be spent on the idiot box.

Profs. D. D. Pawar and V. A. Jadhav were my first two Statistics teachers, and I learnt my first craft from them during 1996-99 at the Department of Statistics, Science College, Nanded. Prof. Pawar had been very kind and generous towards me and invited me in March 2015 to deliver some R talks from the first edition. Even 20 years later they are the flag-bearers of the subject in the Marathawada region, and it is with profound love and affection that I express my gratitude to both of them. Thank you a lot, sirs.

It was a mere formal dinner meeting with Tushar Gupta in Chennai a month ago, and we thought of getting the second edition. We both were convinced that if we work in sync, do parallel publication processing, we would finish this task within a month. And it has been a roller-coaster ride with Menka Bohra, Snehal Kolte, and Dharmendra Yadav that the book is a finished product in a record time. My special thanks to this wonderful Packt team.

About the Reviewers

Dr. Ratnadip Adhikari received his B.Sc. degree with Mathematics Honors from Assam University, India, in 2004 and M.Sc. in applied mathematics from Indian Institute of Technology, Roorkee, in 2006. After that he obtained M.Tech. in Computer Science and Technology and Ph.D. in Computer Science, both from Jawaharlal Nehru University, New Delhi, India, in 2009 and 2014, respectively.

He worked as an Assistant Professor in the Computer Science & Engineering (CSE) Dept. of the LNM Institute of Information Technology (LNMIIT), Jaipur, Rajasthan, India. At present, he works as a Senior Data Scientist at Fractal Analytics, Bangalore, India. His primary research interests include pattern recognition, time series forecasting, data stream classification, and hybrid modeling. The research works of Dr. Adhikari have been published in various reputed international journals and at conferences. He has attended a number of conferences and workshops throughout his academic career.

Ajay Ohri is the founder of Decisionstats.com and has 14 years of work experience as a data scientist. He advises multiple startups in analytics off-shoring, analytics services, and analytics education, as well as using social media to enhance buzz for analytics products. Mr. Ohri's research interests include spreading open source analytics, analyzing social media manipulation with mechanism design, simpler interfaces for cloud computing, investigating climate change, and knowledge flows.

He founded Decisionstats.com in 2007, a blog which has gathered more than 100,000 views annually for the past 7 years.

His other books include R for Business Analytics (Springer 2012), R for Cloud Computing (Springer 2014), and Python for R Users (Wiley 2017).

Abhinav Rai has been working as a Data Scientist for nearly a decade, currently working at Microsoft. He has experience working in telecom, retail marketing, and online advertisement. His areas of interest include the evolving techniques of machine learning and the associated technologies. He is especially interested in analyzing large and humongous datasets and likes to generate deep insights in such scenarios. Academically, he holds a double master's degree, in Mathematics from Deendayal Upadhyay Gorakhpur University with an NBHM scholarship and in Computer Science from Indian Statistical Institute. Rigor and sophistication is a surety with his analytical deliveries.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <customercare@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788621190.

If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

R and Python are interchangeably required languages these days for anybody engaged with data analysis. The growth of these two languages and their inter-dependency creates a natural requirement to learn them both. Thus, it was natural where the second edition of my previous title, R Statistical Application Development by Example, was headed. I took this opportunity to add Python as an important layer, and hence you will find Doing it in Python sections spread throughout the book. The book is now useful on many fronts: for those who need to learn both languages, and for those who use R and need to switch to Python, or vice versa. While the abstract development of ideas and algorithms has been retained in R only, the standard and more commonly required data analysis techniques are now available in both languages. The only reason for not providing the Python parallel everywhere is to keep the book from becoming too bulky.

The open source language R is fast becoming one of the preferred companions for statistics, even as the subject continues to add many friends in machine learning, data mining, and so on among its already rich scientific network. The era of mathematical theory and statistical application embeddedness is truly a remarkable one for society, and R and Python have played a pivotal role in it. This book is a humble attempt at presenting statistical models through R for any reader who has a bit of familiarity with the subject. In my experience of practicing the subject with colleagues and friends from different backgrounds, I realized that many are interested in learning the subject and applying it in their domain, which enables them to take appropriate decisions in analyses that involve uncertainty. A decade earlier my friends would have been content with being pointed to a useful reference book. Not so anymore! The work in almost every domain is done through computers, and naturally they have their data available in spreadsheets, databases, and sometimes in plain text format. The request for an appropriate statistical model is invariably followed by a one-word question: software? My answer to them has always been a single-letter reply: R! Why? It is really a very simple decision, and it has been my companion over the last seven years. In this book, this experience has been converted into detailed chapters and a cleaner breakup of model building in R.

A by-product of my interactions with colleagues and friends who are all aspiring statistical model builders has been that I have been able to pick up the trough of their learning curve of the subject. The first attempt towards fixing the hurdle has been to introduce the fundamental concept that beginners are most familiar with, which is data. The difference is simply in the subtleties, and as such I firmly believe that introducing the subject on their turf motivates readers a long way in their journey. As with most statistical software, R provides modules and packages which cover many of the recently invented statistical methodologies. The first five chapters of the book focus on the fundamental aspects of the subject and the R language, and hence cover R basics, data visualization, exploratory data analysis, and statistical inference.

The foundational aspects are illustrated using interesting examples and set up the framework for the next five chapters. Linear and logistic regression models, being at the forefront, are of paramount importance in applications. The discussion is more generic in nature and the techniques can be easily adapted across different domains. The last two chapters have been inspired by the Breiman school, and hence the modern method of using classification and regression trees has been developed in detail and illustrated through a practical dataset.

What this book covers

Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of the installation of R and Python and their related packages. Discrete and continuous random variables are discussed through introductory programs. The programs are available in both languages; they do not need to be followed closely, as they are more expository in nature.

Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in CSV, XLS, and other formats is elaborated next. Writing data/objects from R for other languages is considered, and the chapter concludes with a dialogue on R session management. Python basics, mathematical operations, and other essential operations are explained. Reading data from different formats of external files is also illustrated, along with the session management required.

Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques for the bar chart, dot chart, spine and mosaic plots, and fourfold plot for categorical data, and the histogram, box plot, and scatter plot for continuous/numeric data. A very brief introduction to ggplot2 is also provided here. Generating similar plots using both R and Python is also covered here.

Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for the preliminary analysis of data. The visualizing techniques of EDA, such as stem-and-leaf and letter values, and the modeling techniques of resistant line, smoothing data, and median polish provide rich insight as a preliminary analysis step. This chapter is driven mainly in R.

Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for parameters of interest are developed using functions defined for specific problems. The chapter also considers the important statistical tests of the z-test and t-test for comparison of means, and chi-square tests and the F-test for comparison of variances. The reader will learn how to create new R and Python functions.

Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and such details are verified using validation techniques. A model may be affected by a single observation, or a single output value, or an explanatory variable. Statistical metrics which help remove one or more types of anomalies are discussed in depth. Given a large number of covariates, the efficient model is developed using model selection techniques. While the stats core R package suffices, the statsmodels package in Python is very useful.

Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, which lead to an improved model. ROC curves are discussed next, which help in identifying a better classification model. The R packages pscl and ROCR are useful, while pysal and sklearn are useful in Python.

Chapter 8, Regression Models with Regularization, discusses the problem of overfitting, which arises from the use of models developed in the previous two chapters. Ridge regression significantly reduces the probability of an overfit model, and the development of natural spline models also lays the basis for the models considered in the next chapter. Regularization in R is achieved using the packages ridge and MASS, while sklearn and statsmodels help in Python.

Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using raw R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism. The pruning procedure is illustrated in one of the languages, and the reader should explore how to do the same in the other.

Chapter 10, CART and Beyond, considers two enhancements to CART, using bagging and random forests. A consolidation of all the models from Chapter 6, Linear Regression Analysis, to Chapter 10, CART and Beyond, is also provided through a dataset. Ensemble methods are fast emerging as a very effective and popular machine learning technique, and doing it in both languages will improve users' confidence.

What you need for this book

You will need the following to work with the examples in this book:

R

Python

RStudio

Who this book is for

If you want to have a brief understanding of the nature of data and perform advanced statistical analysis using both R and Python, then this book is what you need. No prior knowledge is required. It is aimed at aspiring data scientists, R users trying to learn Python, and Python users trying to learn R.

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “We can include other contexts through the use of the

Any command-line input or output is written as follows:

sudo apt-get update

sudo apt-get install python3.6

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Clicking the Next button moves you to the next screen.”

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

2. Hover the mouse pointer on the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you’re looking to download the code files.

6. Choose from the drop-down menu where you purchased this book from.

7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book’s webpage at the Packt Publishing website. This page can be accessed by entering the book’s name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Statistical-Application-Development-with-R-and-Python-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.

Chapter 1 Data Characteristics

Data consists of observations across different types of variables, and it is vital that any data analyst understands these intricacies at the earliest stage of exposure to statistical analysis. This chapter recognizes the importance of data and begins with a template of a dummy questionnaire and then proceeds with the nitty-gritties of the subject. We will then explain how uncertainty creeps into the domain of computer science. The chapter closes with coverage of important families of discrete and continuous random variables.

We will cover the following topics:

Identification of the main variable types as nominal, categorical, and continuous variables

The uncertainty arising in many real experiments

R installation and packages

The mathematical form of discrete and continuous random variables and their applications
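
The last topic in the list refers to the mathematical form of discrete random variables. As a minimal sketch, the discrete uniform distribution assigns probability 1/n to each of n equally likely outcomes. The fair six-sided die, the function name discrete_uniform_pmf, and the use of Python's fractions module are illustrative assumptions here and are not taken from the book's code bundle:

```python
from fractions import Fraction


def discrete_uniform_pmf(k, n):
    """Return P(X = k) for X uniform on {1, ..., n}; zero outside the support."""
    return Fraction(1, n) if 1 <= k <= n else Fraction(0)


n = 6  # a fair six-sided die, chosen only for illustration

# Probability mass function over the whole support.
pmf = {k: discrete_uniform_pmf(k, n) for k in range(1, n + 1)}

# The probabilities must sum to one.
total = sum(pmf.values())

# Expected value E[X] = sum of k * P(X = k), which equals (n + 1) / 2.
expected = sum(k * p for k, p in pmf.items())

print(total, expected)
```

Exact fractions are used instead of floating-point numbers so that the check that the probabilities sum to one holds exactly rather than approximately.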
