Logistic Regression Explained from Scratch (Visually, Mathematically and Programmatically)
Hands-on Vanilla Modelling Part III
Abhibhav Sharma 18 hours ago · 14 min read
Image by Author
A plethora of results appears on a quick Google search for "Logistic Regression". Sometimes it gets very confusing for beginners in data science to get around the main idea behind logistic regression, and why wouldn't they be confused? Every tutorial, article, or forum has a different narration of Logistic Regression (not to mention the legitimately verbose ones). Some of the more sophisticated ones even call it a "Regressor", yet the idea and utility remain unrevealed. Remember that Logistic regression is the basic building block of artificial neural networks, and no (or a fallacious) understanding of it can make it really difficult to understand the more advanced formalisms of data science.
Here, I will try to shed some light on and inside the Logistic Regression model and its formalisms in a very basic manner, in order to give the readers a sense of understanding (hopefully without confusing them). The simplicity offered here comes at the cost of skipping the in-depth details of some crucial aspects; getting into the nitty-gritty of every aspect of Logistic regression would be like diving into a fractal (there would be no end to the discussion). However, for each such concept, I will provide eminent readings/sources that one should refer to.
There are two major branches in the study of Logistic regression: (i) Modelling and (ii) Post-Modelling analysis (using the logistic regression results). While the latter is about measuring effects from the fitted coefficients, I believe that the black-box aspect of logistic regression has always been in its Modelling.
My aim here is to:
1. Elaborate Logistic regression in the most layman way.
2. Discuss the underlying mathematics of two popular optimizers that are employed in Logistic Regression (Gradient Descent and Newton's Method).
3. Create a logistic-regression module from scratch in R for each type of optimizer.
One last thing before we proceed: this entire article is designed keeping the binary classification problem in mind, in order to avoid complexity.
1. The Logistic Regression is NOT A CLASSIFIER
Yes, it is not. It is rather a regression model at the core of its heart. I will depict the what and why of logistic regression while preserving its resonance with linear regression. A simple linear regression predicts a "value" of the targeted variable through a linear combination of the given features, while, on the other hand, a Logistic regression predicts a "probability value" through a linear combination of the given features plugged inside a logistic function (aka inverse-logit), given as eq(1):
p = σ(z) = 1/(1 + e⁻ᶻ) ……….(eq 1)
Logistic Function (Image by author)
Hence the name logistic regression. This logistic function is a simple strategy to map the linear combination "z", lying in the (-inf, inf) range, to the probability interval [0,1] (in the context of logistic regression, this z will be called the log(odds) or logit, i.e. log(p/(1-p))) (see the above plot). Consequently, Logistic regression is a type of regression where the range of the mapping is confined to [0,1], unlike simple linear regression models, where both the domain and range can take any real value.
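To make this mapping concrete, here is a minimal R sketch (the function name and the z values are illustrative, not from the article) that pushes a few log-odds values from the real line through the logistic function and confirms they land in [0,1]:

logistic <- function(z) 1 / (1 + exp(-z))   # inverse-logit: (-inf, inf) -> [0, 1]

z <- c(-10, -2, 0, 2, 10)    # arbitrary log-odds values
round(logistic(z), 4)
# 0.0000 0.1192 0.5000 0.8808 1.0000   -- all confined to [0, 1]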
A small sample of the data (Image by author)
Consider simple data with one variable and its corresponding binary class, either 0 or 1. The scatter plot of this data looks something like Fig A (left). We see that the data points lie in two extreme clusters. Now, for our prediction modelling, a naive regression line in this scenario gives a nonsense fit (red line in Fig A, right); what we actually need to fit is something like a squiggly line (the curvy "S"-shaped blue rule in Fig A, right) that explains (or correctly separates) the maximum number of data points.
Fig A (Image by author)
Logistic regression is a scheme to search for this most optimum blue squiggly line. First, let's understand what each point on this squiggly line represents: for any variable value projected onto this line, the squiggly line tells the probability of falling in Class 1 (say "p") for that projected variable value. Accordingly, the line tells us that all the bottom points lying on this blue line have zero chance (p=0) of being in class 1, and the top points lying on it have a probability of 1 (p=1) for the same. Now, remember that I mentioned that the logistic (aka inverse-logit) is a strategy to map the infinitely stretching space (-inf, inf) to the probability space [0,1]; conversely, the logit function transforms the probability space [0,1] into a space stretching over (-inf, inf), eq(2) & Fig B.
logit(p) = log(p/(1 − p)) ……….(eq 2)
Fig B: The logit function, given by log(p/(1-p)), maps each probability value in [0,1] to a point on the number line {ℝ} stretching from -infinity to infinity (Image by author)
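A quick R sketch of the round trip between the two spaces (the probability values chosen here are arbitrary, for illustration only): logit() sends probabilities in [0,1] out to the real line, and the logistic function brings them back:

logit    <- function(p) log(p / (1 - p))    # [0,1] -> (-inf, inf)
logistic <- function(z) 1 / (1 + exp(-z))   # (-inf, inf) -> [0,1]

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
z <- logit(p)                # -4.595 -1.099  0.000  1.099  4.595
round(logistic(z), 2)        #  0.01   0.25   0.50   0.75   0.99  (recovered)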
Keeping this in mind, here comes the mantra of logistic regression modelling: transform the class{0,1} vs variable{ℝ} space into the logit{ℝ} vs variable{ℝ} space, fit a straight line there by adjusting its coefficient and slope in order to maximize the Likelihood (a very fancy piece of stuff explained shortly), and then map this line back to probability[0,1] vs variable{ℝ} using the inverse-Logit (aka Logistic function) to obtain the most optimum squiggly line, i.e. the most discriminating rule.

WOW!!!
Well, you may (should) ask: (i) why and how to do this transformation?, (ii) what the heck is Likelihood?, and (iii) how would this scheme lead to the most optimum squiggle? So, for (i), the idea behind transforming a confined probability space [0,1] into an infinitely stretching real space (-inf, inf) is that it makes the fitting problem very close to solving a linear regression, for which we have plenty of optimizers and techniques to fit the most optimum line. The latter questions will be answered eventually.
Now, coming back to our search for the best classifying blue squiggly line, the idea is to first plot an initial linear regression line with arbitrary coefficients in the ⚠ logit vs variable ⚠ coordinate space, and then adjust the coefficients of this fit to maximize the likelihood (relax!! I will explain the "likelihood" when it is needed).
In our one-variable case, we can write equation 3:

logit(p) = log(p/(1 − p)) = β₀ + β₁·v ……….(eq 3)
Fig C (Image by author)
In Fig C (I), the red line is our arbitrarily chosen regression line fitted to the data points, mapped in a different coordinate system, with β₀ (intercept) = -20 and β₁ (slope) = 3.75.
⚠ Note that the coordinate space is not class{0,1} vs variable{ℝ} but logit{ℝ} vs variable{ℝ}. ⚠ The transition from Fig A (right) to Fig C (I) has no effect on the positional preferences of the points, except at the extremes: as in equation 2 above, logit(0) = -infinity and logit(1) = +infinity.
At this point, let me reiterate our objective: we want to fit a straight line to the data points in the logit vs variable plot in such a way that it explains (correctly separates) the maximum number of data points when it gets converted to the blue squiggly line through the inverse-logit (aka logistic function), eq(1). To achieve the best regression, a strategy similar to simple linear regression comes into play, but instead of minimizing the squared residuals, the idea is to maximize the likelihood (relax!!). Since points scattered at infinity make it difficult to proceed with an orthodox linear regression method, the trick is to project these points onto the logit line (the initially chosen/fitted line with arbitrary coefficients), Fig C (II). In this way, each data point projected onto the logit line corresponds to a logit value. When these logit values are plugged into the logistic function, eq(1), we get their probability of falling in class 1, Fig C (III).
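As a quick illustration (a sketch, not the article's plotting code), here is how the straight line with the figure's β₀ = -20 and β₁ = 3.75 in logit space turns into the S-shaped probability curve once pushed through the logistic function; the grid of variable values is an assumption:

b0 <- -20; b1 <- 3.75                       # intercept and slope from Fig C (I)
v  <- seq(0, 12, by = 0.1)                  # an assumed grid of variable values

logit_line <- b0 + b1 * v                   # straight line in logit vs variable space
p_curve    <- 1 / (1 + exp(-logit_line))    # squiggly "S" curve in probability vs variable space

plot(v, p_curve, type = "l", xlab = "variable", ylab = "P(class = 1)")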
Note: this can also be proven mathematically: for a point with logit value βᵀx, the probability of belonging to class 1 is σ(βᵀx). This probability can be represented mathematically as equation 4, which is very close to a Bernoulli distribution, isn't it?

P(Y = y | X = x) = σ(βᵀx)ʸ · [1 − σ(βᵀx)]⁽¹⁻ʸ⁾ ……….eq(4)
The equation reads that, for a given data instance x, the probability of the label Y being y (where y is either 0 or 1) is equal to the logistic of the logit when y = 1, and is equal to (1 − logistic of the logit) when y = 0. These new probability values are illustrated in our class{0,1} vs variable{ℝ} space as blue dots in Fig C (III). This new probability value for a data point is what we call the LIKELIHOOD of that data point. So, in simple terms, the likelihood is the probability value of the data point, where the probability value indicates how LIKELY the point is to fall in the class 1 category. And the likelihood of the training labels for the fitted weight vector β is nothing but the product of each of these newfound probability values, equations 5 & 6.
L(β) = ⁿ∏ᵢ₌₁ P(Y = y⁽ⁱ⁾ | X = x⁽ⁱ⁾) ……….eq(5)

Substituting equation 4 into equation 5, we get:

L(β) = ⁿ∏ᵢ₌₁ σ(βᵀx⁽ⁱ⁾)ʸ⁽ⁱ⁾ · [1 − σ(βᵀx⁽ⁱ⁾)]⁽¹⁻ʸ⁽ⁱ⁾⁾ ……….eq(6)
The idea is to estimate the parameters β such that they maximize L(β). However, for mathematical convenience, we maximize the log of L(β) instead and call it the log-likelihood, equation 7:

LL(β) = log L(β) = ⁿ∑ᵢ₌₁ y⁽ⁱ⁾ log σ(βᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log[1 − σ(βᵀx⁽ⁱ⁾)] ……….eq(7)
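A small R sketch of equations 5-7 (the toy data and the starting β below are made up for illustration, not from the article): the likelihood is the product of per-point probabilities, and its log is the sum the optimizers will maximize:

logistic <- function(z) 1 / (1 + exp(-z))

x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix: intercept column + one feature
y    <- c(0, 0, 0, 1, 1)                       # toy binary labels
beta <- c(-5, 1)                               # an arbitrary starting parameter vector

p  <- logistic(x %*% beta)                     # P(Y = 1 | x) for each data point
L  <- prod(ifelse(y == 1, p, 1 - p))           # eq(5)/eq(6): product of per-point probabilities
LL <- sum(y * log(p) + (1 - y) * log(1 - p))   # eq(7): log-likelihood
c(likelihood = L, log_likelihood = LL)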
So at this point, I hope our earlier stated objective is much more understandable: find the best-fitting parameters β in the logit vs variable space such that LL(β) in the probability vs variable space is maximized. There is no closed-form solution for this, so in the next section I will touch upon two optimization methods, (1) Gradient Descent and (2) Newton's Method, to find the optimum parameters.
2. Optimizers
Our optimization first requires the partial derivative of the log-likelihood function. So let me shamelessly share the snap from a very eminent lecture note that beautifully elucidates the steps to derive the partial derivative of LL(β). (Note: the calculations shown there use θ in place of β to represent the parameters.)
To update the parameters, the step toward the global maximum is:

βⱼ := βⱼ + η · ∂LL(β)/∂βⱼ = βⱼ + η · ⁿ∑ᵢ₌₁ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·xⱼ⁽ⁱ⁾

where η is the learning rate.
So the algorithm is:

Initialize β and set likelihood = 0
While likelihood ≤ max(likelihood) {
    Calculate logit = Xβ
    Calculate P = logistic(logit) = 1/(1 + exp(−Xβ))
    Calculate likelihood = ∏ᵢ P(Y = y⁽ⁱ⁾ | x⁽ⁱ⁾, β)
    Calculate the first derivative = Xᵀ(Y − P)
    Update β = β + η · first derivative
}
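Spelled out on the same toy data as before (illustrative values only, not the article's data), a single gradient step looks like this:

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix (intercept + one feature)
y    <- c(0, 0, 0, 1, 1)                       # toy labels
beta <- c(-5, 1); lr <- 0.01                   # assumed starting parameters and learning rate

p       <- logistic(x %*% beta)   # current predicted probabilities
first_d <- t(x) %*% (y - p)       # gradient of the log-likelihood: Xᵀ(Y − P)
beta    <- beta + lr * first_d    # one gradient step toward the maximum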
Newton's Method
Newton's Method is another strong candidate among all the available optimizers. We learned Newton's Method in our early lessons as an algorithm to stepwise find the maximum/minimum point of a concave/convex function:
xₙ₊₁ = xₙ − [∇∇f(xₙ)]⁻¹ ∇f(xₙ)
In our multi-parameter case, ∇∇f is the Hessian matrix H, i.e. the second-order partial derivative of LL(β). Well, I will skip the details here, but your curious brain should refer to this. So the ultimate expression to update the parameters, in this case, is given by:

β := β − H⁻¹ ∇LL(β)
Here in the case of logistic regression, the calculation of H is super easybecause:
H = ∇∇LL(β) = ∇ ⁿ∑ᵢ₌₁ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·x⁽ⁱ⁾
  = −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾(∇pᵢ)ᵀ = −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾ pᵢ(1 − pᵢ) (x⁽ⁱ⁾)ᵀ
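In R, with an intercept-augmented design matrix X and current probabilities p, this Hessian is just −XᵀWX with W = diag(pᵢ(1 − pᵢ)); the values below are illustrative, not from the article:

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix
beta <- c(-5, 1)                               # assumed current parameters

p <- as.vector(logistic(x %*% beta))   # p_i = sigma(beta' x_i)
w <- diag(p * (1 - p))                 # W: diagonal matrix of p_i(1 - p_i)
H <- -t(x) %*% w %*% x                 # Hessian of the log-likelihood (negative semi-definite)
H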
So the algorithm is:

Initialize β and set likelihood = 0
While likelihood ≤ max(likelihood) {
    Calculate logit = Xβ
    Calculate P = logistic(logit) = 1/(1 + exp(−Xβ))
    Calculate likelihood = ∏ᵢ P(Y = y⁽ⁱ⁾ | x⁽ⁱ⁾, β)
    Calculate the first derivative = Xᵀ(Y − P)
    Calculate the Hessian H = −Xᵀ diag(P(1 − P)) X
    Update β = β − H⁻¹ · first derivative
}
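A single (damped) Newton update in R might look like the sketch below; adding the identity matrix before inverting mirrors the Levenberg-Marquardt-style safeguard used in the implementation further down (toy values again, not the article's data):

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix
y    <- c(0, 0, 0, 1, 1)                       # toy labels
beta <- c(-5, 1)                               # assumed current parameters

p        <- as.vector(logistic(x %*% beta))
first_d  <- t(x) %*% (y - p)                   # gradient Xᵀ(Y − P)
H        <- -t(x) %*% diag(p * (1 - p)) %*% x  # Hessian
H_damped <- diag(ncol(x)) - H                  # I + XᵀWX: keeps the matrix invertible
beta     <- beta + solve(H_damped) %*% first_d # one damped Newton step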
The code should be easy to follow from the previous section, as I have given a step-wise algorithm for both the optimizers, and my code will strictly follow that order.
The Training Function
setwd("C:/Users/Dell/Desktop") set.seed(111) #to generate the same results as mine
# -Training Function -#
logistic.train<- function(train_data, method, lr, verbose){
b0<-rep(1, nrow(train_data)) x<-as.matrix(cbind(b0, train_data[,1:(ncol(train_data)-1)])) y<- train_data[, ncol(train_data)]
beta<- as.matrix(rep(0.5,ncol(x))); likelihood<-0; epoch<-0
#initiate
beta_all<-NULL beta_at<-c(1,10,50,100,110,150,180,200,300,500,600,800,1000, 1500,2000,4000,5000,6000,10000) #checkpoints (the epochs
at which I will record the betas)
# -Gradient
Descent -#
Trang 21
p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)=
1/(1+exp(-Xβᵀ))
# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta) likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i])))
#product of all the probability }
first_d<- t(x) %*% (y-p)#first derivative of the likelihood function
beta <- beta + lr*first_d #updating the parameters for a step toward maximization
#to see inside the steps of learning (irrelevant to the main working algo)
if(verbose==T){
ifelse(epoch%%200==0, print(paste0(epoch, "th Epoch", " -Likelihood=", round(likelihood,4),
" -log-likelihood=", round(log(likelihood),4),
collapse = "")), NA)}
if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)}
epoch<- epoch+1 }
} # -Newton second order diff method -#
Trang 22
while((likelihood < 0.95) & (epoch<=35000)){
logit<-x%*%beta #Calculate logit(p) = xβᵀ
p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)=
1/(1+exp(-Xβᵀ))
# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta) likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i]))) }
first_d<- t(x) %*% (y-p)#first derivative of the likelihood function
w<-matrix(0, ncol= nrow(x), nrow = nrow(x)) #initializing p(1- p) diagonal matrix
diag(w)<-p*(1-p) hessian<- -t(x) %*% w %*% x #hessian matrix
hessian<- diag(ncol(x))-hessian #Levenberg-Marquardt method:
Add a scaled identity matrix to avoid singularity issues
k<- solve(hessian) %*% (t(x) %*% (y-p)) #the gradient for newton method
beta <- beta + k #updating the parameters for a step toward maximization
if(verbose==T){
ifelse(epoch%%200==0, print(paste0(epoch, "th Epoch", " -Likelihood=",
Trang 23if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)} #just
to inside the learning epoch<- epoch+1 }
} else(break)
beta_all<-cbind(beta_all, beta) colnames(beta_all)<-c(beta_at[1:(ncol(beta_all)-1)], epoch-1)
mylist<-list(as.matrix(beta), likelihood, beta_all) names(mylist)<- c("Beta", "likelihood", "Beta_all") return(mylist)
} # Fitting of logistic model
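The snippet above ends at the "Fitting of logistic model" marker; a call to the trainer would look something like the following (the train_data object, the learning rate, and the verbose flag are assumptions for illustration; lr is ignored by the Newton branch):

# assumed: train_data is a data frame whose last column holds the 0/1 labels
gd_model <- logistic.train(train_data, method = "gradient", lr = 0.001, verbose = TRUE)
nt_model <- logistic.train(train_data, method = "newton",   lr = 0.001, verbose = TRUE)
gd_model$Beta         # fitted coefficients (intercept first)
gd_model$likelihood   # likelihood reached at the final epoch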
The Prediction Function
#--------------------- Prediction Function ---------------------#
logistic.pred <- function(model, test_data){

  test_new <- cbind(rep(1, nrow(test_data)), test_data[, -ncol(test_data)])   # adding 1s to fit the intercept
  beta     <- as.matrix(model$Beta)   # extract the best fitting beta (the beta at the final epoch)
  beta_all <- model$Beta_all          # extract all the betas at the different checkpoints
  ll       <- model$likelihood        # extract the highest likelihood obtained

  log_odd         <- as.matrix(test_new) %*% beta     # logit(p)
  probability     <- 1/(1 + exp(-log_odd))            # p = logistic(logit(p))
  predicted_label <- ifelse(probability >= 0.5, 1, 0) # discrimination rule

  k <- cbind(test_data[, ncol(test_data)], predicted_label)   # actual label vs predicted label
  colnames(k) <- c("Actual", "Predicted")
  k <- as.data.frame(k)

  tp <- length(which(k$Actual == 1 & k$Predicted == 1))   # true positives
  tn <- length(which(k$Actual == 0 & k$Predicted == 0))   # true negatives
  fp <- length(which(k$Actual == 0 & k$Predicted == 1))   # false positives
  fn <- length(which(k$Actual == 1 & k$Predicted == 0))   # false negatives

  cf <- matrix(c(tp, fn, fp, tn), 2, 2, byrow = F)   # confusion matrix (rows = predicted, columns = actual)
  rownames(cf) <- c("1", "0")
  colnames(cf) <- c("1", "0")

  p_list <- list(k, cf, beta, ll, beta_all)
  names(p_list) <- c("predicted", "confusion matrix", "beta", "likelihood", "Beta_all")
  return(p_list)
} # to make predictions from the trained model
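And a hedged sketch of how the prediction function would be used with a trained model (test_data is assumed to have the same column layout as train_data, with the true labels in the last column):

pred <- logistic.pred(gd_model, test_data)   # assumed: gd_model from the training call above
pred$`confusion matrix`                      # 2x2 table of predicted vs actual classes
head(pred$predicted)                         # actual vs predicted labels per test point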
Data Parsing
#importing data