Logistic Regression Explained from Scratch (Visually, Mathematically and Programmatically)
Hands-on Vanilla Modelling Part III
Abhibhav Sharma 18 hours ago · 14 min read
Image by Author
A plethora of results appears on a quick Google search for "Logistic Regression". Sometimes it gets very confusing for beginners in data science to get around the main idea behind logistic regression, and why wouldn't they be confused? Every tutorial, article, or forum has a different narration of Logistic Regression (not to mention the legitimately verbose ones). Some of the more sophisticated ones even call it a "Regressor", yet the idea and utility remain unrevealed. Remember that Logistic regression is the basic building block of artificial neural networks, and no (or a fallacious) understanding of it can make it really difficult to understand the more advanced formalisms of data science.
Here, I will try to shed some light on and inside the Logistic Regression model and its formalisms in a very basic manner, in order to give the readers a sense of understanding (hopefully without confusing them). The simplicity offered here comes at the cost of skipping the in-depth details of some crucial aspects; getting into the nitty-gritty of every aspect of Logistic regression would be like diving into a fractal (there would be no end to the discussion). However, for each such concept, I will provide eminent readings/sources that one should refer to.
There are two major branches in the study of Logistic regression: (i) Modelling and (ii) Post-Modelling analysis (using the logistic regression results). While the latter is about measuring effects from the fitted coefficients, I believe that the black-box aspect of logistic regression has always been in its Modelling.
My aim here is to:
1. Elaborate Logistic regression in the most layman way.
2. Discuss the underlying mathematics of two popular optimizers that are employed in Logistic Regression (Gradient Descent and Newton's Method).
3. Create a logistic-regression module from scratch in R for each type of optimizer.
One last thing before we proceed: this entire article is designed keeping the binary classification problem in mind, in order to avoid complexity.
1. The Logistic Regression is NOT A CLASSIFIER
Yes, it is not. It is rather a regression model at the core of its heart. I will depict the what and why of logistic regression while preserving its resonance with linear regression. A simple linear regression predicts a "value" of the targeted variable through a linear combination of the given features, while, on the other hand, a Logistic regression predicts a "probability value" through a linear combination of the given features plugged inside a logistic function (aka inverse-logit), given as eq(1):
p = σ(z) = 1/(1 + e⁻ᶻ) ……….(eq 1)
Logistic Function (Image by author)
Hence the name logistic regression. This logistic function is a simple strategy to map the linear combination "z", lying in the (-inf, inf) range, to the probability interval [0,1] (in the context of logistic regression, this z will be called the log(odds) or logit, i.e. log(p/(1-p))) (see the above plot). Consequently, Logistic regression is a type of regression where the range of the mapping is confined to [0,1], unlike simple linear regression models, where both the domain and range can take any real value.
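To make this mapping concrete, here is a minimal R sketch (the function name and the z values are illustrative, not from the article) that pushes a few log-odds values from the real line through the logistic function and confirms they land in [0,1]:

logistic <- function(z) 1 / (1 + exp(-z))   # inverse-logit: (-inf, inf) -> [0, 1]

z <- c(-10, -2, 0, 2, 10)    # arbitrary log-odds values
round(logistic(z), 4)
# 0.0000 0.1192 0.5000 0.8808 1.0000   -- all confined to [0, 1]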
A small sample of the data (Image by author)
Consider simple data with one variable and its corresponding binary class, either 0 or 1. The scatter plot of this data looks something like Fig A (left). We see that the data points lie in two extreme clusters. Now, for our prediction modelling, a naive regression line in this scenario gives a nonsense fit (red line in Fig A, right); what we actually need to fit is something like a squiggly line (the curvy "S"-shaped blue rule in Fig A, right) that explains (or correctly separates) the maximum number of data points.
Fig A (Image by author)
Logistic regression is a scheme to search for this most optimum blue squiggly line. First, let's understand what each point on this squiggly line represents: for any variable value projected onto this line, the squiggly line tells the probability of falling in Class 1 (say "p") for that projected variable value. Accordingly, the line tells us that all the bottom points lying on this blue line have zero chance (p=0) of being in class 1, and the top points lying on it have a probability of 1 (p=1) for the same. Now, remember that I mentioned that the logistic (aka inverse-logit) is a strategy to map the infinitely stretching space (-inf, inf) to the probability space [0,1]; conversely, the logit function transforms the probability space [0,1] into a space stretching over (-inf, inf), eq(2) & Fig B.
logit(p) = log(p/(1 − p)) ……….(eq 2)
Fig B: The logit function, given by log(p/(1-p)), maps each probability value in [0,1] to a point on the number line {ℝ} stretching from -infinity to infinity (Image by author)
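A quick R sketch of the round trip between the two spaces (the probability values chosen here are arbitrary, for illustration only): logit() sends probabilities in [0,1] out to the real line, and the logistic function brings them back:

logit    <- function(p) log(p / (1 - p))    # [0,1] -> (-inf, inf)
logistic <- function(z) 1 / (1 + exp(-z))   # (-inf, inf) -> [0,1]

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
z <- logit(p)                # -4.595 -1.099  0.000  1.099  4.595
round(logistic(z), 2)        #  0.01   0.25   0.50   0.75   0.99  (recovered)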
Keeping this in mind, here comes the mantra of logistic regression modelling: transform the class{0,1} vs variable{ℝ} space into the logit{ℝ} vs variable{ℝ} space, fit a straight line there by adjusting its coefficient and slope in order to maximize the Likelihood (a very fancy piece of stuff explained shortly), and then map this line back to probability[0,1] vs variable{ℝ} using the inverse-Logit (aka Logistic function) to obtain the most optimum squiggly line, i.e. the most discriminating rule.

WOW!!!
Well, you may (should) ask: (i) why and how to do this transformation?, (ii) what the heck is Likelihood?, and (iii) how would this scheme lead to the most optimum squiggle? So, for (i), the idea behind transforming a confined probability space [0,1] into an infinitely stretching real space (-inf, inf) is that it makes the fitting problem very close to solving a linear regression, for which we have plenty of optimizers and techniques to fit the most optimum line. The latter questions will be answered eventually.
Now, coming back to our search for the best classifying blue squiggly line, the idea is to first plot an initial linear regression line with arbitrary coefficients in the ⚠ logit vs variable ⚠ coordinate space, and then adjust the coefficients of this fit to maximize the likelihood (relax!! I will explain the "likelihood" when it is needed).
In our one-variable case, we can write equation 3:

logit(p) = log(p/(1 − p)) = β₀ + β₁·v ……….(eq 3)
Fig C (Image by author)
In Fig C (I), the red line is our arbitrarily chosen regression line fitted to the data points, mapped in a different coordinate system, with β₀ (intercept) = -20 and β₁ (slope) = 3.75.
⚠ Note that the coordinate space is not class{0,1} vs variable{ℝ} but logit{ℝ} vs variable{ℝ}. ⚠ The transition from Fig A (right) to Fig C (I) has no effect on the positional preferences of the points, except at the extremes: as in equation 2 above, logit(0) = -infinity and logit(1) = +infinity.
At this point, let me reiterate our objective: we want to fit a straight line to the data points in the logit vs variable plot in such a way that it explains (correctly separates) the maximum number of data points when it gets converted to the blue squiggly line through the inverse-logit (aka logistic function), eq(1). To achieve the best regression, a strategy similar to simple linear regression comes into play, but instead of minimizing the squared residuals, the idea is to maximize the likelihood (relax!!). Since points scattered at infinity make it difficult to proceed with an orthodox linear regression method, the trick is to project these points onto the logit line (the initially chosen/fitted line with arbitrary coefficients), Fig C (II). In this way, each data point projected onto the logit line corresponds to a logit value. When these logit values are plugged into the logistic function, eq(1), we get their probability of falling in class 1, Fig C (III).
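As a quick illustration (a sketch, not the article's plotting code), here is how the straight line with the figure's β₀ = -20 and β₁ = 3.75 in logit space turns into the S-shaped probability curve once pushed through the logistic function; the grid of variable values is an assumption:

b0 <- -20; b1 <- 3.75                       # intercept and slope from Fig C (I)
v  <- seq(0, 12, by = 0.1)                  # an assumed grid of variable values

logit_line <- b0 + b1 * v                   # straight line in logit vs variable space
p_curve    <- 1 / (1 + exp(-logit_line))    # squiggly "S" curve in probability vs variable space

plot(v, p_curve, type = "l", xlab = "variable", ylab = "P(class = 1)")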
Note: this can also be proven mathematically: for a point with logit value βᵀx, the probability of belonging to class 1 is σ(βᵀx). This probability can be represented mathematically as equation 4, which is very close to a Bernoulli distribution, isn't it?

P(Y = y | X = x) = σ(βᵀx)ʸ · [1 − σ(βᵀx)]⁽¹⁻ʸ⁾ ……….eq(4)
The equation reads that, for a given data instance x, the probability of the label Y being y (where y is either 0 or 1) is equal to the logistic of the logit when y = 1, and is equal to (1 − logistic of the logit) when y = 0. These new probability values are illustrated in our class{0,1} vs variable{ℝ} space as blue dots in Fig C (III). This new probability value for a data point is what we call the LIKELIHOOD of that data point. So, in simple terms, the likelihood is the probability value of the data point, where the probability value indicates how LIKELY the point is to fall in the class 1 category. And the likelihood of the training labels for the fitted weight vector β is nothing but the product of each of these newfound probability values, equations 5 & 6.
L(β) = ⁿ∏ᵢ₌₁ P(Y = y⁽ⁱ⁾ | X = x⁽ⁱ⁾) ……….eq(5)

Substituting equation 4 into equation 5, we get:

L(β) = ⁿ∏ᵢ₌₁ σ(βᵀx⁽ⁱ⁾)ʸ⁽ⁱ⁾ · [1 − σ(βᵀx⁽ⁱ⁾)]⁽¹⁻ʸ⁽ⁱ⁾⁾ ……….eq(6)
The idea is to estimate the parameters β such that they maximize L(β). However, for mathematical convenience, we maximize the log of L(β) instead and call it the log-likelihood, equation 7:

LL(β) = log L(β) = ⁿ∑ᵢ₌₁ y⁽ⁱ⁾ log σ(βᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log[1 − σ(βᵀx⁽ⁱ⁾)] ……….eq(7)
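A small R sketch of equations 5-7 (the toy data and the starting β below are made up for illustration, not from the article): the likelihood is the product of per-point probabilities, and its log is the sum the optimizers will maximize:

logistic <- function(z) 1 / (1 + exp(-z))

x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix: intercept column + one feature
y    <- c(0, 0, 0, 1, 1)                       # toy binary labels
beta <- c(-5, 1)                               # an arbitrary starting parameter vector

p  <- logistic(x %*% beta)                     # P(Y = 1 | x) for each data point
L  <- prod(ifelse(y == 1, p, 1 - p))           # eq(5)/eq(6): product of per-point probabilities
LL <- sum(y * log(p) + (1 - y) * log(1 - p))   # eq(7): log-likelihood
c(likelihood = L, log_likelihood = LL)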
So at this point, I hope our earlier stated objective is much more understandable: find the best-fitting parameters β in the logit vs variable space such that LL(β) in the probability vs variable space is maximized. There is no closed-form solution for this, so in the next section I will touch upon two optimization methods, (1) Gradient Descent and (2) Newton's Method, to find the optimum parameters.
2. Optimizers
Our optimization first requires the partial derivative of the log-likelihood function. So let me shamelessly share the snap from a very eminent lecture note that beautifully elucidates the steps to derive the partial derivative of LL(β). (Note: the calculations shown there use θ in place of β to represent the parameters.)
To update the parameters, the step toward the global maximum is:

βⱼ := βⱼ + η · ∂LL(β)/∂βⱼ = βⱼ + η · ⁿ∑ᵢ₌₁ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·xⱼ⁽ⁱ⁾

where η is the learning rate.
So the algorithm is:

Initialize β and set likelihood = 0
While likelihood ≤ max(likelihood) {
    Calculate logit = Xβ
    Calculate P = logistic(logit) = 1/(1 + exp(−Xβ))
    Calculate likelihood = ∏ᵢ P(Y = y⁽ⁱ⁾ | x⁽ⁱ⁾, β)
    Calculate the first derivative = Xᵀ(Y − P)
    Update β = β + η · first derivative
}
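Spelled out on the same toy data as before (illustrative values only, not the article's data), a single gradient step looks like this:

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix (intercept + one feature)
y    <- c(0, 0, 0, 1, 1)                       # toy labels
beta <- c(-5, 1); lr <- 0.01                   # assumed starting parameters and learning rate

p       <- logistic(x %*% beta)   # current predicted probabilities
first_d <- t(x) %*% (y - p)       # gradient of the log-likelihood: Xᵀ(Y − P)
beta    <- beta + lr * first_d    # one gradient step toward the maximum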
Newton's Method
Newton's Method is another strong candidate among all the available optimizers. We learned Newton's Method in our early lessons as an algorithm to stepwise find the maximum/minimum point of a concave/convex function:
xₙ₊₁ = xₙ − [∇∇f(xₙ)]⁻¹ ∇f(xₙ)
In our multi-parameter case, ∇∇f is the Hessian matrix H, i.e. the second-order partial derivative of LL(β). Well, I will skip the details here, but your curious brain should refer to this. So the ultimate expression to update the parameters, in this case, is given by:

β := β − H⁻¹ ∇LL(β)
Here in the case of logistic regression, the calculation of H is super easybecause:
H = ∇∇LL(β) = ∇ ⁿ∑ᵢ₌₁ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·x⁽ⁱ⁾
  = −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾(∇pᵢ)ᵀ = −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾ pᵢ(1 − pᵢ) (x⁽ⁱ⁾)ᵀ
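In R, with an intercept-augmented design matrix X and current probabilities p, this Hessian is just −XᵀWX with W = diag(pᵢ(1 − pᵢ)); the values below are illustrative, not from the article:

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix
beta <- c(-5, 1)                               # assumed current parameters

p <- as.vector(logistic(x %*% beta))   # p_i = sigma(beta' x_i)
w <- diag(p * (1 - p))                 # W: diagonal matrix of p_i(1 - p_i)
H <- -t(x) %*% w %*% x                 # Hessian of the log-likelihood (negative semi-definite)
H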
So the algorithm is:

Initialize β and set likelihood = 0
While likelihood ≤ max(likelihood) {
    Calculate logit = Xβ
    Calculate P = logistic(logit) = 1/(1 + exp(−Xβ))
    Calculate likelihood = ∏ᵢ P(Y = y⁽ⁱ⁾ | x⁽ⁱ⁾, β)
    Calculate the first derivative = Xᵀ(Y − P)
    Calculate the Hessian H = −Xᵀ diag(P(1 − P)) X
    Update β = β − H⁻¹ · first derivative
}
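A single (damped) Newton update in R might look like the sketch below; adding the identity matrix before inverting mirrors the Levenberg-Marquardt-style safeguard used in the implementation further down (toy values again, not the article's data):

logistic <- function(z) 1 / (1 + exp(-z))
x    <- cbind(1, c(1.0, 2.5, 4.0, 6.0, 7.5))   # toy design matrix
y    <- c(0, 0, 0, 1, 1)                       # toy labels
beta <- c(-5, 1)                               # assumed current parameters

p        <- as.vector(logistic(x %*% beta))
first_d  <- t(x) %*% (y - p)                   # gradient Xᵀ(Y − P)
H        <- -t(x) %*% diag(p * (1 - p)) %*% x  # Hessian
H_damped <- diag(ncol(x)) - H                  # I + XᵀWX: keeps the matrix invertible
beta     <- beta + solve(H_damped) %*% first_d # one damped Newton step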
The code should be easy to follow from the previous section, as I have given a step-wise algorithm for both the optimizers, and my code will strictly follow that order.
The Training Function
setwd("C:/Users/Dell/Desktop") set.seed(111) #to generate the same results as mine
# -Training Function -#
logistic.train<- function(train_data, method, lr, verbose){
b0<-rep(1, nrow(train_data)) x<-as.matrix(cbind(b0, train_data[,1:(ncol(train_data)-1)])) y<- train_data[, ncol(train_data)]
beta<- as.matrix(rep(0.5,ncol(x))); likelihood<-0; epoch<-0
#initiate
beta_all<-NULL beta_at<-c(1,10,50,100,110,150,180,200,300,500,600,800,1000, 1500,2000,4000,5000,6000,10000) #checkpoints (the epochs
at which I will record the betas)
# -Gradient
Descent -#
Trang 21
p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)=
1/(1+exp(-Xβᵀ))
# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta) likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i])))
#product of all the probability }
first_d<- t(x) %*% (y-p)#first derivative of the likelihood function
beta <- beta + lr*first_d #updating the parameters for a step toward maximization
#to see inside the steps of learning (irrelevant to the main working algo)
if(verbose==T){
ifelse(epoch%%200==0, print(paste0(epoch, "th Epoch", " -Likelihood=", round(likelihood,4),
" -log-likelihood=", round(log(likelihood),4),
collapse = "")), NA)}
if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)}
epoch<- epoch+1 }
} # -Newton second order diff method -#
Trang 22
while((likelihood < 0.95) & (epoch<=35000)){
logit<-x%*%beta #Calculate logit(p) = xβᵀ
p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)=
1/(1+exp(-Xβᵀ))
# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta) likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i]))) }
first_d<- t(x) %*% (y-p)#first derivative of the likelihood function
w<-matrix(0, ncol= nrow(x), nrow = nrow(x)) #initializing p(1- p) diagonal matrix
diag(w)<-p*(1-p) hessian<- -t(x) %*% w %*% x #hessian matrix
hessian<- diag(ncol(x))-hessian #Levenberg-Marquardt method:
Add a scaled identity matrix to avoid singularity issues
k<- solve(hessian) %*% (t(x) %*% (y-p)) #the gradient for newton method
beta <- beta + k #updating the parameters for a step toward maximization
if(verbose==T){
ifelse(epoch%%200==0, print(paste0(epoch, "th Epoch", " -Likelihood=",
Trang 23if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)} #just
to inside the learning epoch<- epoch+1 }
} else(break)
beta_all<-cbind(beta_all, beta) colnames(beta_all)<-c(beta_at[1:(ncol(beta_all)-1)], epoch-1)
mylist<-list(as.matrix(beta), likelihood, beta_all) names(mylist)<- c("Beta", "likelihood", "Beta_all") return(mylist)
} # Fitting of logistic model
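The snippet above ends at the "Fitting of logistic model" marker; a call to the trainer would look something like the following (the train_data object, the learning rate, and the verbose flag are assumptions for illustration; lr is ignored by the Newton branch):

# assumed: train_data is a data frame whose last column holds the 0/1 labels
gd_model <- logistic.train(train_data, method = "gradient", lr = 0.001, verbose = TRUE)
nt_model <- logistic.train(train_data, method = "newton",   lr = 0.001, verbose = TRUE)
gd_model$Beta         # fitted coefficients (intercept first)
gd_model$likelihood   # likelihood reached at the final epoch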
The Prediction Function
#--------------------- Prediction Function ---------------------#
logistic.pred <- function(model, test_data){

  test_new <- cbind(rep(1, nrow(test_data)), test_data[, -ncol(test_data)])   # adding 1s to fit the intercept
  beta     <- as.matrix(model$Beta)   # extract the best fitting beta (the beta at the final epoch)
  beta_all <- model$Beta_all          # extract all the betas at the different checkpoints
  ll       <- model$likelihood        # extract the highest likelihood obtained

  log_odd         <- as.matrix(test_new) %*% beta     # logit(p)
  probability     <- 1/(1 + exp(-log_odd))            # p = logistic(logit(p))
  predicted_label <- ifelse(probability >= 0.5, 1, 0) # discrimination rule

  k <- cbind(test_data[, ncol(test_data)], predicted_label)   # actual label vs predicted label
  colnames(k) <- c("Actual", "Predicted")
  k <- as.data.frame(k)

  tp <- length(which(k$Actual == 1 & k$Predicted == 1))   # true positives
  tn <- length(which(k$Actual == 0 & k$Predicted == 0))   # true negatives
  fp <- length(which(k$Actual == 0 & k$Predicted == 1))   # false positives
  fn <- length(which(k$Actual == 1 & k$Predicted == 0))   # false negatives

  cf <- matrix(c(tp, fn, fp, tn), 2, 2, byrow = F)   # confusion matrix (rows = predicted, columns = actual)
  rownames(cf) <- c("1", "0")
  colnames(cf) <- c("1", "0")

  p_list <- list(k, cf, beta, ll, beta_all)
  names(p_list) <- c("predicted", "confusion matrix", "beta", "likelihood", "Beta_all")
  return(p_list)
} # to make predictions from the trained model
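And a hedged sketch of how the prediction function would be used with a trained model (test_data is assumed to have the same column layout as train_data, with the true labels in the last column):

pred <- logistic.pred(gd_model, test_data)   # assumed: gd_model from the training call above
pred$`confusion matrix`                      # 2x2 table of predicted vs actual classes
head(pred$predicted)                         # actual vs predicted labels per test point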
Data Parsing
#importing data