The coefficients in logistic regression, $\beta_0$ and $\beta_1$, are estimated using maximum likelihood estimation (MLE). The core idea behind MLE is to find the parameters that make the observed data most "likely" under the logistic regression model.
In logistic regression, we model the probability that the target variable $y$ is 1 (for example, "approved") given an input $x$ by using the logistic (sigmoid) function:

$$P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
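As a quick illustration, here is a minimal sketch of this predicted probability in Python (using NumPy). The helper names `sigmoid` and `predict_proba`, and the particular coefficient and input values, are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, beta0, beta1):
    """Predicted probability that y = 1 given input x."""
    return sigmoid(beta0 + beta1 * x)

# Hypothetical coefficients and a single input value, for illustration only
print(predict_proba(x=2.0, beta0=-1.0, beta1=0.8))  # about 0.646
```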
MLE tries different combinations of $\beta_0$ and $\beta_1$, and for each combination asks: how likely is it that we would see the actual outcomes in our data, given these parameters?
This is captured by the likelihood function, which multiplies the predicted probabilities for each data point:

$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{\,y_i} \, (1 - p_i)^{\,1 - y_i}$$

where $p_i = P(y_i = 1 \mid x_i)$ is the model's predicted probability for the $i$-th example.
- If $y_i = 1$ (“approved”), we want the model’s predicted probability $p_i$ to be as close to 1 as possible. The term $p_i^{\,y_i}$ addresses this. When the observed $y_i$ is actually “approved”, or 1, the other term $(1 - p_i)^{\,1 - y_i}$ reduces to $(1 - p_i)^0 = 1$ and drops out of the product.
- If $y_i = 0$, we want the predicted probability $p_i$ to be close to 0. The term $(1 - p_i)^{\,1 - y_i}$ handles this case. When the observed $y_i$ is “not approved”, or 0, a good model's $p_i$ will be close to 0, and therefore $1 - p_i$ will be close to 1. Here it is $p_i^{\,y_i} = p_i^0 = 1$ that drops out instead.
So for each data point, we multiply in either $p_i$ or $1 - p_i$, depending on whether the actual label is 1 or 0. The product over all examples gives us a single number: the likelihood of seeing the entire dataset under the current model. The more closely the predicted outcomes (using parameters $\beta_0$ and $\beta_1$) conform to the observed data, the larger the likelihood; MLE picks the parameters that make it as large as possible. The reason we can multiply all the probabilities together is that we assume the outcomes are independent of each other. In other words, one person’s chance of approval should not influence another person’s chance of approval.
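To make the product concrete, here is a minimal sketch in Python that computes this likelihood for a handful of made-up data points and an arbitrary pair of parameters (the numbers are illustrative, not from any real dataset):

```python
import numpy as np

def likelihood(x, y, beta0, beta1):
    """Product over all examples of p_i when y_i = 1, and (1 - p_i) when y_i = 0."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))  # predicted P(y = 1 | x)
    per_example = np.where(y == 1, p, 1.0 - p)      # pick p_i or 1 - p_i
    return np.prod(per_example)

# Toy data: three inputs with their observed labels (1 = approved, 0 = not approved)
x = np.array([0.5, 2.0, -1.0])
y = np.array([0, 1, 0])
print(likelihood(x, y, beta0=-1.0, beta1=0.8))
```

Trying different values of `beta0` and `beta1` and keeping the pair that gives the largest result is, in essence, what MLE does.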
Because this product can get extremely small, we usually work with the log-likelihood, which turns the product into a sum and is easier to compute and optimize.
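Written out, the log-likelihood is $\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]$. The following sketch computes it for the same kind of toy data as above (again, the values are illustrative only):

```python
import numpy as np

def log_likelihood(x, y, beta0, beta1):
    """Sum of log(p_i) where y_i = 1 and log(1 - p_i) where y_i = 0."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([0.5, 2.0, -1.0])
y = np.array([0, 1, 0])
print(log_likelihood(x, y, beta0=-1.0, beta1=0.8))
```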
To find the values of $\beta_0$ and $\beta_1$ that maximize the log-likelihood, we use gradient descent on the negative log-likelihood (equivalently, gradient ascent on the log-likelihood), an iterative optimization algorithm. At each step, we compute how the log-likelihood changes with respect to each parameter (that is, its gradient), and then update the parameters slightly in the direction that increases the likelihood. Over time, this process converges toward the values of $\beta_0$ and $\beta_1$ that best fit the data.
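Here is a minimal sketch of that loop, assuming the standard gradients of the log-likelihood for this model, $\partial \ell / \partial \beta_0 = \sum_i (y_i - p_i)$ and $\partial \ell / \partial \beta_1 = \sum_i (y_i - p_i)\, x_i$; the learning rate, step count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, n_steps=2000):
    """Estimate beta0 and beta1 by gradient ascent on the log-likelihood."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))  # current predictions
        error = y - p                                    # residuals y_i - p_i
        beta0 += lr * np.sum(error)       # step along the gradient for the intercept
        beta1 += lr * np.sum(error * x)   # step along the gradient for the slope
    return beta0, beta1

# Toy data (illustrative only): larger x tends to mean "approved"
x = np.array([-1.0, 0.5, 1.0, 1.5, 2.0, 3.0])
y = np.array([0, 0, 1, 0, 1, 1])
print(fit_logistic(x, y))
```

Each update nudges the parameters in the direction that increases the log-likelihood; production libraries solve the same problem with more sophisticated optimizers, but the underlying idea is the same.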