Logistic Regression: A Deep Dive

Learn how logistic regression transforms data into actionable insights! From the sigmoid function to cross-entropy loss and multi-class classification, explore its role in solving complex classification tasks.

Table of Contents

  1. Introduction
  2. Working of Logistic Regression
  3. Why Use Logistic Regression Instead of Linear Regression?
  4. Why Is It Called Logistic Regression and Not Logistic Classification?
  5. Logistic Regression for Multi-Class Classification
  6. Code Example
  7. Summary

1. Introduction

Logistic Regression is a statistical method used to predict categorical outcomes. It is widely used for binary classification (e.g., predicting if an email is spam or not), but it can also be extended for multi-class classification problems (e.g., predicting the species of a flower).

In contrast to linear regression, which predicts a continuous output, logistic regression predicts a probability value between 0 and 1, which can then be mapped to a categorical label (such as 0 or 1 for binary classification).


2. Working of Logistic Regression

2.1. Sigmoid Function

The core idea behind logistic regression is to predict the probability of an instance belonging to a certain class. Logistic regression uses the sigmoid function (also known as the logistic function) to transform the output of a linear model into a probability.

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

  • z is the input to the sigmoid function (often a linear combination of the features, i.e., $$z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$$),
  • e is the base of the natural logarithm.

The output of the sigmoid function is always between 0 and 1, which can be interpreted as the probability of the positive class.
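
As a quick illustration, here is a minimal NumPy sketch of the sigmoid function (an illustrative helper, not part of any library):

import numpy as np

def sigmoid(z):
    # Map any real-valued input into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5 -- sits exactly on the decision boundary
print(sigmoid(4))   # ~0.982 -- confidently positive
print(sigmoid(-4))  # ~0.018 -- confidently negative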

2.2. Loss Function – Cross-Entropy

In logistic regression, we use a loss function to measure how well the model’s predictions align with the actual labels. The cross-entropy loss function is commonly used.

The cross-entropy loss for binary classification is defined as:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]$$

Where:

  • $$y^{(i)}$$ is the true label of the i-th training example (0 or 1),
  • $$h_{\theta}(x^{(i)})$$ is the predicted probability for that example (the output of the sigmoid function),
  • m is the number of training examples.

The goal of the logistic regression model is to minimize this loss function, adjusting the model parameters (weights) to make the predicted probabilities as close as possible to the true labels.
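
To make the formula concrete, here is a small, hypothetical NumPy sketch that evaluates the binary cross-entropy loss for a batch of predictions:

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip probabilities away from 0 and 1 to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.6])  # predicted probabilities
print(binary_cross_entropy(y_true, y_prob))  # ~0.27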

2.3. Model Training Process

During the training phase, the model starts with random weights. At each iteration, the model makes predictions for all data points, and the sigmoid function is applied to transform the predictions into probabilities.

The model then computes the cross-entropy loss and adjusts the weights through optimization (typically gradient descent) to minimize this loss. This cycle of predicting and adjusting weights repeats until the model converges to a set of weights that minimizes the loss or a maximum number of iterations is reached.

After training, the model outputs probabilities for each instance. Typically, a threshold of 0.5 is used to convert these probabilities into class labels:

  • If the probability is greater than or equal to 0.5, the predicted class is 1.
  • If the probability is less than 0.5, the predicted class is 0.
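
Putting these pieces together, here is a minimal from-scratch sketch of the training loop described above, using batch gradient descent (the toy dataset and hyperparameters are illustrative assumptions):

import numpy as np

# Toy data: one feature, classes separable around x = 0
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add an intercept column so theta[0] acts as the bias term
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
theta = np.zeros(Xb.shape[1])

lr, n_iters = 0.1, 1000
for _ in range(n_iters):
    probs = 1 / (1 + np.exp(-Xb @ theta))  # sigmoid of the linear model
    grad = Xb.T @ (probs - y) / len(y)     # gradient of the cross-entropy loss
    theta -= lr * grad                     # gradient descent step

# Apply the 0.5 threshold to turn probabilities into class labels
preds = (1 / (1 + np.exp(-Xb @ theta)) >= 0.5).astype(int)
print(preds)  # expected: [0 0 0 1 1 1]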

3. Why Use Logistic Regression Instead of Linear Regression?

Linear regression is designed to predict continuous values, while logistic regression extends it by applying the sigmoid function to convert the raw linear outputs (logits) into probabilities. This enables logistic regression to handle classification tasks effectively by producing valid probabilities, using an appropriate loss function, and modeling decision boundaries. There are two main reasons to prefer logistic regression over linear regression for classification tasks:

  1. Output Range: Linear regression outputs continuous values that are unbounded, meaning predictions can fall outside the range of valid probabilities (0 to 1). Logistic regression applies the sigmoid function, which bounds predictions between 0 and 1, making them interpretable as probabilities for classification tasks (see the sketch after this list).
  2. Error Measurement: Linear regression minimizes the Mean Squared Error (MSE), which is not appropriate for classification problems. Logistic regression minimizes Cross-Entropy Loss, which is specifically designed to measure the divergence between predicted probabilities and actual class labels, improving performance for classification.
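
To illustrate the first point, here is a short sketch (with a made-up toy dataset) comparing the raw outputs of the two models on binary labels:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0], [1], [2], [3], [10]])  # note the extreme point at x = 10
y = np.array([0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

print(lin.predict([[10]]))               # unbounded; here the output exceeds 1
print(log.predict_proba([[10]])[0, 1])   # always stays within (0, 1)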

4. Why Is It Called Logistic Regression and Not Logistic Classification?

Despite being used for classification tasks, logistic regression retains the term “regression” because of the underlying mathematical model. It is built on a linear regression framework: the model performs linear regression on the log-odds (the logit) of the positive class, producing a continuous value before the sigmoid function is applied. The transformation to a probability using the sigmoid function and the application of a threshold to convert that probability into a class label are what differentiate it from traditional regression.

Thus, logistic regression is technically a regression model, but it is used for classification by interpreting the output as probabilities.


5. Logistic Regression for Multi-Class Classification

While logistic regression is typically used for binary classification, it can be extended to handle multi-class classification. There are several techniques to achieve this:

  1. One-vs-Rest (OvR): In this approach, a separate binary logistic regression model is trained for each class. For each class, the model predicts whether an instance belongs to that class or not (hence “one-vs-rest”). At prediction time, the class with the highest probability is selected as the final prediction.
  2. Softmax Regression: Another common approach for multi-class logistic regression, where the output layer has one unit for each class and the softmax function (defined below) is applied to compute the probabilities for each class. This can be thought of as a generalization of logistic regression to multiple classes.

The One-vs-Rest (OvR) approach is commonly used in practice, and many machine learning libraries like scikit-learn provide a parameter multi_class="ovr" to handle multi-class logistic regression tasks.
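
For reference, the softmax function mentioned above generalizes the sigmoid to K classes, turning a vector of scores $$z_1, \dots, z_K$$ into a valid probability distribution:

$$\text{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

For K = 2, this reduces to the sigmoid function from Section 2.1 applied to the difference of the two scores.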


6. Code Example

Binary Classification

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Load an example binary dataset (malignant vs. benign tumors);
# substitute your own feature matrix and labels here
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize the logistic regression model
# (max_iter raised so the lbfgs solver converges on these unscaled features)
binary_model = LogisticRegression(max_iter=10000)

# Train the model
binary_model.fit(X_train, y_train)

# Predict on the test set
y_pred_binary = binary_model.predict(X_test)

# Evaluate the model
print("Binary Classification Accuracy:", accuracy_score(y_test, y_pred_binary))
print("Classification Report:\n", classification_report(y_test, y_pred_binary))

Multi-Class Classification

For multi-class classification, logistic regression supports One-vs-Rest (OvR) and Softmax (Multinomial) approaches. Here’s how to use both:
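
Both snippets below assume the data has already been split into X_train, X_test, y_train, y_test. As a minimal, illustrative setup, the Iris dataset from scikit-learn works (any multi-class dataset can be substituted):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example multi-class dataset (3 flower species, 4 features)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)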

1. One-vs-Rest (OvR)

# Initialize the logistic regression model with OvR
multi_model_ovr = LogisticRegression(multi_class="ovr")

# Train the model
multi_model_ovr.fit(X_train, y_train)

# Predict on the test set
y_pred_multi_ovr = multi_model_ovr.predict(X_test)

# Evaluate the model
print("Multi-Class (OvR) Accuracy:", accuracy_score(y_test, y_pred_multi_ovr))
print("Classification Report:\n", classification_report(y_test, y_pred_multi_ovr))

2. Multinomial (Softmax)

# Initialize the logistic regression model with multinomial (softmax)
multi_model_softmax = LogisticRegression(multi_class="multinomial", solver="lbfgs")

# Train the model
multi_model_softmax.fit(X_train, y_train)

# Predict on the test set
y_pred_multi_softmax = multi_model_softmax.predict(X_test)

# Evaluate the model
print("Multi-Class (Softmax) Accuracy:", accuracy_score(y_test, y_pred_multi_softmax))
print("Classification Report:\n", classification_report(y_test, y_pred_multi_softmax))

Notes

  • The default solver for LogisticRegression is "lbfgs", which works well for most cases. However, if you encounter performance issues with large datasets, consider "saga" as an alternative.
  • For binary classification, you don’t need to specify multi_class; the model is fit directly as a single binary classifier.
  • Recent scikit-learn releases deprecate the multi_class parameter; if you see a deprecation warning, you can obtain explicit OvR behavior by wrapping the model in sklearn.multiclass.OneVsRestClassifier instead.

7. Summary

Logistic regression is a powerful method for classification, particularly binary classification. It predicts the probability that an instance belongs to a given class using the sigmoid function and is trained by minimizing the cross-entropy loss. It can also be extended to multi-class classification using strategies like One-vs-Rest (OvR) and softmax (multinomial) regression.
