Stochastic Gradient Descent

Stochastic Gradient Descent (SGD): A Simple Overview

Machine learning models are essential for predicting outcomes, identifying trends, and extracting insights from data. We train these models with mathematical optimization techniques, which make their predictions on future data more accurate. One popular optimization technique is stochastic gradient descent (SGD).

What is SGD?

SGD is an iterative optimization technique that uses mini-batches of data to estimate the gradient of the loss function. Instead of using the entire dataset, SGD randomly selects a small subset of the data at each step and uses it to form an estimate of the full gradient. Computing the exact gradient over the full dataset can be slow or expensive, so SGD trades some noise in each estimate for much quicker weight updates, since each update depends on only a small amount of data.

The objective of optimization is to minimize the loss function, which measures the difference between the predicted and true outcomes. The update rule for the SGD algorithm is as follows:

$$ w_{t+1} = w_{t} - \eta \hat{\nabla}_{w}{L(w_{t})} $$

where $w$ is the vector of model weights, $L$ is the loss function, $t$ is the iteration number, $\eta$ is the learning rate, and $\hat{\nabla}_{w} L(w_{t})$ is the mini-batch estimate of the gradient. The learning rate is a scalar that controls the size of each weight update. At every iteration, the algorithm uses a mini-batch of data to estimate the gradient and then moves the model's parameters $w$ in the direction of the negative gradient.
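As a concrete illustration, here is a minimal NumPy sketch of this update rule applied to linear regression with a squared-error loss; the batch size, learning rate, and synthetic data are arbitrary choices made purely for the example.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD for linear regression with a squared-error loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    for _ in range(epochs):
        # Shuffle once per epoch, then walk through the data in mini-batches.
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]

            # Mini-batch estimate of the gradient of L(w) = mean((X_b w - y_b)^2).
            grad = 2.0 / len(idx) * X_b.T @ (X_b @ w - y_b)

            # The SGD update: w_{t+1} = w_t - eta * (gradient estimate)
            w -= lr * grad
    return w

# Example on synthetic data: the recovered weights should be close to [2, -1, 0.5].
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)
print(sgd_linear_regression(X, y))
```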

Advantages of SGD

SGD has several advantages over other optimization techniques:

1. Reduced Redundancy

The main advantage of SGD is that it reduces redundancy, since it relies on mini-batches of data for the gradient estimate. In batch gradient descent, the algorithm recomputes gradients over many similar examples before making a single parameter update. In contrast, SGD performs a weight update for every mini-batch of data, as the sketch below shows. This makes each update much cheaper, which is why SGD is so popular for training on large datasets.
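To make the contrast concrete, here is a small sketch (reusing the same linear-model squared-error gradient as in the example above; the function names are just for illustration) showing that batch gradient descent gets one parameter update per pass over the data, while mini-batch SGD gets many cheap ones.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error for a linear model.
    return 2.0 / len(y) * X.T @ (X @ w - y)

def batch_gd_epoch(w, X, y, lr=0.01):
    # One pass over the data = one gradient over ALL examples = a single update.
    return w - lr * grad(w, X, y)

def sgd_epoch(w, X, y, lr=0.01, batch_size=32, seed=0):
    # One pass over the data = many cheap updates, each from a small mini-batch.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        w = w - lr * grad(w, X[idx], y[idx])
    return w
```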

2. Suitable for Online Learning

SGD is a popular algorithm for online learning. In online learning, data arrives as a continuous stream and the underlying model must be updated as new examples come in. Because SGD updates the weights one mini-batch at a time, it handles large or unbounded datasets naturally and can adapt quickly to changing data distributions.
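As a sketch of how this looks in practice, scikit-learn's SGDClassifier exposes a partial_fit method that applies SGD updates to each incoming mini-batch; the data stream below is simulated with synthetic data purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# loss="log_loss" gives logistic regression (named "log" in older scikit-learn versions).
model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
classes = np.array([0, 1])

# Simulate a data stream: each mini-batch arrives once and is never revisited.
for step in range(200):
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + 0.3 * rng.normal(size=32) > 0).astype(int)
    # partial_fit runs SGD updates on just this mini-batch.
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.coef_)  # the first coefficient should dominate
```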

3. Works well with Non-convex Functions

SGD performs well on non-convex and noisy loss surfaces. The noise introduced by random mini-batch sampling can help the iterates escape shallow local minima in the search for a better, ideally global, minimum. This makes SGD an excellent optimizer for neural networks, whose training objectives are non-convex.

These advantages, among others, make SGD a widely-used optimization algorithm in machine learning models. Nonetheless, SGD also has some limitations.

Disadvantages of SGD

Although the SGD algorithm sounds perfect, there are some limitations to it. Here are some of the significant disadvantages:

1. Noise-Dependent

The SGD algorithm computes each update from a randomly selected mini-batch, so its gradient estimates are inherently noisy. This noise produces erratic weight updates that make the loss curve jumpier and can make it harder for the iterates to settle at the true minimum of the loss.
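One way to see this noise, again assuming a linear model with a squared-error loss as in the earlier sketches, is to compare mini-batch gradient estimates against the full-batch gradient for different batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.normal(size=10_000)
w = np.zeros(3)

# Exact gradient over the full dataset.
full_grad = 2.0 / len(y) * X.T @ (X @ w - y)

for batch_size in (8, 64, 512):
    errors = []
    for _ in range(200):
        # A mini-batch gradient estimate from a random subset of the data.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g = 2.0 / batch_size * X[idx].T @ (X[idx] @ w - y[idx])
        errors.append(np.linalg.norm(g - full_grad))
    # Smaller batches give noisier (higher-error) gradient estimates.
    print(batch_size, float(np.mean(errors)))
```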

2. Sensitivity to Learning Rate

The learning rate $\eta$ controls the size of the weight updates, so finding the right value is essential. A learning rate that is too large can lead to divergent weight updates or sudden increases in the loss function. On the other hand, a learning rate that is too small may make training take far too long to converge.
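To illustrate, take the toy one-dimensional loss $L(w) = w^2$ (chosen purely for this example), whose gradient is $2w$; a few lines of plain gradient descent show the divergent, fast, and slow regimes.

```python
def run(lr, steps=50, w=1.0):
    # Plain gradient descent on L(w) = w^2, whose gradient is 2w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(run(lr=1.5))    # |1 - 2*1.5| = 2 > 1: the iterates blow up (divergence)
print(run(lr=0.4))    # contraction factor 0.2: converges to ~0 very quickly
print(run(lr=0.001))  # contraction factor 0.998: barely moves in 50 steps
```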

3. Can Stagnate at Saddle Points

SGD does not work well around saddle points. A saddle point is a point where the gradient of the loss is zero but which is not a minimum: the loss curves upward along some directions and downward along others. Because the gradient vanishes there, updates can slow to a crawl and the optimizer may stagnate, which makes saddle points a genuine challenge for SGD.
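A standard textbook example is $f(x, y) = x^2 - y^2$, which has a saddle point at the origin: the gradient is zero there, yet the point is a minimum along $x$ and a maximum along $y$. A gradient-descent run started exactly on the $x$-axis simply slides into the saddle and stalls.

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = x^2 - y^2.
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 0.0])  # start on the axis that leads straight into the saddle
for _ in range(200):
    p = p - 0.1 * grad(p)

print(p)  # ends up at ~(0, 0), where the gradient vanishes and updates stop
```

In practice, the noise in SGD's mini-batch gradients can sometimes nudge the iterates off such a ridge, but progress near a saddle point is still typically slow.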

Stochastic gradient descent is an iterative optimization algorithm that is widely used for training machine learning models. It offers several advantages, including reduced redundancy, fast updates, suitability for online learning, and good behavior on non-convex problems. Its limitations include sensitivity to gradient noise, sensitivity to the learning rate, and possible stagnation at saddle points. To choose the best optimization algorithm, it is therefore essential to understand the data you are working with and to test an algorithm's performance in that context.
