Discriminative Fine-Tuning

Discriminative Fine-Tuning: An Overview

Discriminative Fine-Tuning is a strategy introduced with ULMFiT-style models. It allows us to tune each layer of a model with a different learning rate to improve accuracy. Fine-tuning is a popular technique in which pre-trained models are adapted to new tasks by updating their parameters with new data. However, fine-tuning all layers with the same learning rate may not be the best option when dealing with complex models. That's where Discriminative Fine-Tuning comes into play.

The regular SGD update of a model’s parameters $\theta$ at a particular time step $t$ can be represented as:

$ \large\theta_{t} = \theta_{t-1} - \eta\cdot\nabla_{\theta}J(\theta)$

Here, $\eta$ is the learning rate and $\nabla_{\theta}J(\theta)$ is the gradient of the model’s objective function with respect to its parameters. With discriminative fine-tuning, however, we split the model parameters $\theta$ into per-layer groups {$\theta^{1}, \ldots, \theta^{L}$}, each with its own learning rate {$\eta^{1}, \ldots, \eta^{L}$}. The update of the parameters of layer $l$ at time step $t$ can then be represented as:

$\large\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l}\cdot\nabla_{\theta^{l}}J(\theta)$

Here, $\theta^{l}$ contains the parameters of the model at the $l$-th layer, $\eta^{l}$ is that layer’s learning rate, and $L$ is the number of layers of the model.
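To make the per-layer update concrete, here is a minimal PyTorch sketch that gives each layer its own optimizer parameter group with its own learning rate. The toy model, layer sizes, and specific rates are illustrative assumptions, not values from the ULMFiT paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained model (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(100, 64),  # lower layer
    nn.ReLU(),
    nn.Linear(64, 32),   # middle layer
    nn.ReLU(),
    nn.Linear(32, 2),    # higher, task-specific layer
)

# One parameter group per layer, each with its own learning rate eta^l.
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 1e-3 / 2.6**2},  # lowest rate
    {"params": model[2].parameters(), "lr": 1e-3 / 2.6},
    {"params": model[4].parameters(), "lr": 1e-3},           # highest rate for the last layer
])

# A single training step: each group is updated with its own eta^l,
# i.e. theta^l_t = theta^l_{t-1} - eta^l * grad_{theta^l} J(theta).
x, y = torch.randn(8, 100), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```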

Why Use Discriminative Fine-Tuning?

While fine-tuning can result in significant improvements in model performance, it can be tricky when dealing with complex and deep neural networks. When adapting a pre-trained network to new data, we typically want to make smaller updates to the lower-level layers and larger updates to the higher-level layers. This is because the lower-level layers capture general features that transfer across tasks, while the higher-level layers capture more task-specific information. If we fine-tune all the layers with the same learning rate, the general knowledge stored in the lower-level layers may get destroyed by overly aggressive optimization, while the higher-level layers may not adapt enough to the new task.

With discriminative fine-tuning, we can address this issue by assigning different learning rates to different layers based on the kind of information they capture. Typically, the learning rate used for lower layers is much lower than that used for higher layers. This helps preserve the general features in the lower-level layers while allowing the higher, more task-specific layers to learn the new task effectively.

How to Choose Learning Rates?

Choosing appropriate learning rates for each layer is critical to the success of discriminative fine-tuning. The authors of ULMFiT followed an empirical process: they first fine-tuned only the last layer to choose its learning rate, and then set the learning rates of the lower layers relative to this value, as $\eta^{l-1}=\eta^{l}/2.6$. This ratio helps prevent drastic changes to the lower-level layers, ensuring that the model performs better on the new task.
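As a small illustration of that heuristic, the helper below (a hypothetical function, not part of the fastai library) expands a chosen last-layer learning rate into a list of per-layer rates by repeatedly dividing by 2.6.

```python
def discriminative_lrs(last_layer_lr: float, num_layers: int, factor: float = 2.6):
    """Return per-layer learning rates from the lowest to the highest layer,
    where each lower layer's rate is the rate of the layer above divided by `factor`."""
    return [last_layer_lr / factor ** (num_layers - 1 - l) for l in range(num_layers)]

# Example: last-layer lr of 0.01 chosen by fine-tuning the last layer alone.
print(discriminative_lrs(0.01, 3))  # approximately [0.00148, 0.00385, 0.01]
```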

Discriminative Fine-Tuning: Benefits and Drawbacks

One of the primary benefits of discriminative fine-tuning is that it enables us to update different layers with different learning rates. This way, we can adapt the higher-level layers more aggressively to the new task while updating the lower-level layers more gently, preserving the general features they learned during pre-training. Additionally, discriminative fine-tuning can help to deal with the overfitting problem, which is often a challenge in deep neural networks.

The main downside of discriminative fine-tuning is that it can be expensive, since in principle a learning rate must be chosen for each layer. As such, it may not be practical to apply this technique to extremely deep models with hundreds of layers. Additionally, selecting appropriate learning rates for each layer can be a daunting task, especially if we are dealing with a new, untested network architecture.

Discriminative fine-tuning is a useful technique that helps optimize deep neural networks effectively. It allows us to assign each layer its own learning rate so we can fine-tune our model more carefully, and the choice of those learning rates is critical for the technique to work. Its benefit is better optimization of pre-trained networks: it helps prevent overfitting and lets us treat lower-level layers differently from higher-level layers. The downside is that it can be expensive and challenging to apply, especially to brand new network architectures.
