Keras Optimizers Explained: RMSProp

A Comprehensive Overview of the RMSProp Optimization Algorithm

Okan Yenigün
Python in Plain English



RMSProp (Root Mean Squared Propagation) is an adaptive learning rate optimization algorithm.

Training deep learning models requires the optimization of weights with respect to a loss function. Gradient Descent and its variants are the most commonly used algorithms for this task. The basic idea behind Gradient Descent is to adjust the weights of the model in the direction that reduces the loss.

Remember that in Gradient Descent, we often encounter significant fluctuations in one direction (often depicted as the vertical direction in visualizations) even when our main goal is to advance in another direction (typically represented as the horizontal direction).

Gradient Descent. Image by the author.

In some scenarios, especially when the loss surface has steep and narrow valleys, Gradient Descent can cause the optimization process to “bounce” back and forth in the vertical direction (up and down the steep sides of the valley) rather than making steady progress towards the minimum in the horizontal direction.

To help visualize this, imagine the vertical axis representing the parameter b, and the horizontal axis representing the parameter w. However, remember that this is a simplification for illustrative purposes. In reality, we operate within a parameter space with many dimensions.

Image by the author.
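To make the bouncing concrete, here is a minimal sketch, assuming a toy quadratic loss L(w, b) = w² + 25·b² that is shallow in the w direction and steep in the b direction (chosen only for illustration, not taken from any real model):

# Plain Gradient Descent on an illustrative loss: L(w, b) = w**2 + 25 * b**2
def gradients(w, b):
    return 2 * w, 50 * b  # dL/dw, dL/db

w, b = -10.0, 1.0  # far from the minimum in w, slightly off in b
lr = 0.04          # a step size large enough to expose the bouncing

for t in range(10):
    dw, db = gradients(w, b)
    w -= lr * dw
    b -= lr * db
    print(f"step {t}: w={w:+.3f}, b={b:+.3f}")

# b keeps bouncing between +1.000 and -1.000 (up and down the steep walls),
# while w only creeps toward the minimum by about 8% per step.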

We aim to decelerate the updates in the b direction while accelerating them in the w direction. The RMSProp algorithm effectively achieves this. It is designed to adaptively adjust the learning rate based on the recent magnitudes of gradients.

On iteration t:

  • Compute the gradients dw and db on the current mini-batch.
  • S_dw = β · S_dw + (1 − β) · dw²
  • S_db = β · S_db + (1 − β) · db²
  • w := w − α · dw / (√S_dw + ϵ)
  • b := b − α · db / (√S_db + ϵ)

The squaring in (1 − β) · dw² is element-wise, so S_dw and S_db maintain an exponentially weighted average of the squared derivatives.

ϵ is a small constant to prevent division by zero (e.g., 10⁻⁸), and α is the learning rate.

Remember, our goal is to accelerate the learning process in the horizontal direction while decelerating or damping it in the vertical direction to minimize fluctuations.

We anticipate that S_dw will be comparatively small, leading to division by a smaller value during the update process. Conversely, S_db is expected to be larger, resulting in division by a greater number when updating, which in turn slows down the adjustments in the vertical direction.
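Continuing the same toy quadratic from the earlier sketch (again purely illustrative, not the Keras implementation), a minimal NumPy version of these updates shows the effect:

import numpy as np

def gradients(w, b):
    return 2 * w, 50 * b  # gradients of the toy loss L(w, b) = w**2 + 25 * b**2

w, b = -10.0, 1.0
alpha, beta, eps = 0.3, 0.9, 1e-8
S_dw, S_db = 0.0, 0.0

for t in range(10):
    dw, db = gradients(w, b)
    # Exponentially weighted averages of the squared derivatives
    S_dw = beta * S_dw + (1 - beta) * dw ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Divide each step by the root of its running average
    w -= alpha * dw / (np.sqrt(S_dw) + eps)
    b -= alpha * db / (np.sqrt(S_db) + eps)
    print(f"step {t}: w={w:+.3f}, b={b:+.3f}")

# Both directions now take comparably sized steps: b settles near 0 within a
# couple of iterations instead of bouncing, while w marches steadily toward 0.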

So our updates will look more like this.

Image by the author.

The algorithm is named ‘Root Mean Squared’ because it involves squaring the derivatives and subsequently taking the square root.

RMSProp adapts each parameter's learning rate using a moving average of its squared gradient.

Unlike Adagrad, which can have an aggressively decreasing learning rate that makes it stop prematurely, RMSProp’s moving average approach allows for more flexibility.
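To see the difference, here is a toy sketch (not either optimizer's actual library code) that feeds a constant gradient into both accumulator rules and prints the resulting effective step sizes:

import numpy as np

g = 1.0                      # pretend the gradient stays constant
alpha, beta, eps = 0.01, 0.9, 1e-8
adagrad_acc, rmsprop_acc = 0.0, 0.0

for t in range(1, 1001):
    adagrad_acc += g ** 2                                   # sum grows without bound
    rmsprop_acc = beta * rmsprop_acc + (1 - beta) * g ** 2  # levels off near g**2
    if t in (1, 10, 100, 1000):
        adagrad_step = alpha * g / (np.sqrt(adagrad_acc) + eps)
        rmsprop_step = alpha * g / (np.sqrt(rmsprop_acc) + eps)
        print(f"t={t:4d}  Adagrad step={adagrad_step:.5f}  RMSProp step={rmsprop_step:.5f}")

# Adagrad's effective step keeps shrinking toward zero,
# while RMSProp's settles around alpha once the moving average saturates.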

Like other adaptive learning rate methods, RMSProp has hyperparameters that might need tuning for different problems.

Because the moving average of the squared gradient can both grow and shrink, the effective step size is not guaranteed to keep decreasing, which can sometimes lead to instability, especially near the minimum.

Keras Implementation

tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
    weight_decay=None,
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=100,
    jit_compile=True,
    name="RMSprop",
    **kwargs
)
  • learning_rate is, as the name suggests, the step size used for each weight update (0.001 by default).
  • rho is the decay factor for the moving average of the squared gradient, similar to β in the RMSProp formula. It determines the weight of the historical squared gradient in the moving average.
  • momentum accelerates the optimizer in the relevant direction and dampens oscillations. It’s a value between 0 (no momentum) and 1 (maximum). This introduces a velocity component to the original RMSProp, making it similar to the momentum in the traditional momentum SGD optimizer.
  • epsilon is a small constant for numerical stability.
  • centered computes the centered variant of RMSProp when set to True: gradients are normalized by an estimate of their variance (the moving average of the squared gradients minus the squared moving average of the gradients) rather than by the uncentered second moment. This can sometimes improve convergence, at a small extra cost in computation and memory.
  • weight_decay is the L2 regularization value. It can help in preventing overfitting by penalizing large weights.
  • clipnorm clips the gradient of each weight whenever its L2 norm exceeds this value.
  • clipvalue, if set, clips the gradient of each weight so that it is no higher than this value.
  • global_clipnorm is used to ensure the global norm of the gradients does not exceed this value. This is another way to prevent exploding gradients.
  • use_ema determines whether to use the Exponential Moving Average (EMA) of the parameters instead of the parameters themselves.
  • ema_momentum is the momentum for the EMA update.
  • ema_overwrite_frequency is the frequency at which the original model weights are overwritten by the EMA weights.
  • jit_compile compiles the optimizer function using XLA. This can improve the execution speed.
  • name is the name of the optimizer instance.
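As a quick usage sketch (the model architecture, input shape, and data names below are placeholders chosen for illustration, not part of the RMSprop API), the configured optimizer is simply passed to model.compile:

import tensorflow as tf

# Placeholder model; any Keras model is compiled the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001,  # step size
    rho=0.9,              # decay of the squared-gradient moving average (the β above)
    momentum=0.0,         # set > 0 to add a velocity term
    epsilon=1e-7,         # numerical stability constant
    centered=False,       # True normalizes by the estimated variance instead
    clipnorm=1.0,         # optional: clip each weight's gradient to an L2 norm of 1.0
)

model.compile(optimizer=optimizer, loss="mse")
# model.fit(x_train, y_train, epochs=10)  # x_train / y_train stand in for your data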

The RMSProp optimizer stands as a beacon of adaptability in the realm of deep learning optimization algorithms. By dynamically adjusting the learning rate based on the historical magnitudes of gradients, RMSProp offers a robust solution to the challenges of slow convergence and oscillations inherent in traditional gradient descent. Its ability to navigate steep and shallow regions of the loss landscape efficiently makes it a favored choice for many practitioners. While newer optimization techniques continue to emerge, the principles and effectiveness of RMSProp ensure it remains a foundational tool in the deep learning toolkit.

Sources

https://www.youtube.com/watch?v=_e-LFe_igno

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/experimental/RMSprop

https://keras.io/api/optimizers/rmsprop/

https://www.youtube.com/watch?v=ajI_HTyaCu8
