Why does gradient descent take the negative gradient?

By Nischal Lal Shrestha | On July 19, 2025 | Filed under Algorithm


Imagine standing halfway up a mountain. It's cold, windy, and our only goal is simple: get to the lowest point in the valley.

But here's the problem: we are blindfolded. We can't see the landscape, the valleys, or even our own feet. The only information we're given is "The ground slopes this way."

So what do we do?

We trust the slope. We step in the direction opposite to the slope. That, in essence, is gradient descent.

The Formula Behind the Intuition

In machine learning, gradient descent is the optimization algorithm we use to tweak a model's parameters, such as the weights of a neural network, in order to minimize a cost (or loss) function. Its core update rule is:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \nabla J(w)$$

Where:

  • \(w\) is the weight (or parameter),
  • \(\eta\) is the learning rate (a small positive constant),
  • \(\nabla J(w)\) is the gradient of the cost function with respect to \(w\).

The gradient \(\nabla J(w)\) points in the direction of steepest increase of the function \(J(w)\). Since we want to minimize the function, we move in the opposite direction, hence the negative sign in front of the gradient.
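
To make the rule concrete, here is a minimal sketch in Python. The quadratic \(J(w) = (w - 3)^2\), the starting weight, and the step count are all illustrative choices, not from the post; the point is that repeatedly subtracting \(\eta \cdot \nabla J(w)\) walks \(w\) toward the minimum at \(w = 3\):

```python
# Minimal gradient descent on J(w) = (w - 3)**2, whose minimum is at w = 3.
def grad_J(w):
    return 2 * (w - 3)  # dJ/dw: the gradient of the cost function

w = 0.0    # illustrative starting weight
eta = 0.1  # learning rate

for step in range(50):
    w = w - eta * grad_J(w)  # the core update: w_new = w_old - eta * grad

print(w)  # ~2.99996, close to the true minimum at w = 3
```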

What If the Gradient Is Positive?

Suppose we're at a point where the slope is positive: \(\nabla J(w) = +0.6\), with learning rate \(\eta = 0.1\). Then the update rule gives: $$w_{\text{new}} = w_{\text{old}} - 0.1 \cdot 0.6 = w_{\text{old}} - 0.06$$ Here we are decreasing the weight. Since the function is increasing at this point, moving in the opposite direction (i.e., decreasing the weight) takes us downhill.
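
As a quick sanity check, the same arithmetic in plain Python (the starting weight of 2.0 is arbitrary, purely for illustration):

```python
w_old = 2.0  # arbitrary current weight
grad = 0.6   # positive slope: J is increasing here
eta = 0.1

w_new = w_old - eta * grad
print(w_new - w_old)  # ~ -0.06: the update decreases the weight
```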

What If the Gradient Is Negative?

Suppose we're at a point where the slope is negative: \(\nabla J(w) = -0.4\), with learning rate \(\eta = 0.1\). Then the update rule gives: $$w_{\text{new}} = w_{\text{old}} - 0.1 \cdot (-0.4) = w_{\text{old}} + 0.04$$ Here we are increasing the weight. Since the function is decreasing here (the slope is negative), moving in the opposite direction (i.e., increasing the weight) takes us downhill.
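
And the mirror-image check for this case (again with an illustrative starting weight of 2.0): subtracting a negative gradient adds to the weight, so the step still points downhill.

```python
w_old = 2.0  # arbitrary current weight
grad = -0.4  # negative slope: J is decreasing here
eta = 0.1

w_new = w_old - eta * grad
print(w_new - w_old)  # ~ +0.04: the update increases the weight
```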