Connecting Gradient Descent to Physically Observed Solutions

Pavan B Govindaraju
3 min read · Mar 22, 2023

Gradient descent is a widely used optimization technique in the machine learning community. Training a neural network amounts to solving an optimization problem, and because of the large number of parameters involved, a lightweight optimization method is preferred.

Gradient Descent (Cyprien Delaporte on Unsplash)

This makes gradient descent a great candidate, as it only requires first-derivative information, unlike methods such as Newton's, which also need the second derivatives. Since training is already computationally intensive due to the high dimensionality of typical neural networks, first-order methods are the common choice and have been observed to be sufficient empirically as well.
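Concretely, for a loss L over the parameters w and learning rate η, the two updates are

w_{t+1} = w_t − η ∇L(w_t) (gradient descent)

w_{t+1} = w_t − [∇²L(w_t)]⁻¹ ∇L(w_t) (Newton's method),

and forming, storing, and inverting the Hessian ∇²L becomes prohibitively expensive when w has millions of entries.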

Physics-informed neural networks (PINNs) are a common approach for predicting physical phenomena [1]: the solution u(t, x) is approximated by a neural network trained to minimize the error on initial and boundary data, together with an additional term that penalizes the residual of the physical model itself at a set of sample points. Both errors are measured in the L2-norm.
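In the notation of [1], the training objective combines these two terms:

Loss(θ) = (1/N_u) Σᵢ |u_θ(tᵢ, xᵢ) − uᵢ|² + (1/N_f) Σⱼ |f_θ(tⱼ, xⱼ)|²,

where u_θ is the network approximation, f_θ := ∂u_θ/∂t + N[u_θ] is the residual of the governing equation, the first sum runs over the N_u initial/boundary points, and the second over the N_f collocation points.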

In practice, variants of gradient descent are used for training to reach a solution quickly. One common modification is weight decay, which is easiest to explain through its update equation:
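For a loss L, learning rate η, and decay coefficient λ, the update takes the form

w_{t+1} = w_t − η ∇L(w_t) − η λ w_t = (1 − η λ) w_t − η ∇L(w_t),

so the weights are additionally shrunk towards zero by a factor (1 − ηλ) at every step.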

Typically, the weight-decay term involves only the weights and not the biases. The key observation is that adding this term is equivalent to adding an L2-norm penalty on w to the objective function, so the final solution is biased towards smaller weights.
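To see the equivalence, apply plain gradient descent to the penalized objective L(w) + (λ/2) ‖w‖²: its gradient is ∇L(w) + λ w, so the resulting step w_{t+1} = w_t − η (∇L(w_t) + λ w_t) is exactly the weight-decay update above.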

Common ML frameworks provide this out of the box: TensorFlow/Keras exposes it as a kernel_regularizer on each layer, while PyTorch optimizers accept a weight_decay argument.
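As a minimal sketch (the layer sizes and the 1e-4 coefficient are arbitrary illustrative choices), the two APIs look like this:

```python
# Minimal sketch: a small fully connected net for u(t, x) with weight decay
# in both frameworks. Layer sizes and the 1e-4 coefficient are arbitrary.

# TensorFlow / Keras: attach an L2 penalty per layer via kernel_regularizer
import tensorflow as tf

tf_model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.L2(1e-4)),
    tf.keras.layers.Dense(1,
                          kernel_regularizer=tf.keras.regularizers.L2(1e-4)),
])

# PyTorch: pass weight_decay to the optimizer
import torch

torch_model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(), torch.nn.Linear(20, 1)
)
optimizer = torch.optim.SGD(torch_model.parameters(), lr=1e-3, weight_decay=1e-4)
```

Note that kernel_regularizer penalizes only the layer's kernel (the weights), matching the convention above, whereas PyTorch's weight_decay is applied to every parameter handed to the optimizer, biases included, unless parameter groups are used to exclude them.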

If ReLU is chosen as the activation function and the network is “positively constrained” (all weights and biases are positive), this leads to an interesting observation. Since the inputs t, x can be taken to be positive, every pre-activation is positive and ReLU acts as the identity, so each layer effectively performs the affine operation w·x + b. And since weight decay shrinks the norm of w, the final solution itself is biased towards a smaller norm.
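A toy check of this in NumPy (the sizes and random positive weights are purely illustrative):

```python
# Toy check: with positive inputs, weights, and biases, ReLU never clips,
# so each layer is the affine map w·x + b, and shrinking the weights
# (as weight decay does) shrinks the network output as well.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

x = rng.uniform(0.1, 1.0, size=5)        # positive inputs, e.g. (t, x) samples
W1 = rng.uniform(0.0, 1.0, size=(8, 5))  # positive weights
b1 = rng.uniform(0.0, 1.0, size=8)       # positive biases
W2 = rng.uniform(0.0, 1.0, size=(1, 8))
b2 = rng.uniform(0.0, 1.0, size=1)

def net(x, scale=1.0):
    h = relu(scale * W1 @ x + b1)        # argument is positive, so ReLU is the identity
    return scale * W2 @ h + b2

print(net(x, 1.0), net(x, 0.5))          # smaller weights -> smaller output
```

Halving the weights visibly shrinks the output, which is the mechanism that biases the learned solution towards a smaller norm.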

Thus, an L2-penalty on the weights is, in this case, effectively an L2-penalty on the solution itself. Using such variants of gradient descent therefore implicitly yields the minimal-energy solution, which is also what is physically observed in equilibrium situations.

Note that Adam and RMSProp are also commonly used with a weight-decay term and would then bias towards low-energy solutions as well, although this needs to be shown more rigorously, since weight decay and an L2 penalty are no longer equivalent for adaptive methods.

References

[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.
