
Yesterday, researchers from Carnegie Mellon University, University of Southern California, Peking University, and Massachusetts Institute of Technology published a paper on a fundamental optimization problem in deep learning. This study proves that randomly initialized gradient descent can achieve zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).

The key idea is to show that under over-parameterization the Gram matrix induced by the network stays stable throughout training, which guarantees that every step of gradient descent decreases the loss at a geometric rate.
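
Concretely, writing u(k) for the vector of network predictions after k steps of gradient descent, y for the labels, η for the step size, and λ₀ for the least eigenvalue of the limiting Gram matrix, the bound established is roughly of the form (the exact constants in the paper differ):

$$\|y - u(k)\|_2^2 \;\le\; \left(1 - \frac{\eta \lambda_0}{2}\right)^{k} \|y - u(0)\|_2^2,$$

so the training loss shrinks geometrically in the number of iterations, which is what "linear rate" refers to in the results below.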

What is this study based on?

This study builds on two ideas from previous works on gradient descent for two-layer neural networks:

  • The first idea is the analysis of the dynamics of the network's predictions, whose convergence rate is determined by the least eigenvalue of the Gram matrix induced by the neural network architecture. To lower bound this least eigenvalue, it suffices to bound the distance of each weight matrix from its initialization (see the sketch after this list).
  • The second idea is the observation by Li and Liang that if the neural network is over-parameterized, every weight matrix stays close to its initialization throughout training.
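
To make the role of the Gram matrix concrete, here is a minimal numerical sketch (not code from the paper) for a two-layer ReLU network: it builds the Gram matrix induced by a random initialization and reports its least eigenvalue as the width m grows. The data X, the choice of widths, and the helper gram_matrix are all illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): for a two-layer ReLU
# network f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x), the Gram matrix at
# initialization is H_ij = (x_i . x_j) * (1/m) * #{r : w_r.x_i >= 0 and w_r.x_j >= 0}.
# The least eigenvalue of H stabilizes as the width m grows.
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 5                                       # data points, input dimension
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs, as in the analysis

def gram_matrix(X, m):
    """Gram matrix induced by a randomly initialized two-layer ReLU network of width m."""
    W = rng.normal(size=(m, X.shape[1]))           # first-layer weights w_r ~ N(0, I)
    active = (X @ W.T >= 0).astype(float)          # ReLU activation patterns, shape (n, m)
    return (X @ X.T) * (active @ active.T) / m     # entrywise product, averaged over neurons

for m in (10, 100, 1_000, 10_000):
    H = gram_matrix(X, m)
    print(f"width m = {m:>6}: least eigenvalue of H = {np.linalg.eigvalsh(H)[0]:.4f}")
```

As m increases, this matrix concentrates around a fixed limiting Gram matrix, and it is the least eigenvalue of that limit that controls the convergence rate in the analysis.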

What are the key observations made in this study?

This study focuses on the least squares loss and assumes the activation is Lipschitz and smooth. Consider that there are n data points and the neural network has H layers with width m.
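
In standard notation, with f(W, x_i) the network's prediction on the i-th input and y_i the corresponding label, the least squares objective being minimized can be written as:

$$L(W) \;=\; \frac{1}{2} \sum_{i=1}^{n} \big( f(W, x_i) - y_i \big)^2.$$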

The study proves the following results:

  • Fully-connected feedforward network: If m = Ω(poly(n) · 2^O(H)), then randomly initialized gradient descent converges to zero training loss at a linear rate.
  • ResNet architecture: If m = Ω(poly(n, H)), then randomly initialized gradient descent converges to zero training loss at a linear rate. Compared with the first result, the dependence on the number of layers improves exponentially for ResNet, which demonstrates the advantage of using residual connections (see the rough comparison after this list).
  • Convolutional ResNet: The same technique is used to analyze the convolutional ResNet. If m = poly(n, p, H) where p is the number of patches, then randomly initialized gradient descent achieves zero training loss.
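
To get a feel for the exponential gap between the two width requirements, here is a rough back-of-the-envelope comparison with made-up polynomial factors (the paper's exact polynomials differ); it only illustrates how 2^O(H) blows up relative to poly(H):

```python
# Rough illustration with assumed, made-up polynomial factors: the required
# width scales like poly(n) * 2^{O(H)} for the fully-connected network but
# only like poly(n, H) for ResNet.
n = 1_000                                    # number of training points (illustrative)
for H in (5, 10, 20, 40):
    feedforward_width = n**2 * 2**H          # stand-in for Omega(poly(n) * 2^{O(H)})
    resnet_width = n**2 * H**2               # stand-in for Omega(poly(n, H))
    print(f"H = {H:>2}: feedforward ~ {feedforward_width:.2e}, ResNet ~ {resnet_width:.2e}")
```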

To learn more, read the full paper: Gradient Descent Finds Global Minima of Deep Neural Networks.
