Journal of Machine Learning for Modeling and Computing

Print ISSN: 2689-3967
Online ISSN: 2689-3975


DOI: 10.1615/.2020034126
pages 39-74


Yeonjong Shin
Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912, USA
George Em Karniadakis
Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912, USA


In this paper we study the trainability of rectified linear unit (ReLU) networks at initialization. A ReLU neuron is said to be dead if it outputs a constant for every input. Two death states of neurons are introduced: tentative and permanent death. A network is then said to be trainable if the number of permanently dead neurons is sufficiently small for a learning task. We refer to the probability of a randomly initialized network being trainable as its trainability. We show that a network being trainable is a necessary condition for successful training, and that trainability serves as an upper bound on the training success rate. To quantify trainability, we study the probability distribution of the number of active neurons at initialization. In many applications, overspecified or overparameterized neural networks are successfully employed and shown to train effectively. Using the notion of trainability, we show that overparameterization is both a necessary and a sufficient condition for achieving zero training loss. Furthermore, we propose a data-dependent initialization method in the overparameterized setting. Numerical examples are provided to demonstrate the effectiveness of the method and our theoretical findings.
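The notion of a neuron that is dead at initialization can be checked empirically. The sketch below (an illustration, not the paper's construction) counts the hidden neurons of a randomly initialized one-hidden-layer ReLU network that never activate on a finite sample of inputs; the He-style weight scale, unit-normal biases, and the input domain [-1, 1]^d are all assumptions made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_born_dead(n_in=2, n_hidden=100, n_samples=2000):
    """Empirically count hidden ReLU neurons of a randomly initialized
    one-hidden-layer network that are inactive on every sampled input.

    A neuron whose pre-activation is <= 0 over the whole input domain
    outputs the constant 0, i.e., it is dead at initialization. Here the
    domain [-1, 1]^n_in is probed by a finite uniform sample, so this is
    only an upper-bound check on activity, not a proof of death.
    """
    # He-style normal weights (common for ReLU nets) and normal biases;
    # these distributional choices are assumptions for this illustration.
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_hidden, n_in))
    b = rng.normal(0.0, 1.0, size=n_hidden)

    x = rng.uniform(-1.0, 1.0, size=(n_samples, n_in))
    pre = x @ W.T + b                 # pre-activations, (n_samples, n_hidden)
    active = (pre > 0).any(axis=0)    # neuron fires on at least one sample
    return int(n_hidden - active.sum())

print(count_born_dead())
```

Running this over many random seeds gives an empirical estimate of the distribution of the number of active neurons at initialization, the quantity the paper analyzes to bound trainability.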


