Journal of Machine Learning for Modeling and Computing

Published 4 issues per year

ISSN Print: 2689-3967

ISSN Online: 2689-3975

TRAINABILITY OF ReLU NETWORKS AND DATA-DEPENDENT INITIALIZATION

Volume 1, Issue 1, 2020, pp. 39-74
DOI: 10.1615/JMachLearnModelComput.2020034126

ABSTRACT

In this paper we study the trainability of rectified linear unit (ReLU) networks at initialization. A ReLU neuron is said to be dead if it outputs only a constant for any input. Two death states of neurons are introduced: tentative and permanent death. A network is then said to be trainable if the number of permanently dead neurons is sufficiently small for a learning task. We refer to the probability of a randomly initialized network being trainable as trainability. We show that a network being trainable is a necessary condition for successful training, and that trainability serves as an upper bound on training success rates. To quantify trainability, we study the probability distribution of the number of active neurons at initialization. In many applications, overspecified or overparameterized neural networks are successfully employed and shown to train effectively. With the notion of trainability, we show that overparameterization is both a necessary and a sufficient condition for achieving zero training loss. Furthermore, we propose a data-dependent initialization method in an overparameterized setting. Numerical examples are provided to demonstrate the effectiveness of the method and our theoretical findings.
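The dead-neuron and trainability notions above can be illustrated numerically. The sketch below (Python/NumPy) counts the hidden ReLU neurons of a randomly initialized one-hidden-layer network that never activate on a sample of inputs, and estimates, over many random initializations, how often at least a prescribed number of neurons remain active. The He-style Gaussian weights, the Gaussian biases, the uniform surrogate inputs on [-1, 1]^d, and the threshold min_active are illustrative assumptions; the activity check is only a proxy for the paper's trainability and does not implement its distinction between tentative and permanent death.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact procedure):
# count hidden ReLU neurons of a randomly initialized one-hidden-layer network
# that never activate on a sample of inputs, and estimate how often a random
# initialization keeps at least `min_active` neurons active.
import numpy as np

def count_active_neurons(X, W, b):
    """Number of hidden neurons with ReLU output > 0 for at least one input in X."""
    pre_activation = X @ W.T + b              # shape: (n_samples, n_hidden)
    return int((pre_activation > 0).any(axis=0).sum())

def empirical_trainability(d=2, n_hidden=64, n_samples=256, min_active=32,
                           n_trials=1000, seed=0):
    """Fraction of random initializations with at least `min_active` active neurons."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, d))        # surrogate training inputs
    successes = 0
    for _ in range(n_trials):
        W = rng.normal(0.0, np.sqrt(2.0 / d), size=(n_hidden, d))  # He-style weights
        b = rng.normal(0.0, 1.0, size=n_hidden)                    # illustrative bias scale
        if count_active_neurons(X, W, b) >= min_active:
            successes += 1
    return successes / n_trials

if __name__ == "__main__":
    print(f"estimated probability of enough active neurons: {empirical_trainability():.3f}")
```

Increasing n_hidden relative to min_active drives this empirical probability toward one, consistent with the role of overparameterization discussed in the abstract.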


CITED BY
  1. Cheridito, P., Jentzen, A., and Rossmannek, F., Non-Convergence of Stochastic Gradient Descent in the Training of Deep Neural Networks, Journal of Complexity, vol. 64, 2021.

  2. Cheridito, P., Jentzen, A., Riekert, A., and Rossmannek, F., A Proof of Convergence for Gradient Descent in the Training of Artificial Neural Networks for Constant Target Functions, Journal of Complexity, vol. 72, 2022.

  3. Ainsworth, M. and Shin, Y., Active Neuron Least Squares: A Training Method for Multivariate Rectified Neural Networks, SIAM Journal on Scientific Computing, vol. 44, no. 4, 2022.

  4. Jentzen, A. and Riekert, A., A Proof of Convergence for Stochastic Gradient Descent in the Training of Artificial Neural Networks with ReLU Activation for Constant Target Functions, Zeitschrift für angewandte Mathematik und Physik, vol. 73, no. 5, 2022.
