Journal of Machine Learning for Modeling and Computing

ISSN Print: 2689-3967
ISSN Online: 2689-3975


DOI: 10.1615/.2020034126
Pages: 39-74

TRAINABILITY OF ReLU NETWORKS AND DATA-DEPENDENT INITIALIZATION

Yeonjong Shin
Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912, USA
George Em Karniadakis
Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912, USA

ABSTRACT

In this paper we study the trainability of rectified linear unit (ReLU) networks at initialization. A ReLU neuron is said to be dead if it outputs only a constant for any input. Two death states of neurons are introduced: tentative death and permanent death. A network is then said to be trainable if the number of permanently dead neurons is sufficiently small for a learning task. We refer to the probability of a randomly initialized network being trainable as its trainability. We show that a network being trainable is a necessary condition for successful training, and that the trainability serves as an upper bound on training success rates. To quantify the trainability, we study the probability distribution of the number of active neurons at initialization. In many applications, overspecified or overparameterized neural networks are successfully employed and shown to be trained effectively. With the notion of trainability, we show that overparameterization is both a necessary and a sufficient condition for achieving a zero training loss. Furthermore, we propose a data-dependent initialization method in an overparameterized setting. Numerical examples are provided to demonstrate the effectiveness of the method and our theoretical findings.
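To make the quantities in the abstract concrete, the short Python sketch below counts ReLU neurons that never activate on a sample of inputs at random initialization, and uses repeated trials as a crude Monte Carlo proxy for trainability. It is only an illustration, not the paper's construction: the one-hidden-layer architecture, He-normal weights, standard-normal biases, the input domain [-1, 1]^d, and all function names are assumptions made here for the example.

    import numpy as np

    def count_dead_neurons(width, input_dim=1, n_inputs=1000, seed=0):
        """Count hidden ReLU neurons that never activate on the sampled inputs.

        Assumptions (not taken from the paper): one hidden layer, He-normal
        weights, standard-normal biases, inputs uniform on [-1, 1]^input_dim.
        """
        rng = np.random.default_rng(seed)
        W = rng.normal(0.0, np.sqrt(2.0 / input_dim), size=(width, input_dim))
        b = rng.normal(0.0, 1.0, size=width)
        X = rng.uniform(-1.0, 1.0, size=(n_inputs, input_dim))
        pre_act = X @ W.T + b                # pre-activations, shape (n_inputs, width)
        active = (pre_act > 0).any(axis=0)   # neuron fires for at least one sampled input
        return int(width - active.sum())

    # Crude Monte Carlo proxy: fraction of random initializations with no
    # dead neurons, for a few hidden-layer widths.
    trials = 500
    for width in (2, 4, 8, 16):
        ok = sum(count_dead_neurons(width, seed=t) == 0 for t in range(trials))
        print(f"width {width:2d}: fraction with no dead neurons = {ok / trials:.2f}")

Note that this static check only detects neurons that are dead on the sampled data; the tentative versus permanent distinction in the abstract cannot be determined from a single check at initialization like this one.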

REFERENCES

  1. Allen-Zhu, Z., Li, Y., and Song, Z., A Convergence Theory for Deep Learning via Over-Parameterization, arXiv preprint, 2018. arXiv: 1811.03962.

  2. Byrd, R.H., Lu, P., Nocedal, J., and Zhu, C., A Limited Memory Algorithm for Bound Constrained Optimization, SIAM J. Sci. Comput., vol. 16, no. 5, pp. 1190-1208, 1995.

  3. Cybenko, G., Approximation by Superpositions of a Sigmoidal Function, Math. Control, Signals Sys., vol. 2, no. 4, pp. 303-314, 1989.

  4. Du, S.S., Lee, J.D., Li, H., Wang, L., and Zhai, X., Gradient Descent Finds Global Minima of Deep Neural Networks, arXiv preprint, 2018a. arXiv: 1811.03804.

  5. Du, S.S., Zhai, X., Poczos, B., and Singh, A., Gradient Descent Provably Optimizes Over-Parameterized Neural Networks, arXiv preprint, 2018b. arXiv: 1810.02054.

  6. Duchi, J., Hazan, E., and Singer, Y., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, J. Machine Learning Res., vol. 12, no. Jul, pp. 2121-2159, 2011.

  7. Glorot, X. and Bengio, Y., Understanding the Difficulty of Training Deep Feedforward Neural Networks, Int. Conf. on Artificial Intelligence and Statistics, pp. 249-256, Sardinia, Italy, May 13-15, 2010.

  8. He, K., Zhang, X., Ren, S., and Sun, J., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, IEEE Int. Conf. on Computer Vision, pp. 1026-1034, Santiago, Chile, December 13-16, 2015.

  9. Hinton, G., Overview of Mini-Batch Gradient Descent, accessed from http://www.cs.toronto.edu/tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014.

  10. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., and Kingsbury, B., Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82-97, 2012.

  11. Hornik, K., Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, vol. 4, no. 2, pp. 251-257, 1991.

  12. Ioffe, S. and Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proc. of the 32nd Int. Conf. on Machine Learning, vol. 37, pp. 448-456, 2015.

  13. Kingma, D.P. and Ba, J., Adam: A Method for Stochastic Optimization, Int. Conf. on Learning Representations, San Diego, CA, USA, May 7-9, 2015.

  14. Krahenbuhl, P., Doersch, C., Donahue, J., and Darrell, T., Data-Dependent Initializations of Convolutional Neural Networks, arXiv preprint, 2015. arXiv: 1511.06856.

  15. Krizhevsky, A., Sutskever, I., and Hinton, G., ImageNet Classification with Deep Convolutional Neural Networks, Adv. Neural Inf. Proc. Sys., vol. 25, pp. 1097-1105, 2012.

  16. LeCun, Y., Bottou, L., Orr, G.B., and Muller, K.R., Efficient Backprop, Neural Networks: Tricks of the Trade, Berlin-Heidelberg, Germany: Springer, pp. 9-48, 2012.

  17. Leopardi, P.C., Distributing Points on the Sphere: Partitions, Separation, Quadrature and Energy, PhD thesis, University of New South Wales, Sydney, Australia, 2007.

  18. Li, Y. and Liang, Y., Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, Adv. Neural Inf. Proc. Sys., vol. 31, pp. 8157-8166, 2018.

  19. Livni, R., Shalev-Shwartz, S., and Shamir, O., On the Computational Efficiency of Training Neural Networks, Adv. Neural Inf. Proc. Sys., vol. 27, pp. 855-863, 2014.

  20. Lu, L., Shin, Y., Su, Y., and Karniadakis, G.E., Dying ReLU and Initialization: Theory and Numerical Examples, arXiv preprint, 2019. arXiv: 1903.06733.

  21. Mishkin, D. and Matas, J., All You Need Is a Good Init, Int. Conf. on Learning Representations, San Juan, Puerto Rico, USA, May 2-4, 2016.

  22. Nguyen, Q. and Hein, M., The Loss Surface of Deep and Wide Neural Networks, Proc. of the 34th Int. Conf. on Machine Learning, vol. 70, pp. 2603-2612, 2017.

  23. Oymak, S. and Soltanolkotabi, M., Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks, arXiv preprint, 2019. arXiv: 1902.04674.

  24. Reddi, S.J., Kale, S., and Kumar, S., On the Convergence of Adam and Beyond, arXiv preprint, 2019. arXiv: 1904.09237.

  25. Robbins, H. and Monro, S., A Stochastic Approximation Method, Annals Math. Stat., vol. 22, no. 3, pp. 400-407, 1951.

  26. Ruder, S., An Overview of Gradient Descent Optimization Algorithms, arXiv preprint, 2016. arXiv: 1609.04747.

  27. Rumelhart, D.E., Hinton, G.E., and Williams, R.J., Learning Internal Representations by Error Propagation, Tech. Rep., California University San Diego, La Jolla Institute for Cognitive Science, 1985.

  28. Safran, I. and Shamir, O., On the Quality of the Initial Basin in Overspecified Neural Networks, Proc. of the 33rd Int. Conf. on Machine Learning, vol. 48, pp. 774-782, 2016.

  29. Salimans, T. and Kingma, D.P., Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, Adv. Neural Inf. Proc. Sys., vol. 29, pp. 901-909, 2016.

  30. Saxe, A.M., McClelland, J.L., and Ganguli, S., Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks, Int. Conf. Learning Representations, Banff, Canada, April 14-16, 2014.

  31. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G.V.D., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M., Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature, vol. 529, no. 7587, p. 484, 2016.

  32. Soltanolkotabi, M., Javanmard, A., and Lee, J.D., Theoretical Insights into the Optimization Landscape of Over-Parameterized Shallow Neural Networks, IEEE Transact. Inf. Theor., vol. 65, no. 2, pp. 742-769, 2019.

  33. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, arXiv preprint, 2016. arXiv: 1609.08144.

  34. Zou, D., Cao, Y., Zhou, D., and Gu, Q., Stochastic Gradient Descent Optimizes Over-Parameterized Deep ReLU Networks, arXiv preprint, 2018. arXiv: 1811.08888.


Articles with similar content:

Numerical Assessment of Theoretical Error Estimates in Coarse-Grained Kinetic Monte Carlo Simulations: Application to Surface Diffusion
International Journal for Multiscale Computational Engineering, Vol.3, 2005, issue 1
Markos A. Katsoulakis, Dionisios G. Vlachos, Abhijit Chatterjee
A Radio Controller Using Speech for the Blind
Critical Reviews™ in Biomedical Engineering, Vol.28, 2000, issue 3&4
Ren-Men Won, Chih-Lung Lin, Jer-Junn Luh, Cheng-Tao Ru, Te-Son Kuo, Maw-Huel Lee
ROBUST ADAPTIVE CONTROL OF SISO DYNAMIC HYBRID SYSTEMS
Hybrid Methods in Engineering, Vol.2, 2000, issue 1
M. de la Sen
A GRADIENT-BASED SAMPLING APPROACH FOR DIMENSION REDUCTION OF PARTIAL DIFFERENTIAL EQUATIONS WITH STOCHASTIC COEFFICIENTS
International Journal for Uncertainty Quantification, Vol.5, 2015, issue 1
Miroslav Stoyanov, Clayton G. Webster
Uniform Sampling of Fundamental Simplexes as Sets of Players' Mixed Strategies in the Finite Noncooperative Game for Finding Equilibrium Situations with Possible Concessions
Journal of Automation and Information Sciences, Vol.47, 2015, issue 9
Vadim V. Romanuke