Path: blob/master/translate_cache/optimizers/radam.zh.json
{1"<h1>Rectified Adam (RAdam) optimizer</h1>\n": "<h1>\u6821\u6b63\u4e9a\u5f53 (raDAM) \u4f18\u5316\u5668</h1>\n",2"<h2>Rectified Adam Optimizer</h2>\n<p>This class extends from AMSAdam optimizer defined in <a href=\"amsadam.html\"><span translate=no>_^_0_^_</span></a>.</p>\n": "<h2>\u7ea0\u6b63\u4e9a\u5f53\u4f18\u5316\u5668</h2>\n<p>\u8fd9\u4e2a\u7c7b\u662f\u4ece\u4e2d\u5b9a\u4e49\u7684 AmsadAM \u4f18\u5316\u5668\u6269\u5c55\u800c\u6765\u7684<a href=\"amsadam.html\"><span translate=no>_^_0_^_</span></a>\u3002</p>\n",3"<h2>Rectified Adam</h2>\n": "<h2>\u7ea0\u6b63\u4e86\u4e9a\u5f53</h2>\n",4"<h3>Approximating <span translate=no>_^_0_^_</span></h3>\n": "<h3>\u8fd1\u4f3c\u503c<span translate=no>_^_0_^_</span></h3>\n",5"<h3>Calculate rectification term <span translate=no>_^_0_^_</span></h3>\n": "<h3>\u8ba1\u7b97\u6574\u6539\u671f\u9650<span translate=no>_^_0_^_</span></h3>\n",6"<h3>Do the <em>RAdam</em> parameter update</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the parameter tensor <span translate=no>_^_3_^_</span> </li>\n<li><span translate=no>_^_4_^_</span> and <span translate=no>_^_5_^_</span> are the uncorrected first and second moments <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span>; i.e. <span translate=no>_^_8_^_</span> and <span translate=no>_^_9_^_</span> without bias correction</li></ul>\n": "<h3>\u662f\u5426\u66f4\u65b0 R <em>adAM</em> \u53c2\u6570</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u662f\u53c2\u6570\uff08\u5f20\u91cf\uff09\u7684\u4f18\u5316\u5668\u72b6\u6001</li>\n<li><span translate=no>_^_1_^_</span>\u5b58\u50a8\u53c2\u6570\u7ec4\u7684\u4f18\u5316\u7a0b\u5e8f\u5c5e\u6027</li>\n<li><span translate=no>_^_2_^_</span>\u662f\u53c2\u6570\u5f20\u91cf<span translate=no>_^_3_^_</span></li>\n<li><span translate=no>_^_4_^_</span>\u548c<span translate=no>_^_5_^_</span>\u662f\u672a\u6821\u6b63\u7684\u7b2c\u4e00\u4e2a\u548c\u7b2c\u4e8c\u4e2a\u65f6\u523b<span translate=no>_^_6_^_</span><span translate=no>_^_7_^_</span>\uff1b\u5373<span translate=no>_^_8_^_</span>\u548c<span translate=no>_^_9_^_</span>\u6ca1\u6709\u504f\u5dee\u6821\u6b63</li></ul>\n",7"<h3>Exponential moving average as simple moving average</h3>\n": "<h3>\u6307\u6570\u79fb\u52a8\u5e73\u5747\u7ebf\u4f5c\u4e3a\u7b80\u5355\u79fb\u52a8\u5e73\u5747\u7ebf</h3>\n",8"<h3>Initialize the optimizer</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the list of parameters </li>\n<li><span translate=no>_^_1_^_</span> is the learning rate <span translate=no>_^_2_^_</span> </li>\n<li><span translate=no>_^_3_^_</span> is a tuple of (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) </li>\n<li><span translate=no>_^_6_^_</span> is <span translate=no>_^_7_^_</span> or <span translate=no>_^_8_^_</span> based on <span translate=no>_^_9_^_</span> </li>\n<li><span translate=no>_^_10_^_</span> is an instance of class <span translate=no>_^_11_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> </li>\n<li><span translate=no>_^_13_^_</span> is a flag whether to optimize the bias correction of the second moment by doing it after adding <span translate=no>_^_14_^_</span> </li>\n<li><span translate=no>_^_15_^_</span> is a flag indicating whether to use AMSGrad or fallback to plain Adam </li>\n<li><span translate=no>_^_16_^_</span> whether to use sgd when the rectification term 
<span translate=no>_^_17_^_</span> is intractable. </li>\n<li><span translate=no>_^_18_^_</span> is a dictionary of default for group values. This is useful when you want to extend the class <span translate=no>_^_19_^_</span>.</li></ul>\n": "<h3>\u521d\u59cb\u5316\u4f18\u5316\u5668</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u662f\u53c2\u6570\u5217\u8868</li>\n<li><span translate=no>_^_1_^_</span>\u662f\u5b66\u4e60\u7387<span translate=no>_^_2_^_</span></li>\n<li><span translate=no>_^_3_^_</span>\u662f (<span translate=no>_^_4_^_</span>,<span translate=no>_^_5_^_</span>) \u7684\u5143\u7ec4</li>\n<li><span translate=no>_^_6_^_</span>\u662f<span translate=no>_^_7_^_</span>\u6216<span translate=no>_^_8_^_</span>\u57fa\u4e8e<span translate=no>_^_9_^_</span></li>\n<li><span translate=no>_^_10_^_</span>\u662f\u5728\u4e2d<span translate=no>_^_11_^_</span>\u5b9a\u4e49\u7684\u7c7b\u7684\u5b9e\u4f8b <a href=\"index.html\"><span translate=no>_^_12_^_</span></a></li>\n<li><span translate=no>_^_13_^_</span>\u662f\u4e00\u4e2a\u6807\u5fd7\uff0c\u662f\u5426\u5728\u6dfb\u52a0\u540e\u901a\u8fc7\u8fd9\u6837\u505a\u6765\u4f18\u5316\u7b2c\u4e8c\u4e2a\u65f6\u523b\u7684\u504f\u5dee\u6821\u6b63<span translate=no>_^_14_^_</span></li>\n<li><span translate=no>_^_15_^_</span>\u662f\u4e00\u4e2a\u6807\u5fd7\uff0c\u6307\u793a\u662f\u4f7f\u7528 AmsGrad \u8fd8\u662f\u56de\u9000\u5230\u666e\u901a\u7684 Adam</li>\n<li><span translate=no>_^_16_^_</span>\u7ea0\u6b63\u672f\u8bed<span translate=no>_^_17_^_</span>\u96be\u4ee5\u5904\u7406\u65f6\u662f\u5426\u4f7f\u7528 sgd\u3002</li>\n<li><span translate=no>_^_18_^_</span>\u662f\u7ec4\u503c\u7684\u9ed8\u8ba4\u5b57\u5178\u3002\u5f53\u4f60\u60f3\u6269\u5c55\u7c7b\u65f6\uff0c\u8fd9\u5f88\u6709\u7528<span translate=no>_^_19_^_</span>\u3002</li></ul>\n",9"<h3>Plot <span translate=no>_^_0_^_</span> against <span translate=no>_^_1_^_</span> for various <span translate=no>_^_2_^_</span></h3>\n<p><span translate=no>_^_3_^_</span></p>\n": "<h3>\u9634\u8c0b<span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u5bf9\u6297\u5404\u79cd<span translate=no>_^_2_^_</span></h3>\n<p><span translate=no>_^_3_^_</span></p>\n",10"<h3>Rectification term</h3>\n": "<h3>\u6574\u6539\u671f\u9650</h3>\n",11"<h3>Rectification</h3>\n": "<h3>\u6574\u6539</h3>\n",12"<h3>Scaled inverse chi-squared</h3>\n": "<h3>\u7f29\u653e\u53cd\u5411\u5361\u65b9</h3>\n",13"<h3>Take an update step for a given parameter tensor</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the current gradient tensor <span translate=no>_^_3_^_</span> for the parameter <span translate=no>_^_4_^_</span> </li>\n<li><span translate=no>_^_5_^_</span> is the parameter tensor <span translate=no>_^_6_^_</span></li></ul>\n": "<h3>\u5bf9\u7ed9\u5b9a\u53c2\u6570\u5f20\u91cf\u6267\u884c\u66f4\u65b0\u6b65\u9aa4</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u662f\u53c2\u6570\uff08\u5f20\u91cf\uff09\u7684\u4f18\u5316\u5668\u72b6\u6001</li>\n<li><span translate=no>_^_1_^_</span>\u5b58\u50a8\u53c2\u6570\u7ec4\u7684\u4f18\u5316\u7a0b\u5e8f\u5c5e\u6027</li>\n<li><span translate=no>_^_2_^_</span>\u662f\u53c2\u6570\u7684\u5f53\u524d\u68af<span translate=no>_^_3_^_</span>\u5ea6\u5f20\u91cf<span translate=no>_^_4_^_</span></li>\n<li><span translate=no>_^_5_^_</span>\u662f\u53c2\u6570\u5f20\u91cf<span translate=no>_^_6_^_</span></li></ul>\n",14"<p><a 
href=\"https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution\">Scaled inverse chi-squared</a> is the distribution of squared inverse of mean of <span translate=no>_^_0_^_</span> normal distributions. <span translate=no>_^_1_^_</span> where <span translate=no>_^_2_^_</span>.</p>\n": "<p><a href=\"https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution\">\u7f29\u653e\u9006\u5361\u65b9</a>\u662f<span translate=no>_^_0_^_</span>\u6b63\u6001\u5206\u5e03\u5747\u503c\u7684\u9006\u5e73\u65b9\u5206\u5e03\u3002<span translate=no>_^_1_^_</span>\u5728\u54ea\u91cc<span translate=no>_^_2_^_</span>\u3002</p>\n",15"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",16"<p><span translate=no>_^_0_^_</span> is tractable when <span translate=no>_^_1_^_</span>. We are being a little more conservative since it's an approximated value </p>\n": "<p><span translate=no>_^_0_^_</span>\u4ec0\u4e48\u65f6\u5019\u662f\u53ef\u4ee5\u5904\u7406<span translate=no>_^_1_^_</span>\u7684\u3002\u6211\u4eec\u7a0d\u5fae\u4fdd\u5b88\u4e00\u70b9\uff0c\u56e0\u4e3a\u5b83\u662f\u8fd1\u4f3c\u503c</p>\n",17"<p>Adam optimizer sometimes converges to a bad local optima during the initial stages of the training; especially when training transformers. Researches use warmups to counter this; for the the initial training steps (warm-up stage) they use a low learning rate. This paper identifies the problem to be the high variance of adaptive learning rate during initial stages of training, and counters it using a new rectification term to reduce variance.</p>\n": "<p>\u5728\u8bad\u7ec3\u7684\u521d\u59cb\u9636\u6bb5\uff0cAdam optimizer \u6709\u65f6\u4f1a\u6536\u655b\u5230\u7cdf\u7cd5\u7684\u5c40\u90e8\u6700\u4f73\u503c\uff1b\u5c24\u5176\u662f\u5728\u8bad\u7ec3\u53d8\u5f62\u91d1\u521a\u65f6\u3002\u7814\u7a76\u4f7f\u7528\u70ed\u8eab\u6765\u5e94\u5bf9\u8fd9\u79cd\u60c5\u51b5\uff1b\u5bf9\u4e8e\u6700\u521d\u7684\u8bad\u7ec3\u6b65\u9aa4\uff08\u70ed\u8eab\u9636\u6bb5\uff09\uff0c\u4ed6\u4eec\u4f7f\u7528\u8f83\u4f4e\u7684\u5b66\u4e60\u7387\u3002\u672c\u6587\u5c06\u95ee\u9898\u786e\u5b9a\u4e3a\u8bad\u7ec3\u521d\u59cb\u9636\u6bb5\u81ea\u9002\u5e94\u5b66\u4e60\u7387\u7684\u9ad8\u65b9\u5dee\uff0c\u5e76\u4f7f\u7528\u65b0\u7684\u6821\u6b63\u672f\u8bed\u6765\u51cf\u5c11\u65b9\u5dee\u3002</p>\n",18"<p>Bias correction term for <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> </p>\n": "<p>\u504f\u5dee\u6821\u6b63\u672f\u8bed<span translate=no>_^_0_^_</span>\uff0c<span translate=no>_^_1_^_</span></p>\n",19"<p>Calculate <span translate=no>_^_0_^_</span> the number of optimizer steps </p>\n": "<p><span translate=no>_^_0_^_</span>\u8ba1\u7b97\u4f18\u5316\u5668\u6b65\u6570</p>\n",20"<p>Calculate weight decay </p>\n": "<p>\u8ba1\u7b97\u4f53\u91cd\u8870\u51cf</p>\n",21"<p>Computation without optimization </p>\n": "<p>\u65e0\u9700\u4f18\u5316\u7684\u8ba1\u7b97</p>\n",22"<p>Denominator <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5206\u6bcd<span translate=no>_^_0_^_</span></p>\n",23"<p>From <span translate=no>_^_0_^_</span> distribution we have,</p>\n": "<p>\u4ece<span translate=no>_^_0_^_</span>\u5206\u53d1\u6765\u770b\uff0c</p>\n",24"<p>From above we have <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span>. 
Note that <span translate=no>_^_2_^_</span> here is the standard deviation and different from <span translate=no>_^_3_^_</span> for momentum.</p>\n": "<p>\u4ece\u4e0a\u9762\u770b\uff0c\u6211\u4eec\u6709<span translate=no>_^_0_^_</span>\u54ea\u91cc<span translate=no>_^_1_^_</span>\u3002\u8bf7\u6ce8\u610f\uff0c<span translate=no>_^_2_^_</span>\u8fd9\u91cc\u662f\u6807\u51c6\u5dee\uff0c\u4e0e\u52a8<span translate=no>_^_3_^_</span>\u91cf\u4e0d\u540c\u3002</p>\n",25"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>\u83b7\u53d6<span translate=no>_^_0_^_</span>\u548c<span translate=no>_^_1_^_</span></p>\n",26"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span>; i.e. <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> without bias correction </p>\n": "<p>Get<span translate=no>_^_0_^_</span> an<span translate=no>_^_1_^_</span> d; \u5373<span translate=no>_^_3_^_</span>\u4e0d<span translate=no>_^_2_^_</span>\u8fdb\u884c\u504f\u5dee\u6821\u6b63</p>\n",27"<p>Get learning rate </p>\n": "<p>\u83b7\u53d6\u5b66\u4e60\u7387</p>\n",28"<p>Here we are taking the simple moving average of the last <span translate=no>_^_0_^_</span> gradients. <span translate=no>_^_1_^_</span> satisfies the following,</p>\n": "<p>\u8fd9\u91cc\u6211\u4eec\u53d6\u6700\u540e\u4e00\u4e2a<span translate=no>_^_0_^_</span>\u68af\u5ea6\u7684\u7b80\u5355\u79fb\u52a8\u5e73\u5747\u7ebf\u3002<span translate=no>_^_1_^_</span>\u6ee1\u8db3\u4ee5\u4e0b\u6761\u4ef6\uff0c</p>\n",29"<p>If <span translate=no>_^_0_^_</span> is intractable </p>\n": "<p>\u5982\u679c<span translate=no>_^_0_^_</span>\u662f\u68d8\u624b\u7684</p>\n",30"<p>If <span translate=no>_^_0_^_</span> is intractable do a SGD with momentum </p>\n": "<p>\u5982\u679c<span translate=no>_^_0_^_</span>\u96be\u4ee5\u89e3\u51b3\uff0c\u90a3\u5c31\u7528\u52bf\u5934\u505a\u65b0\u52a0\u5761\u5143</p>\n",31"<p>In order to ensure that the adaptive learning rate <span translate=no>_^_0_^_</span> has consistent variance, we rectify the variance with <span translate=no>_^_1_^_</span></p>\n": "<p>\u4e3a\u4e86\u786e\u4fdd\u81ea\u9002\u5e94\u5b66\u4e60\u7387<span translate=no>_^_0_^_</span>\u5177\u6709\u4e00\u81f4\u7684\u65b9\u5dee\uff0c\u6211\u4eec\u4f7f\u7528\u4ee5\u4e0b\u65b9\u6cd5\u6821\u6b63\u65b9\u5dee<span translate=no>_^_1_^_</span></p>\n",32"<p>Let <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> be the functions to calculate momentum and adaptive learning rate. 
For Adam, they are</p>\n": "<p>\u8ba9<span translate=no>_^_0_^_</span>\u548c<span translate=no>_^_1_^_</span>\u6210\u4e3a\u8ba1\u7b97\u52a8\u91cf\u548c\u81ea\u9002\u5e94\u5b66\u4e60\u901f\u7387\u7684\u51fd\u6570\u3002\u5bf9\u4e9a\u5f53\u6765\u8bf4\uff0c\u4ed6\u4eec\u662f</p>\n",33"<p>Perform <em>RAdam</em> update </p>\n": "<p>\u6267\u884c <em>raDAM</em> \u66f4\u65b0</p>\n",34"<p>Step size <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6b65\u957f<span translate=no>_^_0_^_</span></p>\n",35"<p>The distribution of exponential moving average can be approximated as a simple moving average.</p>\n": "<p>\u6307\u6570\u79fb\u52a8\u5e73\u5747\u7ebf\u7684\u5206\u5e03\u53ef\u4ee5\u8fd1\u4f3c\u4e3a\u7b80\u5355\u79fb\u52a8\u5e73\u5747\u7ebf\u3002</p>\n",36"<p>The paper also evaluates two variance reduction mechanisms: <em> <strong>Adam-2k</strong>: Only compute the adaptive learning rate (<span translate=no>_^_0_^_</span> in <a href=\"adam.html\">Adam</a>) during the first 2k steps, without changing parameters or calculating momentum (<span translate=no>_^_1_^_</span>). </em> <strong>Adam-eps</strong>: Adam with large <span translate=no>_^_2_^_</span>.</p>\n": "<p>\u672c\u6587\u8fd8\u8bc4\u4f30\u4e86\u4e24\u79cd\u65b9\u5dee\u7f29\u51cf\u673a\u5236\uff1a<em><strong>adam-2K</strong>\uff1a\u4ec5\u8ba1\u7b97\u524d 2k \u6b65\u957f\u7684\u81ea\u9002\u5e94\u5b66\u4e60\u7387\uff08<span translate=no>_^_0_^_</span>\u5728 <a href=\"adam.html\">Adam</a> \u4e2d\uff09\uff0c\u800c\u4e0d\u66f4\u6539\u53c2\u6570\u6216\u8ba1\u7b97\u52a8\u91cf\uff08<span translate=no>_^_1_^_</span>)\u3002</em><strong>adam-eps</strong>\uff1aAdam \u5f88\u5927<span translate=no>_^_2_^_</span>\u3002</p>\n",37"<p>Therefore the variance is minimized at maximal <span translate=no>_^_0_^_</span> which is <span translate=no>_^_1_^_</span>. 
Let the minimum variance be <span translate=no>_^_2_^_</span></p>\n": "<p>\u56e0\u6b64\uff0c\u65b9\u5dee\u6700\u5c0f\u5316<span translate=no>_^_0_^_</span>\u4e3a\u6700\u5927\u503c<span translate=no>_^_1_^_</span>\u3002\u8ba9\u6700\u5c0f\u65b9\u5dee\u4e3a<span translate=no>_^_2_^_</span></p>\n",38"<p>They estimate <span translate=no>_^_0_^_</span> based on first order expansion of <span translate=no>_^_1_^_</span> \ud83e\udd2a I didn't get how it was derived.</p>\n": "<p>\u4ed6\u4eec<span translate=no>_^_0_^_</span>\u6839\u636e\u4e00\u9636\u6269\u5f20\u4f30\u8ba1<span translate=no>_^_1_^_</span> \ud83e\udd2a \u6211\u4e0d\u660e\u767d\u5b83\u662f\u5982\u4f55\u5f97\u51fa\u7684\u3002</p>\n",39"<p>They prove that variance of <span translate=no>_^_0_^_</span> decreases with <span translate=no>_^_1_^_</span> when <span translate=no>_^_2_^_</span>.</p>\n": "<p>\u4ed6\u4eec\u8bc1\u660e\u4e86\u968f\u65f6\u95f4\u53d8\u5316\u7684\u53d8\u5316<span translate=no>_^_0_^_</span>\u800c<span translate=no>_^_1_^_</span>\u964d\u4f4e<span translate=no>_^_2_^_</span>\u3002</p>\n",40"<p>This gives,</p>\n": "<p>\u8fd9\u7ed9\u4e86\uff0c</p>\n",41"<p>This implementation is based on <a href=\"https://github.com/LiyuanLucasLiu/RAdam\">the official implementation</a> of the paper <a href=\"https://arxiv.org/abs/1908.03265\">On the Variance of the Adaptive Learning Rate and Beyond</a>.</p>\n": "<p>\u8be5\u5b9e\u65bd\u57fa\u4e8e<a href=\"https://github.com/LiyuanLucasLiu/RAdam\">\u300a<a href=\"https://arxiv.org/abs/1908.03265\">\u81ea\u9002\u5e94\u5b66\u4e60\u7387\u53ca\u4ee5\u540e\u7684\u5dee\u5f02\u300b\u4e00</a>\u6587\u7684\u6b63\u5f0f\u5b9e\u65bd</a>\u3002</p>\n",42"<p>Update parameters <span translate=no>_^_0_^_</span> </p>\n": "<p>\u66f4\u65b0\u53c2\u6570<span translate=no>_^_0_^_</span></p>\n",43"<p>We have implemented it in <a href=\"https://pytorch.org\">PyTorch</a> as an extension to <a href=\"amsgrad.html\">our AMSGrad implementation</a> thus requiring only the modifications to be implemented.</p>\n": "<p>\u6211\u4eec\u5df2\u7ecf\u5728 <a href=\"https://pytorch.org\">PyTorch</a> \u4e2d\u5b9e\u73b0\u4e86\u5b83\uff0c\u4f5c\u4e3a<a href=\"amsgrad.html\">\u6211\u4eec\u7684 AmsGrad</a> \u5b9e\u73b0\u7684\u6269\u5c55\uff0c\u56e0\u6b64\u53ea\u9700\u8981\u5b9e\u65bd\u4fee\u6539\u5373\u53ef\u3002</p>\n",44"<p>We have</p>\n": "<p>\u6211\u4eec\u6709</p>\n",45"<p>Whether to optimize the computation by combining scalar computations </p>\n": "<p>\u662f\u5426\u901a\u8fc7\u7ec4\u5408\u6807\u91cf\u8ba1\u7b97\u6765\u4f18\u5316\u8ba1\u7b97</p>\n",46"<p>where <span translate=no>_^_0_^_</span> is <span translate=no>_^_1_^_</span> for <span translate=no>_^_2_^_</span>. 
Lt <span translate=no>_^_3_^_</span> and step <span translate=no>_^_4_^_</span> be <span translate=no>_^_5_^_</span>, and <span translate=no>_^_6_^_</span> be the rectification term at step <span translate=no>_^_7_^_</span>.</p>\n": "<p>\u5728<span translate=no>_^_0_^_</span>\u54ea<span translate=no>_^_1_^_</span>\u91cc<span translate=no>_^_2_^_</span>\u3002Lt<span translate=no>_^_3_^_</span> and step<span translate=no>_^_4_^_</span> be<span translate=no>_^_5_^_</span>\uff0c\u7136\u540e<span translate=no>_^_6_^_</span>\u6210\u4e3a step \u7684\u6574\u6539\u671f\u9650<span translate=no>_^_7_^_</span>\u3002</p>\n",47"<p>which gives, <span translate=no>_^_0_^_</span></p>\n": "<p>\u8fd9\u7ed9\u4e86\uff0c<span translate=no>_^_0_^_</span></p>\n",48"<span translate=no>_^_0_^_</span>": "<span translate=no>_^_0_^_</span>",49"A simple PyTorch implementation/tutorial of RAdam optimizer.": "\u4e00\u4e2a\u7b80\u5355\u7684 PyTorch \u5b9e\u73b0/RadAM \u4f18\u5316\u5668\u6559\u7a0b\u3002",50"Rectified Adam (RAdam) optimizer": "\u6821\u6b63\u4e9a\u5f53 (raDAM) \u4f18\u5316\u5668"51}5253