Path: blob/master/translate_cache/optimizers/radam.ja.json
{1"<h1>Rectified Adam (RAdam) optimizer</h1>\n": "<h1>\u4fee\u6b63\u3055\u308c\u305f\u30a2\u30c0\u30e0 (RaDAM) \u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc</h1>\n",2"<h2>Rectified Adam Optimizer</h2>\n<p>This class extends from AMSAdam optimizer defined in <a href=\"amsadam.html\"><span translate=no>_^_0_^_</span></a>.</p>\n": "<h2>\u30ec\u30af\u30c6\u30a3\u30d5\u30a1\u30a4\u30c9\u30fb\u30a2\u30c0\u30e0\u30fb\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc</h2>\n<p>\u3053\u306e\u30af\u30e9\u30b9\u306f\u3001\u3067\u5b9a\u7fa9\u3055\u308c\u3066\u3044\u308b AmSadam \u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u3092\u62e1\u5f35\u3057\u305f\u3082\u306e\u3067\u3059\u3002<a href=\"amsadam.html\"><span translate=no>_^_0_^_</span></a></p>\n",3"<h2>Rectified Adam</h2>\n": "<h2>\u6b63\u7fa9\u306e\u30a2\u30c0\u30e0</h2>\n",4"<h3>Approximating <span translate=no>_^_0_^_</span></h3>\n": "<h3>\u304a\u304a\u3088\u305d\u306e\u5024 <span translate=no>_^_0_^_</span></h3>\n",5"<h3>Calculate rectification term <span translate=no>_^_0_^_</span></h3>\n": "<h3>\u4fee\u6b63\u671f\u9593\u306e\u8a08\u7b97 <span translate=no>_^_0_^_</span></h3>\n",6"<h3>Do the <em>RAdam</em> parameter update</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the parameter tensor <span translate=no>_^_3_^_</span> </li>\n<li><span translate=no>_^_4_^_</span> and <span translate=no>_^_5_^_</span> are the uncorrected first and second moments <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span>; i.e. <span translate=no>_^_8_^_</span> and <span translate=no>_^_9_^_</span> without bias correction</li></ul>\n": "<h3><em>RadAM</em> \u30d1\u30e9\u30e1\u30fc\u30bf\u306e\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u306f\u30d1\u30e9\u30e1\u30fc\u30bf\u30fc (\u30c6\u30f3\u30bd\u30eb) \u306e\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc\u72b6\u614b\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u30d1\u30e9\u30e1\u30fc\u30bf\u30b0\u30eb\u30fc\u30d7\u306e\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u5c5e\u6027\u3092\u683c\u7d0d\u3057\u307e\u3059</li>\n<li><span translate=no>_^_2_^_</span>\u306f\u30d1\u30e9\u30e1\u30fc\u30bf\u30c6\u30f3\u30bd\u30eb <span translate=no>_^_3_^_</span></li>\n<li><span translate=no>_^_4_^_</span><span translate=no>_^_6_^_</span>\u672a\u88dc\u6b63\u306e\u7b2c1\u30e2\u30fc\u30e1\u30f3\u30c8\u3068\u7b2c2\u30e2\u30fc\u30e1\u30f3\u30c8\u3067\u3001<span translate=no>_^_7_^_</span><span translate=no>_^_8_^_</span><span translate=no>_^_9_^_</span>\u30d0\u30a4\u30a2\u30b9\u88dc\u6b63\u306a\u3057 <span translate=no>_^_5_^_</span></li></ul>\n",7"<h3>Exponential moving average as simple moving average</h3>\n": "<h3>\u5358\u7d14\u79fb\u52d5\u5e73\u5747\u3068\u3057\u3066\u306e\u6307\u6570\u79fb\u52d5\u5e73\u5747</h3>\n",8"<h3>Initialize the optimizer</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the list of parameters </li>\n<li><span translate=no>_^_1_^_</span> is the learning rate <span translate=no>_^_2_^_</span> </li>\n<li><span translate=no>_^_3_^_</span> is a tuple of (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) </li>\n<li><span translate=no>_^_6_^_</span> is <span translate=no>_^_7_^_</span> or <span translate=no>_^_8_^_</span> based on <span translate=no>_^_9_^_</span> </li>\n<li><span translate=no>_^_10_^_</span> is an instance of class 
<span translate=no>_^_11_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> </li>\n<li><span translate=no>_^_13_^_</span> is a flag whether to optimize the bias correction of the second moment by doing it after adding <span translate=no>_^_14_^_</span> </li>\n<li><span translate=no>_^_15_^_</span> is a flag indicating whether to use AMSGrad or fallback to plain Adam </li>\n<li><span translate=no>_^_16_^_</span> whether to use sgd when the rectification term <span translate=no>_^_17_^_</span> is intractable. </li>\n<li><span translate=no>_^_18_^_</span> is a dictionary of default for group values. This is useful when you want to extend the class <span translate=no>_^_19_^_</span>.</li></ul>\n": "<h3>\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u3092\u521d\u671f\u5316</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u306f\u30d1\u30e9\u30e1\u30fc\u30bf\u306e\u30ea\u30b9\u30c8\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u306f\u5b66\u7fd2\u7387 <span translate=no>_^_2_^_</span></li>\n<li><span translate=no>_^_3_^_</span>(,) <span translate=no>_^_4_^_</span> \u306e\u30bf\u30d7\u30eb\u3067\u3059 <span translate=no>_^_5_^_</span></li>\n<li><span translate=no>_^_6_^_</span><span translate=no>_^_7_^_</span><span translate=no>_^_8_^_</span>\u307e\u305f\u306f\u305d\u308c\u306b\u57fa\u3065\u3044\u3066\u3044\u308b <span translate=no>_^_9_^_</span></li>\n<li><span translate=no>_^_10_^_</span><span translate=no>_^_11_^_</span>\u3067\u5b9a\u7fa9\u3055\u308c\u3066\u3044\u308b\u30af\u30e9\u30b9\u306e\u30a4\u30f3\u30b9\u30bf\u30f3\u30b9\u3067\u3059 <a href=\"index.html\"><span translate=no>_^_12_^_</span></a></li>\n<li><span translate=no>_^_13_^_</span>\u30bb\u30ab\u30f3\u30c9\u30e2\u30fc\u30e1\u30f3\u30c8\u306e\u30d0\u30a4\u30a2\u30b9\u88dc\u6b63\u3092\u52a0\u7b97\u3057\u3066\u304b\u3089\u884c\u3046\u3053\u3068\u3067\u6700\u9069\u5316\u3059\u308b\u304b\u5426\u304b\u306e\u30d5\u30e9\u30b0\u3067\u3059 <span translate=no>_^_14_^_</span></li>\n<li><span translate=no>_^_15_^_</span>amsGrad\u3092\u4f7f\u7528\u3059\u308b\u304b\u3001\u30d7\u30ec\u30fc\u30f3\u306aAdam\u306b\u30d5\u30a9\u30fc\u30eb\u30d0\u30c3\u30af\u3059\u308b\u304b\u3092\u793a\u3059\u30d5\u30e9\u30b0\u3067\u3059</li>\n<li><span translate=no>_^_16_^_</span><span translate=no>_^_17_^_</span>\u4fee\u6b63\u9805\u304c\u6271\u3044\u306b\u304f\u3044\u5834\u5408\u306b sgd \u3092\u4f7f\u3046\u304b\u3069\u3046\u304b\u3002</li>\n<li><span translate=no>_^_18_^_</span>\u30b0\u30eb\u30fc\u30d7\u5024\u306e\u30c7\u30d5\u30a9\u30eb\u30c8\u8f9e\u66f8\u3067\u3059\u3002\u3053\u308c\u306f\u3001\u30af\u30e9\u30b9\u3092\u62e1\u5f35\u3059\u308b\u5834\u5408\u306b\u4fbf\u5229\u3067\u3059<span translate=no>_^_19_^_</span>\u3002</li></ul>\n",9"<h3>Plot <span translate=no>_^_0_^_</span> against <span translate=no>_^_1_^_</span> for various <span translate=no>_^_2_^_</span></h3>\n<p><span translate=no>_^_3_^_</span></p>\n": "<h3><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u3055\u307e\u3056\u307e\u306a\u30d7\u30ed\u30c3\u30c8\u5bfe\u8c61 <span translate=no>_^_2_^_</span></h3>\n<p><span translate=no>_^_3_^_</span></p>\n",10"<h3>Rectification term</h3>\n": "<h3>\u4fee\u6b63\u671f\u9593</h3>\n",11"<h3>Rectification</h3>\n": "<h3>\u6574\u6d41</h3>\n",12"<h3>Scaled inverse chi-squared</h3>\n": "<h3>\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u3055\u308c\u305f\u9006\u30ab\u30a4\u4e8c\u4e57</h3>\n",13"<h3>Take an update step for a given parameter tensor</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the 
parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the current gradient tensor <span translate=no>_^_3_^_</span> for the parameter <span translate=no>_^_4_^_</span> </li>\n<li><span translate=no>_^_5_^_</span> is the parameter tensor <span translate=no>_^_6_^_</span></li></ul>\n": "<h3>\u4e0e\u3048\u3089\u308c\u305f\u30d1\u30e9\u30e1\u30fc\u30bf\u30c6\u30f3\u30bd\u30eb\u306e\u66f4\u65b0\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b</h3>\n<ul><li><span translate=no>_^_0_^_</span>\u306f\u30d1\u30e9\u30e1\u30fc\u30bf\u30fc (\u30c6\u30f3\u30bd\u30eb) \u306e\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc\u72b6\u614b\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u30d1\u30e9\u30e1\u30fc\u30bf\u30b0\u30eb\u30fc\u30d7\u306e\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u5c5e\u6027\u3092\u683c\u7d0d\u3057\u307e\u3059</li>\n<li><span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u30d1\u30e9\u30e1\u30fc\u30bf\u306e\u73fe\u5728\u306e\u52fe\u914d\u30c6\u30f3\u30bd\u30eb\u3067\u3059 <span translate=no>_^_4_^_</span></li>\n<li><span translate=no>_^_5_^_</span>\u306f\u30d1\u30e9\u30e1\u30fc\u30bf\u30c6\u30f3\u30bd\u30eb <span translate=no>_^_6_^_</span></li></ul>\n",14"<p><a href=\"https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution\">Scaled inverse chi-squared</a> is the distribution of squared inverse of mean of <span translate=no>_^_0_^_</span> normal distributions. <span translate=no>_^_1_^_</span> where <span translate=no>_^_2_^_</span>.</p>\n": "<p><a href=\"https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution\">\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u3055\u308c\u305f\u9006\u30ab\u30a4\u4e8c\u4e57\u306f</a>\u3001\u6b63\u898f\u5206\u5e03\u306e\u5e73\u5747\u306e\u4e8c\u4e57\u9006\u6570\u306e\u5206\u5e03\u3067\u3059\u3002<span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u3069\u3053<span translate=no>_^_2_^_</span>\u3002</p>\n",15"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",16"<p><span translate=no>_^_0_^_</span> is tractable when <span translate=no>_^_1_^_</span>. We are being a little more conservative since it's an approximated value </p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u3069\u3093\u306a\u3068\u304d\u3067\u3082\u6271\u3044\u3084\u3059\u3044\u3002\u304a\u304a\u3088\u305d\u306e\u5024\u306a\u306e\u3067\u3001\u3082\u3046\u5c11\u3057\u4fdd\u5b88\u7684\u306b\u3057\u3066\u3044\u307e\u3059</p>\n",17"<p>Adam optimizer sometimes converges to a bad local optima during the initial stages of the training; especially when training transformers. Researches use warmups to counter this; for the the initial training steps (warm-up stage) they use a low learning rate. 
This paper identifies the problem to be the high variance of adaptive learning rate during initial stages of training, and counters it using a new rectification term to reduce variance.</p>\n": "<p>\u30a2\u30c0\u30e0\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc\u306f\u3001\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u306e\u521d\u671f\u6bb5\u968e\u3001\u7279\u306b\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3057\u3066\u3044\u308b\u3068\u304d\u306b\u3001\u4e0d\u9069\u5207\u306a\u5c40\u6240\u6700\u9069\u5024\u306b\u53ce\u675f\u3059\u308b\u3053\u3068\u304c\u3042\u308a\u307e\u3059\u3002\u7814\u7a76\u8005\u306f\u3053\u308c\u306b\u5bfe\u6297\u3059\u308b\u305f\u3081\u306b\u30a6\u30a9\u30fc\u30e0\u30a2\u30c3\u30d7\u3092\u4f7f\u3044\u307e\u3059\u3002\u6700\u521d\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30b9\u30c6\u30c3\u30d7\uff08\u30a6\u30a9\u30fc\u30e0\u30a2\u30c3\u30d7\u6bb5\u968e\uff09\u3067\u306f\u4f4e\u3044\u5b66\u7fd2\u7387\u3092\u4f7f\u3044\u307e\u3059\u3002\u672c\u7a3f\u3067\u306f\u3001\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u306e\u521d\u671f\u6bb5\u968e\u306b\u304a\u3051\u308b\u9069\u5fdc\u5b66\u7fd2\u7387\u306e\u3070\u3089\u3064\u304d\u304c\u5927\u304d\u3044\u3068\u3044\u3046\u554f\u984c\u3092\u7279\u5b9a\u3057\u3001\u5206\u6563\u3092\u6e1b\u3089\u3059\u305f\u3081\u306e\u65b0\u3057\u3044\u4fee\u6b63\u9805\u3092\u7528\u3044\u3066\u305d\u306e\u554f\u984c\u306b\u5bfe\u51e6\u3057\u3066\u3044\u307e\u3059</p>\u3002\n",18"<p>Bias correction term for <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u306e\u30d0\u30a4\u30a2\u30b9\u88dc\u6b63\u7528\u8a9e <span translate=no>_^_1_^_</span></p>\n",19"<p>Calculate <span translate=no>_^_0_^_</span> the number of optimizer steps </p>\n": "<p>\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc\u30b9\u30c6\u30c3\u30d7\u6570\u306e\u8a08\u7b97 <span translate=no>_^_0_^_</span></p>\n",20"<p>Calculate weight decay </p>\n": "<p>\u4f53\u91cd\u6e1b\u5c11\u306e\u8a08\u7b97</p>\n",21"<p>Computation without optimization </p>\n": "<p>\u6700\u9069\u5316\u306a\u3057\u306e\u8a08\u7b97</p>\n",22"<p>Denominator <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5206\u6bcd <span translate=no>_^_0_^_</span></p>\n",23"<p>From <span translate=no>_^_0_^_</span> distribution we have,</p>\n": "<p><span translate=no>_^_0_^_</span>\u79c1\u305f\u3061\u304c\u6301\u3063\u3066\u3044\u308b\u30c7\u30a3\u30b9\u30c8\u30ea\u30d3\u30e5\u30fc\u30b7\u30e7\u30f3\u304b\u3089\u3001</p>\n",24"<p>From above we have <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span>. Note that <span translate=no>_^_2_^_</span> here is the standard deviation and different from <span translate=no>_^_3_^_</span> for momentum.</p>\n": "<p>\u4e0a\u304b\u3089\u898b\u308b\u3068\u3001<span translate=no>_^_0_^_</span>\u5834\u6240\u304c\u308f\u304b\u308a\u307e\u3059<span translate=no>_^_1_^_</span>\u3002<span translate=no>_^_2_^_</span>\u3053\u308c\u306f\u6a19\u6e96\u504f\u5dee\u3067\u3042\u308a\u3001<span translate=no>_^_3_^_</span>\u904b\u52d5\u91cf\u3068\u306f\u7570\u306a\u308b\u3053\u3068\u306b\u6ce8\u610f\u3057\u3066\u304f\u3060\u3055\u3044</p>\u3002\n",25"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u53d6\u5f97\u3057\u3066 <span translate=no>_^_1_^_</span></p>\n",26"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span>; i.e. 
<span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> without bias correction </p>\n": "<p>Get <span translate=no>_^_0_^_</span> \u3068<span translate=no>_^_1_^_</span>; \u3064\u307e\u308a<span translate=no>_^_2_^_</span>\u3001<span translate=no>_^_3_^_</span>\u30d0\u30a4\u30a2\u30b9\u88dc\u6b63\u306a\u3057</p>\n",27"<p>Get learning rate </p>\n": "<p>\u5b66\u7fd2\u7387\u3092\u53d6\u5f97</p>\n",28"<p>Here we are taking the simple moving average of the last <span translate=no>_^_0_^_</span> gradients. <span translate=no>_^_1_^_</span> satisfies the following,</p>\n": "<p>\u3053\u3053\u3067\u306f\u3001<span translate=no>_^_0_^_</span>\u6700\u5f8c\u306e\u52fe\u914d\u306e\u5358\u7d14\u79fb\u52d5\u5e73\u5747\u3092\u53d6\u3063\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u4ee5\u4e0b\u3092\u6e80\u305f\u3057\u3001</p>\n",29"<p>If <span translate=no>_^_0_^_</span> is intractable </p>\n": "<p><span translate=no>_^_0_^_</span>\u6cbb\u308a\u306b\u304f\u3044\u5834\u5408</p>\n",30"<p>If <span translate=no>_^_0_^_</span> is intractable do a SGD with momentum </p>\n": "<p><span translate=no>_^_0_^_</span>\u624b\u306b\u8ca0\u3048\u306a\u3044\u306a\u3089\u52e2\u3044\u3092\u3064\u3051\u3066SGD\u3092\u3084\u308a\u307e\u3057\u3087\u3046</p>\n",31"<p>In order to ensure that the adaptive learning rate <span translate=no>_^_0_^_</span> has consistent variance, we rectify the variance with <span translate=no>_^_1_^_</span></p>\n": "<p><span translate=no>_^_0_^_</span>\u9069\u5fdc\u578b\u5b66\u7fd2\u7387\u306e\u3070\u3089\u3064\u304d\u304c\u4e00\u8cab\u3057\u3066\u3044\u308b\u3053\u3068\u3092\u78ba\u8a8d\u3059\u308b\u305f\u3081\u306b\u3001\u5dee\u7570\u3092\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u4fee\u6b63\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",32"<p>Let <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> be the functions to calculate momentum and adaptive learning rate. For Adam, they are</p>\n": "<p><span translate=no>_^_1_^_</span>\u904b\u52d5\u91cf\u3068\u9069\u5fdc\u5b66\u7fd2\u7387\u3092\u8a08\u7b97\u3059\u308b\u95a2\u6570\u3068\u3057\u307e\u3057\u3087\u3046<span translate=no>_^_0_^_</span>\u3002\u30a2\u30c0\u30e0\u306b\u3068\u3063\u3066\u3001\u5f7c\u3089\u306f</p>\n",33"<p>Perform <em>RAdam</em> update </p>\n": "<p><em>RaDAM \u30a2\u30c3\u30d7\u30c7\u30fc\u30c8\u3092\u5b9f\u884c</em></p>\n",34"<p>Step size <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30b9\u30c6\u30c3\u30d7\u30b5\u30a4\u30ba <span translate=no>_^_0_^_</span></p>\n",35"<p>The distribution of exponential moving average can be approximated as a simple moving average.</p>\n": "<p>\u6307\u6570\u79fb\u52d5\u5e73\u5747\u306e\u5206\u5e03\u306f\u3001\u5358\u7d14\u306a\u79fb\u52d5\u5e73\u5747\u3068\u3057\u3066\u8fd1\u4f3c\u3067\u304d\u307e\u3059\u3002</p>\n",36"<p>The paper also evaluates two variance reduction mechanisms: <em> <strong>Adam-2k</strong>: Only compute the adaptive learning rate (<span translate=no>_^_0_^_</span> in <a href=\"adam.html\">Adam</a>) during the first 2k steps, without changing parameters or calculating momentum (<span translate=no>_^_1_^_</span>). 
</em> <strong>Adam-eps</strong>: Adam with large <span translate=no>_^_2_^_</span>.</p>\n": "<p>\u3053\u306e\u8ad6\u6587\u3067\u306f\u30012\u3064\u306e\u5206\u6563\u524a\u6e1b\u30e1\u30ab\u30cb\u30ba\u30e0\u306b\u3064\u3044\u3066\u3082\u8a55\u4fa1\u3057\u3066\u3044\u307e\u3059\u3002<em><strong>Adam-2k</strong>\uff1a\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u5909\u66f4\u3057\u305f\u308a\u3001\u904b\u52d5\u91cf\u3092\u8a08\u7b97\u3057\u305f\u308a\u305b\u305a\u306b\u3001<span translate=no>_^_0_^_</span>\u6700\u521d\u306e2k\u30b9\u30c6\u30c3\u30d7\u3067\u306f\uff08<a href=\"adam.html\">Adam\u3067</a>\uff09\u9069\u5fdc\u5b66\u7fd2\u7387\u306e\u307f\u3092\u8a08\u7b97\u3057\u307e\u3059</em>\uff08\uff09\u3002<span translate=no>_^_1_^_</span><strong>Adam-EPS</strong>: \u30a2\u30c0\u30e0\u30fb\u30a6\u30a3\u30ba\u30fb\u30e9\u30fc\u30b8\u30fb\u30a6\u30a3\u30ba\u30fb\u30e9\u30fc\u30b8</p>. <span translate=no>_^_2_^_</span>\n",37"<p>Therefore the variance is minimized at maximal <span translate=no>_^_0_^_</span> which is <span translate=no>_^_1_^_</span>. Let the minimum variance be <span translate=no>_^_2_^_</span></p>\n": "<p>\u3057\u305f\u304c\u3063\u3066\u3001\u5206\u6563\u306f\u6700\u5927\u5024<span translate=no>_^_0_^_</span>\u3001\u3064\u307e\u308a\u3067\u6700\u5c0f\u5316\u3055\u308c\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u6700\u5c0f\u5206\u6563\u3092\u6b21\u306e\u5f0f\u306b\u3057\u307e\u3057\u3087\u3046 <span translate=no>_^_2_^_</span></p>\n",38"<p>They estimate <span translate=no>_^_0_^_</span> based on first order expansion of <span translate=no>_^_1_^_</span> \ud83e\udd2a I didn't get how it was derived.</p>\n": "<p>\u3069\u3046\u5c0e\u304d\u51fa\u3055\u308c\u305f\u306e\u304b\u308f\u304b\u3089\u306a\u304b\u3063\u305f <span translate=no>_^_1_^_</span> \ud83e\udd2a <span translate=no>_^_0_^_</span> \u306e\u4e00\u6b21\u5c55\u958b\u306b\u57fa\u3065\u3044\u3066\u898b\u7a4d\u3082\u3063\u3066\u3044\u307e\u3059\u3002</p>\n",39"<p>They prove that variance of <span translate=no>_^_0_^_</span> decreases with <span translate=no>_^_1_^_</span> when <span translate=no>_^_2_^_</span>.</p>\n": "<p><span translate=no>_^_0_^_</span>\u6642\u9593\u3068\u3068\u3082\u306b\u3070\u3089\u3064\u304d\u304c\u5c0f\u3055\u304f\u306a\u308b\u3053\u3068\u3092\u8a3c\u660e\u3057\u3066\u3044\u307e\u3059<span translate=no>_^_1_^_</span>\u3002<span translate=no>_^_2_^_</span></p>\n",40"<p>This gives,</p>\n": "<p>\u3053\u308c\u306b\u3088\u308a\u3001</p>\n",41"<p>This implementation is based on <a href=\"https://github.com/LiyuanLucasLiu/RAdam\">the official implementation</a> of the paper <a href=\"https://arxiv.org/abs/1908.03265\">On the Variance of the Adaptive Learning Rate and Beyond</a>.</p>\n": "<p>\u3053\u306e\u5b9f\u88c5\u306f<a href=\"https://github.com/LiyuanLucasLiu/RAdam\">\u3001\u300c<a href=\"https://arxiv.org/abs/1908.03265\">\u9069\u5fdc\u5b66\u7fd2\u7387\u3068\u305d\u306e\u5f8c\u306e\u5dee\u7570\u306b\u95a2\u3059\u308b\u8ad6\u6587\u300d\u306e\u516c\u5f0f\u5b9f\u88c5\u306b\u57fa\u3065\u3044\u3066\u3044\u307e\u3059</a></a>\u3002</p>\n",42"<p>Update parameters <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u66f4\u65b0 <span translate=no>_^_0_^_</span></p>\n",43"<p>We have implemented it in <a href=\"https://pytorch.org\">PyTorch</a> as an extension to <a href=\"amsgrad.html\">our AMSGrad implementation</a> thus requiring only the modifications to be implemented.</p>\n": "<p><a 
href=\"https://pytorch.org\">amsGrad\u5b9f\u88c5\u306e\u62e1\u5f35\u3068\u3057\u3066PyTorch\u306b\u5b9f\u88c5\u3057\u305f\u306e\u3067</a><a href=\"amsgrad.html\">\u3001\u5b9f\u88c5\u3059\u308b\u5fc5\u8981\u304c\u3042\u308b\u306e\u306f\u5909\u66f4\u3060\u3051\u3067\u3059</a>\u3002</p>\n",44"<p>We have</p>\n": "<p>\u79c1\u305f\u3061\u306f\u6301\u3063\u3066\u3044\u307e\u3059</p>\n",45"<p>Whether to optimize the computation by combining scalar computations </p>\n": "<p>\u30b9\u30ab\u30e9\u30fc\u8a08\u7b97\u3092\u7d44\u307f\u5408\u308f\u305b\u3066\u8a08\u7b97\u3092\u6700\u9069\u5316\u3059\u308b\u304b\u3069\u3046\u304b</p>\n",46"<p>where <span translate=no>_^_0_^_</span> is <span translate=no>_^_1_^_</span> for <span translate=no>_^_2_^_</span>. Lt <span translate=no>_^_3_^_</span> and step <span translate=no>_^_4_^_</span> be <span translate=no>_^_5_^_</span>, and <span translate=no>_^_6_^_</span> be the rectification term at step <span translate=no>_^_7_^_</span>.</p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u3069\u3053\u304c<span translate=no>_^_2_^_</span>.<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u4e00\u6b69\u3092\u8e0f\u307f\u51fa\u3057\u3066<span translate=no>_^_5_^_</span>\u3001<span translate=no>_^_6_^_</span>\u6bb5\u968e\u7684\u306a\u4fee\u6b63\u9805\u306b\u306a\u308a\u306a\u3055\u3044</p>\u3002<span translate=no>_^_7_^_</span>\n",47"<p>which gives, <span translate=no>_^_0_^_</span></p>\n": "<p>\u3053\u308c\u306b\u3088\u308a\u3001<span translate=no>_^_0_^_</span></p>\n",48"<span translate=no>_^_0_^_</span>": "<span translate=no>_^_0_^_</span>",49"A simple PyTorch implementation/tutorial of RAdam optimizer.": "RaDAM \u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc\u306e\u7c21\u5358\u306a PyTorch \u5b9f\u88c5/\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3002",50"Rectified Adam (RAdam) optimizer": "\u4fee\u6b63\u3055\u308c\u305f\u30a2\u30c0\u30e0 (RaDAM) \u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc"51}5253