GitHub Repository: labmlai/annotated_deep_learning_paper_implementations
Path: blob/master/translate_cache/optimizers/amsgrad.zh.json
{
"<h1>AMSGrad</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of the paper <a href=\"https://arxiv.org/abs/1904.09237\">On the Convergence of Adam and Beyond</a>.</p>\n<p>We implement this as an extension to our <a href=\"adam.html\">Adam optimizer implementation</a>. The implementation it self is really small since it&#x27;s very similar to Adam.</p>\n<p>We also have an implementation of the synthetic example described in the paper where Adam fails to converge.</p>\n": "<h1>AMSGrad</h1>\n<p>这是论文《<a href=\"https://arxiv.org/abs/1904.09237\">On the Convergence of Adam and Beyond</a>》的 <a href=\"https://pytorch.org\">PyTorch</a> 实现。</p>\n<p>我们将其实现为我们的 <a href=\"adam.html\">Adam 优化器实现</a>的扩展。实现本身非常小，因为它与 Adam 非常相似。</p>\n<p>我们还实现了论文中描述的合成示例，在该示例中 Adam 无法收敛。</p>\n",
"<h2>AMSGrad Optimizer</h2>\n<p>This class extends from Adam optimizer defined in <a href=\"adam.html\"><span translate=no>_^_0_^_</span></a>. Adam optimizer is extending the class <span translate=no>_^_1_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_2_^_</span></a>.</p>\n": "<h2>AMSGrad 优化器</h2>\n<p>这个类扩展自 <a href=\"adam.html\"><span translate=no>_^_0_^_</span></a> 中定义的 Adam 优化器。Adam 优化器又扩展了 <a href=\"index.html\"><span translate=no>_^_2_^_</span></a> 中定义的 <span translate=no>_^_1_^_</span> 类。</p>\n",
"<h2>Synthetic Experiment</h2>\n<p>This is the synthetic experiment described in the paper, that shows a scenario where <em>Adam</em> fails.</p>\n<p>The paper (and Adam) formulates the problem of optimizing as minimizing the expected value of a function, <span translate=no>_^_0_^_</span> with respect to the parameters <span translate=no>_^_1_^_</span>. In the stochastic training setting we do not get hold of the function <span translate=no>_^_2_^_</span> it self; that is, when you are optimizing a NN <span translate=no>_^_3_^_</span> would be the function on entire batch of data. What we actually evaluate is a mini-batch so the actual function is realization of the stochastic <span translate=no>_^_4_^_</span>. This is why we are talking about an expected value. So let the function realizations be <span translate=no>_^_5_^_</span> for each time step of training.</p>\n<p>We measure the performance of the optimizer as the regret, <span translate=no>_^_6_^_</span> where <span translate=no>_^_7_^_</span> is the parameters at time step <span translate=no>_^_8_^_</span>, and <span translate=no>_^_9_^_</span> is the optimal parameters that minimize <span translate=no>_^_10_^_</span>.</p>\n<p>Now lets define the synthetic problem,</p>\n<span translate=no>_^_11_^_</span><p>where <span translate=no>_^_12_^_</span>. The optimal solution is <span translate=no>_^_13_^_</span>.</p>\n<p>This code will try running <em>Adam</em> and <em>AMSGrad</em> on this problem.</p>\n": "<h2>合成实验</h2>\n<p>这是论文中描述的合成实验，它展示了 <em>Adam</em> 失败的一个场景。</p>\n<p>论文（以及 Adam）把优化问题表述为：关于参数 <span translate=no>_^_1_^_</span> 最小化函数 <span translate=no>_^_0_^_</span> 的期望值。在随机训练设置中，我们无法得到函数 <span translate=no>_^_2_^_</span> 本身；也就是说，当你优化一个神经网络时，<span translate=no>_^_3_^_</span> 是定义在整批数据上的函数。我们实际评估的是一个小批量，因此实际的函数是随机函数 <span translate=no>_^_4_^_</span> 的一次实现。这就是我们谈论期望值的原因。因此，设训练中每个时间步的函数实现为 <span translate=no>_^_5_^_</span>。</p>\n<p>我们用遗憾（regret）<span translate=no>_^_6_^_</span> 来衡量优化器的性能，其中 <span translate=no>_^_7_^_</span> 是时间步 <span translate=no>_^_8_^_</span> 的参数，<span translate=no>_^_9_^_</span> 是使 <span translate=no>_^_10_^_</span> 最小化的最优参数。</p>\n<p>现在让我们定义这个合成问题，</p>\n<span translate=no>_^_11_^_</span><p>其中 <span translate=no>_^_12_^_</span>。最优解是 <span translate=no>_^_13_^_</span>。</p>\n<p>这段代码将在这个问题上分别运行 <em>Adam</em> 和 <em>AMSGrad</em>。</p>\n",
"<h3><span translate=no>_^_0_^_</span></h3>\n": "<h3><span translate=no>_^_0_^_</span></h3>\n",
"<h3>Calculate <span translate=no>_^_0_^_</span> and and <span translate=no>_^_1_^_</span> or <span translate=no>_^_2_^_</span></h3>\n<ul><li><span translate=no>_^_3_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_4_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_5_^_</span> is the current gradient tensor <span translate=no>_^_6_^_</span> for the parameter <span translate=no>_^_7_^_</span></li></ul>\n": "<h3>计算 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span> 或 <span translate=no>_^_2_^_</span></h3>\n<ul><li><span translate=no>_^_3_^_</span> 是参数（张量）的优化器状态</li>\n<li><span translate=no>_^_4_^_</span> 存储参数组的优化器属性</li>\n<li><span translate=no>_^_5_^_</span> 是参数 <span translate=no>_^_7_^_</span> 的当前梯度张量 <span translate=no>_^_6_^_</span></li></ul>\n",
"<h3>Initialize a parameter state</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the parameter tensor <span translate=no>_^_3_^_</span></li></ul>\n": "<h3>初始化参数状态</h3>\n<ul><li><span translate=no>_^_0_^_</span> 是参数（张量）的优化器状态</li>\n<li><span translate=no>_^_1_^_</span> 存储参数组的优化器属性</li>\n<li><span translate=no>_^_2_^_</span> 是参数张量 <span translate=no>_^_3_^_</span></li></ul>\n",
"<h3>Initialize the optimizer</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the list of parameters </li>\n<li><span translate=no>_^_1_^_</span> is the learning rate <span translate=no>_^_2_^_</span> </li>\n<li><span translate=no>_^_3_^_</span> is a tuple of (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) </li>\n<li><span translate=no>_^_6_^_</span> is <span translate=no>_^_7_^_</span> or <span translate=no>_^_8_^_</span> based on <span translate=no>_^_9_^_</span> </li>\n<li><span translate=no>_^_10_^_</span> is an instance of class <span translate=no>_^_11_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> </li>\n<li>&#x27;optimized_update&#x27; is a flag whether to optimize the bias correction of the second moment by doing it after adding <span translate=no>_^_13_^_</span> </li>\n<li><span translate=no>_^_14_^_</span> is a flag indicating whether to use AMSGrad or fallback to plain Adam </li>\n<li><span translate=no>_^_15_^_</span> is a dictionary of default for group values. This is useful when you want to extend the class <span translate=no>_^_16_^_</span>.</li></ul>\n": "<h3>初始化优化器</h3>\n<ul><li><span translate=no>_^_0_^_</span> 是参数列表</li>\n<li><span translate=no>_^_1_^_</span> 是学习率 <span translate=no>_^_2_^_</span></li>\n<li><span translate=no>_^_3_^_</span> 是 (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) 的元组</li>\n<li><span translate=no>_^_6_^_</span> 根据 <span translate=no>_^_9_^_</span> 取 <span translate=no>_^_7_^_</span> 或 <span translate=no>_^_8_^_</span></li>\n<li><span translate=no>_^_10_^_</span> 是 <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> 中定义的 <span translate=no>_^_11_^_</span> 类的实例</li>\n<li>'optimized_update' 是一个标志，表示是否通过在加上 <span translate=no>_^_13_^_</span> 之后再做第二矩的偏差校正来进行优化</li>\n<li><span translate=no>_^_14_^_</span> 是一个标志，指示使用 AMSGrad 还是回退到普通的 Adam</li>\n<li><span translate=no>_^_15_^_</span> 是参数组默认值的字典。当你想扩展 <span translate=no>_^_16_^_</span> 类时，这很有用。</li></ul>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Calculate <span translate=no>_^_0_^_</span>.</p>\n<p>🤔 I feel you should be taking / maintaining the max of the bias corrected second exponential average of squared gradient. But this is how it&#x27;s <a href=\"https://github.com/pytorch/pytorch/blob/19f4c5110e8bcad5e7e75375194262fca0a6293a/torch/optim/functional.py#L90\">implemented in PyTorch also</a>. I guess it doesn&#x27;t really matter since bias correction only increases the value and it only makes an actual difference during the early few steps of the training. </p>\n": "<p>计算 <span translate=no>_^_0_^_</span>。</p>\n<p>🤔 我觉得应该取/维护的是经偏差校正后的平方梯度二阶指数平均的最大值。不过 <a href=\"https://github.com/pytorch/pytorch/blob/19f4c5110e8bcad5e7e75375194262fca0a6293a/torch/optim/functional.py#L90\">PyTorch 中也是这样实现的</a>。我想这并不太要紧，因为偏差校正只会增大数值，而且只在训练最初的几步中会产生实际差异。</p>\n",
"<p>Calculate gradients </p>\n": "<p>计算梯度</p>\n",
"<p>Call <span translate=no>_^_0_^_</span> of Adam optimizer which we are extending </p>\n": "<p>调用我们所扩展的 Adam 优化器的 <span translate=no>_^_0_^_</span></p>\n",
"<p>Clear gradients </p>\n": "<p>清除梯度</p>\n",
"<p>Create experiment to record results </p>\n": "<p>创建实验以记录结果</p>\n",
"<p>Define <span translate=no>_^_0_^_</span> parameter </p>\n": "<p>定义 <span translate=no>_^_0_^_</span> 参数</p>\n",
"<p>Fall back to <em>Adam</em> if the parameter group is not using <span translate=no>_^_0_^_</span> </p>\n": "<p>如果参数组未使用 <span translate=no>_^_0_^_</span>，则回退到 <em>Adam</em></p>\n",
"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> from <em>Adam</em> </p>\n": "<p>从 <em>Adam</em> 中获取 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span></p>\n",
"<p>Get <span translate=no>_^_0_^_</span>.</p>\n<p>🗒 The paper uses the notation <span translate=no>_^_1_^_</span> for this, which we don&#x27;t use that here because it confuses with the Adam&#x27;s usage of the same notation for bias corrected exponential moving average. </p>\n": "<p>获取 <span translate=no>_^_0_^_</span>。</p>\n<p>🗒 论文用符号 <span translate=no>_^_1_^_</span> 表示它，我们在这里不这样用，因为它会与 Adam 中用同一符号表示经偏差校正的指数移动平均相混淆。</p>\n",
"<p>If <span translate=no>_^_0_^_</span> flag is <span translate=no>_^_1_^_</span> for this parameter group, we maintain the maximum of exponential moving average of squared gradient </p>\n": "<p>如果此参数组的 <span translate=no>_^_0_^_</span> 标志为 <span translate=no>_^_1_^_</span>，我们就维护梯度平方指数移动平均的最大值</p>\n",
"<p>If this parameter group is using <span translate=no>_^_0_^_</span> </p>\n": "<p>如果此参数组正在使用 <span translate=no>_^_0_^_</span></p>\n",
"<p>Initialize the relevant optimizer </p>\n": "<p>初始化相关的优化器</p>\n",
"<p>Make sure <span translate=no>_^_0_^_</span> </p>\n": "<p>请确保 <span translate=no>_^_0_^_</span></p>\n",
"<p>Optimal, <span translate=no>_^_0_^_</span> </p>\n": "<p>最优值，<span translate=no>_^_0_^_</span></p>\n",
"<p>Optimize </p>\n": "<p>优化</p>\n",
"<p>Run for <span translate=no>_^_0_^_</span> steps </p>\n": "<p>运行 <span translate=no>_^_0_^_</span> 步</p>\n",
"<p>Run the synthetic experiment is <em>AMSGrad</em> You can see that AMSGrad converges to true optimal <span translate=no>_^_0_^_</span> </p>\n": "<p>用 <em>AMSGrad</em> 运行合成实验。可以看到 AMSGrad 收敛到真正的最优值 <span translate=no>_^_0_^_</span></p>\n",
"<p>Run the synthetic experiment is <em>Adam</em>. You can see that Adam converges at <span translate=no>_^_0_^_</span> </p>\n": "<p>用 <em>Adam</em> 运行合成实验。可以看到 Adam 收敛于 <span translate=no>_^_0_^_</span></p>\n",
"<p>Track results every 1,000 steps </p>\n": "<p>每 1000 步跟踪一次结果</p>\n",
"A simple PyTorch implementation/tutorial of AMSGrad optimizer.": "AMSGrad 优化器的简单 PyTorch 实现/教程。",
"AMSGrad Optimizer": "AMSGrad 优化器"
}
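The strings above describe how AMSGrad maintains the maximum of the exponential moving average of the squared gradient and uses that maximum, rather than Adam's current second-moment estimate, in the update denominator. A minimal sketch of that step for a single parameter tensor, written against plain PyTorch with illustrative names (`amsgrad_step` and the state-dict keys are assumptions, not this repository's own optimizer API), might look like:

```python
# Hedged sketch: one AMSGrad update applied to a single parameter tensor.
import torch


def amsgrad_step(param: torch.Tensor, grad: torch.Tensor, state: dict,
                 lr: float = 1e-3, betas=(0.9, 0.999), eps: float = 1e-8):
    """Apply one AMSGrad update to `param` in place."""
    if not state:                                           # lazy state initialization
        state['step'] = 0
        state['exp_avg'] = torch.zeros_like(param)          # m_t
        state['exp_avg_sq'] = torch.zeros_like(param)       # v_t
        state['max_exp_avg_sq'] = torch.zeros_like(param)   # max over v_1..v_t

    beta1, beta2 = betas
    state['step'] += 1
    m, v, v_max = state['exp_avg'], state['exp_avg_sq'], state['max_exp_avg_sq']

    # Exponential moving averages of the gradient and of the squared gradient
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # AMSGrad: keep the element-wise maximum of v_t seen so far and use it in
    # the denominator, so the effective step size never increases.
    torch.maximum(v_max, v, out=v_max)

    # Bias corrections; note they are applied to the (uncorrected) maximum,
    # matching the 🤔 note above about how PyTorch does it as well.
    bias_c1 = 1 - beta1 ** state['step']
    bias_c2 = 1 - beta2 ** state['step']
    denom = (v_max / bias_c2).sqrt().add_(eps)
    param.data.addcdiv_(m, denom, value=-lr / bias_c1)
```

A training loop would call something like `amsgrad_step(p, p.grad, state[p])` for every parameter after `loss.backward()`; keeping one state dictionary per parameter mirrors the per-parameter optimizer state described in the strings above.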
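The synthetic-experiment strings above describe a problem on which Adam converges to the wrong point while AMSGrad reaches the true optimum. A small self-contained sketch, assuming a periodic variant of the paper's stochastic example and PyTorch's built-in `amsgrad` flag instead of this repository's optimizer (the constants and learning rate here are illustrative), could be:

```python
# Hedged sketch of the synthetic example where Adam fails to converge.
import torch


def synthetic_experiment(amsgrad: bool, steps: int = 500_000) -> float:
    # Single scalar parameter x, kept in [-1, 1] by the projection below.
    x = torch.nn.Parameter(torch.tensor(0.0))
    optimizer = torch.optim.Adam([x], lr=1e-2, betas=(0.9, 0.99), amsgrad=amsgrad)

    for t in range(1, steps + 1):
        optimizer.zero_grad()
        # f_t(x) = 1010 x once every 101 steps and -10 x otherwise; over each
        # period the average gradient is positive, so the minimizer on [-1, 1]
        # is x = -1.
        loss = 1010 * x if t % 101 == 1 else -10 * x
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            x.clamp_(-1.0, 1.0)     # project back onto the feasible set

    return x.item()


if __name__ == '__main__':
    print('Adam    ->', synthetic_experiment(amsgrad=False))  # drifts towards +1
    print('AMSGrad ->', synthetic_experiment(amsgrad=True))   # approaches the optimum -1
```

This mirrors the "Run the synthetic experiment" strings above: with `amsgrad=False` the iterate settles near +1, while with `amsgrad=True` it converges towards the true optimum -1.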