GitHub Repository: labmlai/annotated_deep_learning_paper_implementations
Path: blob/master/translate_cache/optimizers/sophia.zh.json
{
"<h1>Sophia Optimizer</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <em>Sophia-G</em> from paper <a href=\"https://arxiv.org/abs/2305.14342\">Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training</a>. Official implementation is available at <a href=\"https://github.com/Liuhong99/Sophia\">Liuhong99/Sophia</a>.</p>\n<p>Sophia is more adaptive to heterogeneous curvatures than Adam, more resistant to non-convexity and rapid change of Hessian than Newton\u2019s method, and also uses a low-cost pre-conditioner.</p>\n<p>Sophia keeps diagonal Hessian estimates with EMA across iterations. The diagonal Hessian <span translate=no>_^_0_^_</span> is calculated every <span translate=no>_^_1_^_</span> steps.</p>\n<span translate=no>_^_2_^_</span><p>Sophia uses EMA of gradients <span translate=no>_^_3_^_</span>, only considers positive entries of the diagonal Hessian and does per-coordinate clipping to the update.</p>\n<span translate=no>_^_4_^_</span><p>where <span translate=no>_^_5_^_</span> is a very small value to prevent division by <span translate=no>_^_6_^_</span>.</p>\n<h3>Gauss-Newton-Bartlett (GNB) estimator</h3>\n<span translate=no>_^_7_^_</span><p>where <span translate=no>_^_8_^_</span> are the inputs, <span translate=no>_^_9_^_</span> is the batch size (number of inputs/tokens), <span translate=no>_^_10_^_</span> is cross entropy loss, and <span translate=no>_^_11_^_</span> are sampled from the logits <span translate=no>_^_12_^_</span>.</p>\n<p>Note that this hessian estimate is always positive and therefore we can replace <span translate=no>_^_13_^_</span> with <span translate=no>_^_14_^_</span>.</p>\n<p>Sophia with Gauss-Newton-Bartlett (GNB) estimator is <strong>Sophia-G</strong></p>\n<p>Here is an <a href=\"../transformers/basic/with_sophia.html\">experiment</a> that uses Sophia-G to train a transformer.</p>\n": "<h1>Sophia Optimizer</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <em>Sophia-G</em> from paper <a href=\"https://arxiv.org/abs/2305.14342\">Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training</a>. Official implementation is available at <a href=\"https://github.com/Liuhong99/Sophia\">Liuhong99/Sophia</a>.</p>\n<p>Sophia is more adaptive to heterogeneous curvatures than Adam, more resistant to non-convexity and rapid change of Hessian than Newton\u2019s method, and also uses a low-cost pre-conditioner.</p>\n<p>Sophia keeps diagonal Hessian estimates with EMA across iterations. 
The diagonal Hessian <span translate=no>_^_0_^_</span> is calculated every <span translate=no>_^_1_^_</span> steps.</p>\n<span translate=no>_^_2_^_</span><p>Sophia uses EMA of gradients <span translate=no>_^_3_^_</span>, only considers positive entries of the diagonal Hessian and does per-coordinate clipping to the update.</p>\n<span translate=no>_^_4_^_</span><p>where <span translate=no>_^_5_^_</span> is a very small value to prevent division by <span translate=no>_^_6_^_</span>.</p>\n<h3>Gauss-Newton-Bartlett (GNB) estimator</h3>\n<span translate=no>_^_7_^_</span><p>where <span translate=no>_^_8_^_</span> are the inputs, <span translate=no>_^_9_^_</span> is the batch size (number of inputs/tokens), <span translate=no>_^_10_^_</span> is cross entropy loss, and <span translate=no>_^_11_^_</span> are sampled from the logits <span translate=no>_^_12_^_</span>.</p>\n<p>Note that this hessian estimate is always positive and therefore we can replace <span translate=no>_^_13_^_</span> with <span translate=no>_^_14_^_</span>.</p>\n<p>Sophia with Gauss-Newton-Bartlett (GNB) estimator is <strong>Sophia-G</strong></p>\n<p>Here is an <a href=\"../transformers/basic/with_sophia.html\">experiment</a> that uses Sophia-G to train a transformer.</p>\n",
"<h2>Sophia-G Optimizer</h2>\n<p>We extend the class <span translate=no>_^_0_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_1_^_</span></a> to implement the Sophia optimizer.</p>\n": "<h2>Sophia-G Optimizer</h2>\n<p>We extend the class <span translate=no>_^_0_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_1_^_</span></a> to implement the Sophia optimizer.</p>\n",
"<h3>Initialize a parameter state</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the parameter tensor <span translate=no>_^_3_^_</span></li></ul>\n": "<h3>Initialize a parameter state</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the parameter tensor <span translate=no>_^_3_^_</span></li></ul>\n",
"<h3>Initialize the optimizer</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the list of parameters </li>\n<li><span translate=no>_^_1_^_</span> is the maximum learning rate <span translate=no>_^_2_^_</span> </li>\n<li><span translate=no>_^_3_^_</span> is a tuple of (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) </li>\n<li><span translate=no>_^_6_^_</span> is <span translate=no>_^_7_^_</span> </li>\n<li><span translate=no>_^_8_^_</span> is <span translate=no>_^_9_^_</span> </li>\n<li><span translate=no>_^_10_^_</span> is an instance of class <span translate=no>_^_11_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> </li>\n<li><span translate=no>_^_13_^_</span> is a dictionary of default for group values. This is useful when you want to extend the class <span translate=no>_^_14_^_</span>.</li></ul>\n": "<h3>Initialize the optimizer</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the list of parameters </li>\n<li><span translate=no>_^_1_^_</span> is the maximum learning rate <span translate=no>_^_2_^_</span> </li>\n<li><span translate=no>_^_3_^_</span> is a tuple of (<span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>) </li>\n<li><span translate=no>_^_6_^_</span> is <span translate=no>_^_7_^_</span> </li>\n<li><span translate=no>_^_8_^_</span> is <span translate=no>_^_9_^_</span> </li>\n<li><span translate=no>_^_10_^_</span> is an instance of class <span translate=no>_^_11_^_</span> defined in <a href=\"index.html\"><span translate=no>_^_12_^_</span></a> </li>\n<li><span translate=no>_^_13_^_</span> is a dictionary of default for group values. This is useful when you want to extend the class <span translate=no>_^_14_^_</span>.</li></ul>\n",
"<h3>Take an update step for a given parameter tensor</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the current gradient tensor <span translate=no>_^_3_^_</span> for the parameter <span translate=no>_^_4_^_</span> </li>\n<li><span translate=no>_^_5_^_</span> is the parameter tensor <span translate=no>_^_6_^_</span></li></ul>\n<p>We do the following parameter update,</p>\n<span translate=no>_^_7_^_</span>": "<h3>Take an update step for a given parameter tensor</h3>\n<ul><li><span translate=no>_^_0_^_</span> is the optimizer state of the parameter (tensor) </li>\n<li><span translate=no>_^_1_^_</span> stores optimizer attributes of the parameter group </li>\n<li><span translate=no>_^_2_^_</span> is the current gradient tensor <span translate=no>_^_3_^_</span> for the parameter <span translate=no>_^_4_^_</span> </li>\n<li><span translate=no>_^_5_^_</span> is the parameter tensor <span translate=no>_^_6_^_</span></li></ul>\n<p>We do the following parameter update,</p>\n<span translate=no>_^_7_^_</span>",
"<h3>Update the EMA of Hessian diagonal <span translate=no>_^_0_^_</span></h3>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens/inputs in the batch <span translate=no>_^_2_^_</span></li></ul>\n<span translate=no>_^_3_^_</span>": "<h3>Update the EMA of Hessian diagonal <span translate=no>_^_0_^_</span></h3>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens/inputs in the batch <span translate=no>_^_2_^_</span></li></ul>\n<span translate=no>_^_3_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> </p>\n",
"<p>Calculate weight decay </p>\n": "<p>Calculate weight decay </p>\n",
"<p>Exponential moving average of Hessian diagonal, <span translate=no>_^_0_^_</span> </p>\n": "<p>Exponential moving average of Hessian diagonal, <span translate=no>_^_0_^_</span> </p>\n",
"<p>Exponential moving average of gradients, <span translate=no>_^_0_^_</span> </p>\n": "<p>Exponential moving average of gradients, <span translate=no>_^_0_^_</span> </p>\n",
"<p>Get <span translate=no>_^_0_^_</span> </p>\n": "<p>Get <span translate=no>_^_0_^_</span> </p>\n",
"<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>Get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n",
"<p>Get maximum learning rate <span translate=no>_^_0_^_</span> </p>\n": "<p>Get maximum learning rate <span translate=no>_^_0_^_</span> </p>\n",
"<p>Get optimizer state </p>\n": "<p>Get optimizer state </p>\n",
"<p>In-place calculation of <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>In-place calculation of <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n",
"<p>Increment <span translate=no>_^_0_^_</span> the number of optimizer steps </p>\n": "<p>Increment <span translate=no>_^_0_^_</span> the number of optimizer steps </p>\n",
"<p>Initialize state if empty </p>\n": "<p>Initialize state if empty </p>\n",
"<p>Iterate through parameter groups </p>\n": "<p>Iterate through parameter groups </p>\n",
"<p>Iterate through parameters </p>\n": "<p>Iterate through parameters </p>\n",
"<p>Skip parameters without gradients </p>\n": "<p>Skip parameters without gradients </p>\n",
"<p>This is the number of optimizer steps taken on the parameter, <span translate=no>_^_0_^_</span> </p>\n": "<p>This is the number of optimizer steps taken on the parameter, <span translate=no>_^_0_^_</span> </p>\n",
"<p>Update EMA Hessian diagonal</p>\n<span translate=no>_^_0_^_</span><p> </p>\n": "<p>Update EMA Hessian diagonal</p>\n<span translate=no>_^_0_^_</span><p> </p>\n",
"A simple PyTorch implementation/tutorial of Sophia optimizer": "A simple PyTorch implementation/tutorial of Sophia optimizer",
"Sophia Optimizer": "Sophia Optimizer"
}
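
The entries above mirror the prose of the rendered Sophia documentation page; the _^_n_^_ placeholders stand in for equations and code blocks that are not stored in this cache file. As a rough illustration of the update rule those entries describe, here is a minimal PyTorch sketch of a single Sophia-G parameter update. This is not the repository's code: the function name sophia_g_step, the argument layout and the default hyper-parameter values are assumptions made for the example.

import torch


def sophia_g_step(param: torch.Tensor, grad: torch.Tensor,
                  m: torch.Tensor, h: torch.Tensor,
                  lr: float = 1e-4, beta1: float = 0.965,
                  rho: float = 0.03, eps: float = 1e-12,
                  weight_decay: float = 0.1) -> None:
    """One Sophia-G update for a single parameter tensor (in place); hypothetical helper."""
    with torch.no_grad():
        # Decoupled (AdamW-style) weight decay
        param.mul_(1.0 - lr * weight_decay)
        # EMA of gradients: m <- beta1 * m + (1 - beta1) * grad
        m.mul_(beta1).add_(grad, alpha=1.0 - beta1)
        # Pre-condition by the non-negative part of the Hessian-diagonal EMA,
        # with eps preventing division by zero, then clip each coordinate so
        # that no single coordinate moves by more than lr.
        update = (m / (rho * h.clamp(min=0.0) + eps)).clamp(-1.0, 1.0)
        param.add_(update, alpha=-lr)

The per-coordinate clipping is what keeps the step bounded when the Hessian estimate is stale, tiny or zero: in flat directions the optimizer falls back to a sign-momentum-like step of size lr.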
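
The Gauss-Newton-Bartlett (GNB) section above estimates the Hessian diagonal by sampling labels from the model's own logits, taking the cross-entropy gradient against those sampled labels, squaring it per coordinate and scaling by the batch size. Below is a hedged sketch of that estimator, assuming a plain classifier whose output is a (batch, classes) logit tensor (a language model would flatten its (batch, sequence, vocabulary) logits first); gnb_hessian_diagonal is a hypothetical helper, not part of the annotated implementation.

import torch
import torch.nn.functional as F


def gnb_hessian_diagonal(model: torch.nn.Module, inputs: torch.Tensor) -> dict:
    """Return a {parameter: Hessian-diagonal estimate} mapping via the GNB trick; hypothetical helper."""
    params = [p for p in model.parameters() if p.requires_grad]
    logits = model(inputs)                                   # (batch, classes)
    batch_size = logits.shape[0]
    # Sample labels from the model's own predictive distribution
    probs = F.softmax(logits.detach(), dim=-1)
    sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    # Mean cross-entropy against the sampled labels
    loss = F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(loss, params)
    # GNB estimate: batch_size * (gradient of the sampled-label loss) ** 2,
    # non-negative in every coordinate by construction
    return {p: batch_size * g * g for p, g in zip(params, grads)}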
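
Finally, a toy end-to-end loop showing how the two sketches could fit together: the Hessian diagonal is re-estimated every k steps and folded into an EMA, and every step applies the clipped update. The linked experiment trains a transformer through the repository's own optimizer class instead; the model, data and constants below are placeholders.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)                      # toy classifier standing in for a transformer
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))

k, beta2 = 10, 0.99                                 # Hessian refresh interval and its EMA coefficient
m_state = {p: torch.zeros_like(p) for p in model.parameters()}
h_state = {p: torch.zeros_like(p) for p in model.parameters()}

for step in range(100):
    for p in model.parameters():
        p.grad = None
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    if step % k == 0:                               # refresh the Hessian-diagonal EMA every k steps
        hess = gnb_hessian_diagonal(model, x)       # sketched above
        for p in model.parameters():
            h_state[p].mul_(beta2).add_(hess[p], alpha=1 - beta2)
    for p in model.parameters():
        sophia_g_step(p, p.grad, m_state[p], h_state[p])  # sketched above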