GitHub Repository: labmlai/annotated_deep_learning_paper_implementations
Path: blob/master/translate_cache/transformers/mha.zh.json
{
"<h1>Multi-Headed Attention (MHA)</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/basic/autoregressive_experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n<p>This is a tutorial/implementation of multi-headed attention from paper <a href=\"https://arxiv.org/abs/1706.03762\">Attention Is All You Need</a> in <a href=\"https://pytorch.org/\">PyTorch</a>. The implementation is inspired from <a href=\"https://nlp.seas.harvard.edu/2018/04/03/attention.html\">Annotated Transformer</a>.</p>\n<p>Here is the <a href=\"basic/autoregressive_experiment.html\">training code</a> that uses a basic transformer with MHA for NLP auto-regression.</p>\n<p><a href=\"basic/autoregressive_experiment.html\">Here is an experiment implementation</a> that trains a simple transformer.</p>\n": "<h1>\u591a\u5934\u6ce8\u610f\u529b (MHA)</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/basic/autoregressive_experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n</a><p>\u8fd9\u662f\u8bba\u6587<a href=\"https://arxiv.org/abs/1706.03762\">\u300a Attention is All You Need \u300b</a>\u4e2d\u591a\u5934\u6ce8\u610f\u529b\u7684<a href=\"https://pytorch.org/\">PyTorch</a>\u6559\u7a0b/\u5b9e\u73b0\u3002\u8be5\u5b9e\u73b0\u7684\u7075\u611f\u6765\u81ea<a href=\"https://nlp.seas.harvard.edu/2018/04/03/attention.html\">\u300a\u5e26\u6ce8\u91ca\u7684 Transformer \u300b</a>\u3002</p><p>\u8fd9\u662f\u4f7f\u7528\u57fa\u7840 Transformer \u548c MHA \u8fdb\u884c NLP \u81ea\u56de\u5f52\u7684<a href=\"basic/autoregressive_experiment.html\">\u8bad\u7ec3\u4ee3\u7801</a>\u3002</p><p>\u8fd9\u662f\u4e00\u4e2a\u8bad\u7ec3\u7b80\u5355 Transformer \u7684<a href=\"basic/autoregressive_experiment.html\">\u4ee3\u7801\u5b9e\u73b0</a>\u3002</p>\n",
"<h3>Calculate scores between queries and keys</h3>\n<p>This method can be overridden for other variations like relative attention.</p>\n": "<h3>\u8ba1\u7b97 Qurey \u548c Key \u4e4b\u95f4\u7684\u5206\u6570</h3>\n<p>\u8fd9\u79cd\u65b9\u6cd5\u53ef\u4ee5\u540c\u6837\u9002\u7528\u4e8e\u5176\u4ed6\u53d8\u4f53\uff0c\u5982\u76f8\u5bf9\u6ce8\u610f\u529b\u3002</p>\n",
"<p> <a id=\"MHA\"></a></p>\n<h2>Multi-Head Attention Module</h2>\n<p>This computes scaled multi-headed attention for given <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> vectors.</p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>In simple terms, it finds keys that matches the query, and gets the values of those keys.</p>\n<p>It uses dot-product of query and key as the indicator of how matching they are. Before taking the <span translate=no>_^_4_^_</span> the dot-products are scaled by <span translate=no>_^_5_^_</span>. This is done to avoid large dot-product values causing softmax to give very small gradients when <span translate=no>_^_6_^_</span> is large.</p>\n<p>Softmax is calculated along the axis of of the sequence (or time).</p>\n": "<p><a id=\"MHA\"></a></p>\n<h2>\u591a\u5934\u6ce8\u610f\u529b\u6a21\u5757</h2>\n<p>\u8fd9\u5c06\u8ba1\u7b97\u7ed9\u51fa\u7684<span translate=no>_^_1_^_</span>\u3001<span translate=no>_^_2_^_</span>\u548c<span translate=no>_^_0_^_</span>\u5411\u91cf\u7f29\u653e\u540e\u7684\u591a\u5934\u6ce8\u610f\u529b\u3002</p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>\u7b80\u5355\u6765\u8bf4\uff0c\u5b83\u4f1a\u627e\u5230\u4e0e\u67e5\u8be2 (Query) \u5339\u914d\u7684\u952e (key)\uff0c\u5e76\u83b7\u53d6\u8fd9\u4e9b\u952e (Key) \u7684\u503c (Value) \u3002</p>\n<p>\u5b83\u4f7f\u7528\u67e5\u8be2\u548c\u952e\u7684\u70b9\u79ef\u4f5c\u4e3a\u8861\u91cf\u5b83\u4eec\u4e4b\u95f4\u5339\u914d\u7a0b\u5ea6\u7684\u6307\u6807\u3002\u5728\u8fdb\u884c<span translate=no>_^_4_^_</span>\u4e4b\u524d\uff0c\u70b9\u79ef\u4f1a\u4e58\u4ee5<span translate=no>_^_5_^_</span>\u3002\u8fd9\u6837\u505a\u662f\u4e3a\u4e86\u907f\u514d\u5f53<span translate=no>_^_6_^_</span>\u8f83\u5927\u65f6\uff0c\u5927\u7684\u70b9\u79ef\u503c\u5bfc\u81f4 Softmax \u64cd\u4f5c\u8f93\u51fa\u975e\u5e38\u5c0f\u7684\u68af\u5ea6\u3002</p>\n<p>Softmax \u662f\u6cbf\u5e8f\u5217\uff08\u6216\u65f6\u95f4\uff09\u8f74\u8ba1\u7b97\u7684\u3002</p>\n",
"<p> <a id=\"PrepareMHA\"></a></p>\n<h2>Prepare for multi-head attention</h2>\n<p>This module does a linear transformation and splits the vector into given number of heads for multi-head attention. This is used to transform <strong>key</strong>, <strong>query</strong>, and <strong>value</strong> vectors.</p>\n": "<p><a id=\"PrepareMHA\"></a></p>\n<h2>\u51c6\u5907\u591a\u5934\u6ce8\u610f\u529b</h2>\n<p>\u8be5\u90e8\u5206\u6267\u884c\u7ebf\u6027\u53d8\u6362\uff0c\u5e76\u5c06\u5411\u91cf\u5206\u5272\u6210\u7ed9\u5b9a\u6570\u91cf\u7684\u5934\u4ee5\u83b7\u5f97\u591a\u5934\u6ce8\u610f\u529b\u3002\u8fd9\u7528\u4e8e<strong>\u952e</strong>\u3001<strong>\u67e5\u8be2</strong>\u548c<strong>\u503c</strong>\u5411\u91cf\u3002</p>\n",
"<p> <span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span>, where first dimension is the query dimension. If the query dimension is equal to <span translate=no>_^_2_^_</span> it will be broadcasted.</p>\n": "<p><span translate=no>_^_0_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_1_^_</span>\uff0c\u5176\u4e2d\u7b2c\u4e00\u7ef4\u662f\u67e5\u8be2\u7ef4\u5ea6\u3002\u5982\u679c\u67e5\u8be2\u7ef4\u5ea6\u7b49\u4e8e<span translate=no>_^_2_^_</span>\uff0c\u5219\u4f1a\u8fdb\u884c\u5e7f\u64ad\u3002</p>\n",
"<p> <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> are the tensors that store collection of <em>query</em>, <em>key</em> and <em>value</em> vectors. They have shape <span translate=no>_^_3_^_</span>.</p>\n<p><span translate=no>_^_4_^_</span> has shape <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> indicates whether for batch <span translate=no>_^_7_^_</span>, query at position <span translate=no>_^_8_^_</span> has access to key-value at position <span translate=no>_^_9_^_</span>.</p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span>\u548c<span translate=no>_^_2_^_</span>\u662f\u5b58\u50a8<em>\u67e5\u8be2</em>\u3001<em>\u952e</em>\u548c<em>\u503c</em>\u5411\u91cf\u96c6\u5408\u7684\u5f20\u91cf\u3002\u5b83\u4eec\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_3_^_</span>\u3002</p>\n<p><span translate=no>_^_4_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_5_^_</span>\uff0c<span translate=no>_^_6_^_</span>\u8868\u793a\u6279\u6b21<span translate=no>_^_7_^_</span>\uff0c\u5728\u4f4d\u7f6e<span translate=no>_^_8_^_</span>\u5904\u67e5\u8be2\u662f\u5426\u6709\u6743\u8bbf\u95ee\u4f4d\u7f6e<span translate=no>_^_9_^_</span>\u5904\u7684\u952e\u503c\u5bf9\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> attention along the key sequence dimension <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5bf9 Key \u5e8f\u5217\u7ef4\u5ea6\u4e0a\u7684\u6ce8\u610f\u529b\u8fdb\u884c<span translate=no>_^_0_^_</span>\u64cd\u4f5c\uff0c<span translate=no>_^_1_^_</span></p>\n",
"<p><span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> have shape <span translate=no>_^_3_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\uff0c<span translate=no>_^_1_^_</span>\u548c<span translate=no>_^_2_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_3_^_</span></p>\n",
"<p>Apply dropout </p>\n": "<p>\u5e94\u7528 Dropout</p>\n",
"<p>Apply mask </p>\n": "<p>\u5e94\u7528\u63a9\u7801</p>\n",
"<p>Calculate <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> </p>\n": "<p>\u8ba1\u7b97<span translate=no>_^_0_^_</span>\u6216<span translate=no>_^_1_^_</span></p>\n",
"<p>Compute attention scores <span translate=no>_^_0_^_</span>. This gives a tensor of shape <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u8ba1\u7b97\u6ce8\u610f\u529b\u5206\u6570<span translate=no>_^_0_^_</span>\uff0c\u8fd9\u5c06\u5f97\u5230\u4e00\u4e2a\u5f62\u72b6\u4e3a<span translate=no>_^_1_^_</span>\u7684\u5f20\u91cf\u3002</p>\n",
"<p>Concatenate multiple heads </p>\n": "<p>\u8fde\u63a5\u591a\u4e2a\u5934</p>\n",
"<p>Dropout </p>\n": "<p>Dropout</p>\n",
"<p>Input has shape <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span>. We apply the linear transformation to the last dimension and split that into the heads. </p>\n": "<p>\u8f93\u5165\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_0_^_</span>\u6216<span translate=no>_^_1_^_</span>\u3002\u6211\u4eec\u5bf9\u6700\u540e\u4e00\u7ef4\u5e94\u7528\u7ebf\u6027\u53d8\u6362\uff0c\u5e76\u5c06\u5176\u5206\u4e3a\u591a\u4e2a\u5934\u3002</p>\n",
"<p>Linear layer for linear transform </p>\n": "<p>\u7ebf\u6027\u5c42\u7528\u4e8e\u7ebf\u6027\u53d8\u6362</p>\n",
"<p>Linear transform </p>\n": "<p>\u7ebf\u6027\u53d8\u6362</p>\n",
"<p>Multiply by values <span translate=no>_^_0_^_</span> </p>\n": "<p>\u4e58\u4ee5\u6570\u503c<span translate=no>_^_0_^_</span></p>\n",
"<p>Number of dimensions in vectors in each head </p>\n": "<p>\u6bcf\u4e2a\u5934\u90e8\u4e2d\u5411\u91cf\u7684\u7ef4\u5ea6\u6570\u91cf</p>\n",
"<p>Number of features per head </p>\n": "<p>\u6bcf\u4e2a\u5934\u90e8\u7684\u7279\u5f81\u6570\u91cf</p>\n",
"<p>Number of heads </p>\n": "<p>\u6ce8\u610f\u529b\u5934\u6570</p>\n",
"<p>Output has shape <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> </p>\n": "<p>\u8f93\u51fa\u5177\u6709\u5f62\u72b6<span translate=no>_^_0_^_</span>\u6216<span translate=no>_^_1_^_</span></p>\n",
"<p>Output layer </p>\n": "<p>\u8f93\u51fa\u5c42</p>\n",
"<p>Prepare <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> for attention computation. These will then have shape <span translate=no>_^_3_^_</span>. </p>\n": "<p>\u4e3a\u6ce8\u610f\u529b\u8ba1\u7b97\u51c6\u5907\u5411\u91cf<span translate=no>_^_0_^_</span>\uff0c<span translate=no>_^_1_^_</span>\u5e76<span translate=no>_^_2_^_</span>\u5b83\u4eec\u7684\u5f62\u72b6\u5c06\u53d8\u4e3a<span translate=no>_^_3_^_</span>\u3002</p>\n",
"<p>Same mask applied to all heads. </p>\n": "<p>\u6240\u6709\u7684\u5934\u90e8\u4f7f\u7528\u76f8\u540c\u7684\u63a9\u7801\u3002</p>\n",
"<p>Save attentions for any other calculations </p>\n": "<p>\u4e3a\u5176\u4ed6\u8ba1\u7b97\u4fdd\u5b58\u6ce8\u610f\u529b\u4fe1\u606f</p>\n",
"<p>Save attentions if debugging </p>\n": "<p>\u8c03\u8bd5\u65f6\u4fdd\u5b58\u6ce8\u610f\u529b\u4fe1\u606f</p>\n",
"<p>Scale scores <span translate=no>_^_0_^_</span> </p>\n": "<p>\u7f29\u653e\u5206\u6570<span translate=no>_^_0_^_</span></p>\n",
"<p>Scaling factor before the softmax </p>\n": "<p>Softmax \u4e4b\u524d\u7684\u7f29\u653e\u7cfb\u6570</p>\n",
"<p>Softmax for attention along the time dimension of <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5728\u952e\uff08 Key \uff09\u7684\u65f6\u95f4\u7ef4\u5ea6\u4e0a\u8fdb\u884c\u6ce8\u610f\u529b Softmax<span translate=no>_^_0_^_</span></p>\n",
"<p>Split last dimension into heads </p>\n": "<p>\u5c06\u6700\u540e\u4e00\u4e2a\u7ef4\u5ea6\u5206\u6210\u591a\u4e2a\u5934\u90e8</p>\n",
"<p>These transform the <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> vectors for multi-headed attention. </p>\n": "<p>\u8fd9\u4e9b\u5c06\u5bf9\u591a\u5934\u6ce8\u610f\u529b\u7684\u5411\u91cf<span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span>\u548c<span translate=no>_^_2_^_</span>\u8fdb\u884c\u8f6c\u6362\u3002</p>\n",
"<p>We store attentions so that it can be used for logging, or other computations if needed </p>\n": "<p>\u5b58\u50a8\u6ce8\u610f\u529b\u4fe1\u606f\uff0c\u4ee5\u4fbf\u5728\u9700\u8981\u65f6\u7528\u4e8e\u8bb0\u5f55\u6216\u5176\u4ed6\u8ba1\u7b97\u3002</p>\n",
"<p>resulting mask has shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u751f\u6210\u7684\u63a9\u7801\u5f62\u72b6\u4e3a<span translate=no>_^_0_^_</span></p>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the number of heads. </li>\n<li><span translate=no>_^_1_^_</span> is the number of features in the <span translate=no>_^_2_^_</span>, <span translate=no>_^_3_^_</span> and <span translate=no>_^_4_^_</span> vectors.</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u6ce8\u610f\u529b\u5934\u7684\u6570\u91cf\u3002</li>\n<li><span translate=no>_^_1_^_</span>\u662f\u5411\u91cf<span translate=no>_^_2_^_</span>\u3001<span translate=no>_^_3_^_</span>\u548c<span translate=no>_^_4_^_</span>\u4e2d\u7684\u7279\u5f81\u6570\u91cf\u3002</li></ul>\n",
"Multi-Headed Attention (MHA)": "\u591a\u5934\u6ce8\u610f\u529b (MHA)",
"This implements the Multi-Headed Attention used in transformers using PyTorch with explanations.": "\u8fd9\u4e2a\u4ee3\u7801\u7528 PyTorch \u5b9e\u73b0\u4e86 Transformers \u4e2d\u7684\u591a\u5934\u6ce8\u610f\u529b\uff0c\u5e76\u9644\u6709\u9010\u884c\u6ce8\u91ca\u3002"
}