Path: blob/master/translate_cache/transformers/mha.ja.json
{1"<h1>Multi-Headed Attention (MHA)</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/basic/autoregressive_experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n<p>This is a tutorial/implementation of multi-headed attention from paper <a href=\"https://arxiv.org/abs/1706.03762\">Attention Is All You Need</a> in <a href=\"https://pytorch.org/\">PyTorch</a>. The implementation is inspired from <a href=\"https://nlp.seas.harvard.edu/2018/04/03/attention.html\">Annotated Transformer</a>.</p>\n<p>Here is the <a href=\"basic/autoregressive_experiment.html\">training code</a> that uses a basic transformer with MHA for NLP auto-regression.</p>\n<p><a href=\"basic/autoregressive_experiment.html\">Here is an experiment implementation</a> that trains a simple transformer.</p>\n": "<h1>\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3 (MHA)</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/basic/autoregressive_experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n</a><p>\u3053\u308c\u306f\u3001PyTorch\u306e\u8ad6\u6587\u300c<a href=\"https://arxiv.org/abs/1706.03762\">\u6ce8\u610f\u3055\u3048\u3042\u308c\u3070\u5341\u5206\u300d\u306e\u300c\u591a\u9762\u7684\u306a\u6ce8\u610f\u300d\u306e\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb/\u5b9f\u88c5\u3067\u3059</a>\u3002<a href=\"https://pytorch.org/\"><a href=\"https://nlp.seas.harvard.edu/2018/04/03/attention.html\">\u5b9f\u88c5\u306f\u6ce8\u91c8\u4ed8\u304d\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u304b\u3089\u7740\u60f3\u3092\u5f97\u3066\u3044\u307e\u3059</a></p>\u3002\n<p>\u3053\u308c\u306f\u3001<a href=\"basic/autoregressive_experiment.html\">NLP\u81ea\u5df1\u56de\u5e30\u7528\u306eMHA\u3092\u5099\u3048\u305f\u57fa\u672c\u7684\u306a\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u3092\u4f7f\u7528\u3059\u308b\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30b3\u30fc\u30c9\u3067\u3059</a>\u3002</p>\n<p><a href=\"basic/autoregressive_experiment.html\">\u3053\u308c\u306f\u7c21\u5358\u306a\u5909\u5727\u5668\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u5b9f\u9a13\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n",2"<h3>Calculate scores between queries and keys</h3>\n<p>This method can be overridden for other variations like relative attention.</p>\n": "<h3>\u30af\u30a8\u30ea\u3068\u30ad\u30fc\u9593\u306e\u30b9\u30b3\u30a2\u306e\u8a08\u7b97</h3>\n<p>\u3053\u306e\u65b9\u6cd5\u306f\u3001\u76f8\u5bfe\u7684\u6ce8\u610f\u529b\u306a\u3069\u306e\u4ed6\u306e\u30d0\u30ea\u30a8\u30fc\u30b7\u30e7\u30f3\u3067\u306f\u30aa\u30fc\u30d0\u30fc\u30e9\u30a4\u30c9\u3067\u304d\u307e\u3059\u3002</p>\n",3"<p> <a id=\"MHA\"></a></p>\n<h2>Multi-Head Attention Module</h2>\n<p>This computes scaled multi-headed attention for given <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> vectors.</p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>In simple terms, it finds keys that matches the query, and gets the values of those keys.</p>\n<p>It uses dot-product of query and key as the indicator of how matching they are. Before taking the <span translate=no>_^_4_^_</span> the dot-products are scaled by <span translate=no>_^_5_^_</span>. 
This is done to avoid large dot-product values causing softmax to give very small gradients when <span translate=no>_^_6_^_</span> is large.</p>\n<p>Softmax is calculated along the axis of of the sequence (or time).</p>\n": "<p><a id=\"MHA\"></a></p>\n<h2>\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30e2\u30b8\u30e5\u30fc\u30eb</h2>\n<p><span translate=no>_^_0_^_</span>\u4e0e\u3048\u3089\u308c\u305f\u30d9\u30af\u30c8\u30eb\u3084\u30d9\u30af\u30c8\u30eb\u306b\u5bfe\u3057\u3066\u3001\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u3055\u308c\u305f\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span> <span translate=no>_^_2_^_</span></p>\n<p><span translate=no>_^_3_^_</span></p>\n<p>\u7c21\u5358\u306b\u8a00\u3046\u3068\u3001\u30af\u30a8\u30ea\u306b\u4e00\u81f4\u3059\u308b\u30ad\u30fc\u3092\u898b\u3064\u3051\u3001\u305d\u308c\u3089\u306e\u30ad\u30fc\u306e\u5024\u3092\u53d6\u5f97\u3057\u307e\u3059\u3002</p>\n<p>\u30af\u30a8\u30ea\u3068\u30ad\u30fc\u306e\u30c9\u30c3\u30c8\u7a4d\u304c\u3069\u306e\u7a0b\u5ea6\u4e00\u81f4\u3057\u3066\u3044\u308b\u304b\u3092\u793a\u3059\u6307\u6a19\u3068\u3057\u3066\u4f7f\u7528\u3057\u307e\u3059\u3002<span translate=no>_^_4_^_</span>\u64ae\u5f71\u524d\u306b\u30c9\u30c3\u30c8\u30d7\u30ed\u30c0\u30af\u30c8\u3092\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u3057\u307e\u3059\u3002<span translate=no>_^_5_^_</span>\u3053\u308c\u306f\u3001\u30c9\u30c3\u30c8\u7a4d\u5024\u304c\u5927\u304d\u3044\u5834\u5408\u306b softmax \u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u304c\u975e\u5e38\u306b\u5c0f\u3055\u304f\u306a\u308b\u539f\u56e0\u3068\u306a\u3089\u306a\u3044\u3088\u3046\u306b\u3059\u308b\u305f\u3081\u3067\u3059</p>\u3002<span translate=no>_^_6_^_</span>\n<p>Softmax \u306f\u3001\u30b7\u30fc\u30b1\u30f3\u30b9 (\u307e\u305f\u306f\u6642\u9593) \u306e\u8ef8\u306b\u6cbf\u3063\u3066\u8a08\u7b97\u3055\u308c\u307e\u3059\u3002</p>\n",4"<p> <a id=\"PrepareMHA\"></a></p>\n<h2>Prepare for multi-head attention</h2>\n<p>This module does a linear transformation and splits the vector into given number of heads for multi-head attention. This is used to transform <strong>key</strong>, <strong>query</strong>, and <strong>value</strong> vectors.</p>\n": "<p><a id=\"PrepareMHA\"></a></p>\n<h2>\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u306b\u5099\u3048\u307e\u3057\u3087\u3046</h2>\n<p>\u3053\u306e\u30e2\u30b8\u30e5\u30fc\u30eb\u306f\u7dda\u5f62\u5909\u63db\u3092\u884c\u3044\u3001\u30d9\u30af\u30c8\u30eb\u3092\u6307\u5b9a\u3055\u308c\u305f\u6570\u306e\u30d8\u30c3\u30c9\u306b\u5206\u5272\u3057\u3066\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u884c\u3044\u307e\u3059\u3002\u3053\u308c\u306f\u3001<strong>\u30ad\u30fc</strong>\u3001<strong>\u30af\u30a8\u30ea</strong>\u3001<strong>\u304a\u3088\u3073\u5024\u306e\u30d9\u30af\u30c8\u30eb\u3092\u5909\u63db\u3059\u308b\u305f\u3081\u306b\u4f7f\u7528\u3055\u308c\u307e\u3059</strong>\u3002</p>\n",5"<p> <span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span>, where first dimension is the query dimension. 
If the query dimension is equal to <span translate=no>_^_2_^_</span> it will be broadcasted.</p>\n": "<p><span translate=no>_^_0_^_</span>\u306b\u306f\u5f62\u72b6\u304c\u3042\u308a<span translate=no>_^_1_^_</span>\u3001\u6700\u521d\u306e\u6b21\u5143\u306f\u30af\u30a8\u30ea\u6b21\u5143\u3067\u3059\u3002<span translate=no>_^_2_^_</span>\u30af\u30a8\u30ea\u30c7\u30a3\u30e1\u30f3\u30b7\u30e7\u30f3\u304c\u305d\u308c\u3068\u7b49\u3057\u3044\u5834\u5408\u306f\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3055\u308c\u307e\u3059</p>\u3002\n",6"<p> <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> are the tensors that store collection of <em>query</em>, <em>key</em> and <em>value</em> vectors. They have shape <span translate=no>_^_3_^_</span>.</p>\n<p><span translate=no>_^_4_^_</span> has shape <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> indicates whether for batch <span translate=no>_^_7_^_</span>, query at position <span translate=no>_^_8_^_</span> has access to key-value at position <span translate=no>_^_9_^_</span>.</p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u304a\u3088\u3073\u306f\u3001<em>\u30af\u30a8\u30ea</em>\u3001<em>\u30ad\u30fc</em>\u3001<em>\u304a\u3088\u3073\u5024\u306e\u30d9\u30af\u30c8\u30eb\u306e\u30b3\u30ec\u30af\u30b7\u30e7\u30f3\u3092\u683c\u7d0d\u3059\u308b\u30c6\u30f3\u30bd\u30eb\u3067\u3059</em>\u3002\u5f62\u304c\u3042\u308a\u307e\u3059<span translate=no>_^_3_^_</span>\u3002</p>\n<p><span translate=no>_^_4_^_</span><span translate=no>_^_5_^_</span>\u5f62\u72b6\u304c\u3042\u308a\u3001\u30d0\u30c3\u30c1\u306e\u5834\u5408<span translate=no>_^_7_^_</span>\u3001<span translate=no>_^_6_^_</span><span translate=no>_^_8_^_</span>\u305d\u306e\u4f4d\u7f6e\u306e\u30af\u30a8\u30ea\u304c\u305d\u306e\u4f4d\u7f6e\u306e\u30ad\u30fc\u5024\u306b\u30a2\u30af\u30bb\u30b9\u3067\u304d\u308b\u304b\u3069\u3046\u304b\u3092\u793a\u3057\u307e\u3059\u3002<span translate=no>_^_9_^_</span></p>\n",7"<p><span translate=no>_^_0_^_</span> attention along the key sequence dimension <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u30ad\u30fc\u30b7\u30fc\u30b1\u30f3\u30b9\u6b21\u5143\u306b\u6cbf\u3063\u3066\u6ce8\u76ee <span translate=no>_^_1_^_</span></p>\n",8"<p><span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> have shape <span translate=no>_^_3_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u305d\u3057\u3066\u5f62\u304c\u3042\u308b <span translate=no>_^_3_^_</span></p>\n",9"<p>Apply dropout </p>\n": "<p>\u30c9\u30ed\u30c3\u30d7\u30a2\u30a6\u30c8\u3092\u9069\u7528</p>\n",10"<p>Apply mask </p>\n": "<p>\u30de\u30b9\u30af\u3092\u9069\u7528</p>\n",11"<p>Calculate <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u8a08\u7b97\u307e\u305f\u306f <span translate=no>_^_1_^_</span></p>\n",12"<p>Compute attention scores <span translate=no>_^_0_^_</span>. This gives a tensor of shape <span translate=no>_^_1_^_</span>. 
</p>\n": "<p><span translate=no>_^_0_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30b9\u30b3\u30a2\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u3053\u308c\u306b\u3088\u308a\u5f62\u72b6\u306e\u30c6\u30f3\u30bd\u30eb\u304c\u5f97\u3089\u308c\u307e\u3059</p>\u3002\n",13"<p>Concatenate multiple heads </p>\n": "<p>\u8907\u6570\u306e\u30d8\u30c3\u30c9\u3092\u9023\u7d50</p>\n",14"<p>Dropout </p>\n": "<p>\u30c9\u30ed\u30c3\u30d7\u30a2\u30a6\u30c8</p>\n",15"<p>Input has shape <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span>. We apply the linear transformation to the last dimension and split that into the heads. </p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u5165\u529b\u306e\u5f62\u72b6\u306f\u307e\u305f\u306f\u3067\u3059\u3002\u7dda\u5f62\u5909\u63db\u3092\u6700\u5f8c\u306e\u6b21\u5143\u306b\u9069\u7528\u3057\u3001\u305d\u308c\u3092\u982d\u306b\u5206\u5272\u3057\u307e\u3059\u3002</p>\n",16"<p>Linear layer for linear transform </p>\n": "<p>\u7dda\u5f62\u5909\u63db\u7528\u306e\u7dda\u5f62\u5c64</p>\n",17"<p>Linear transform </p>\n": "<p>\u7dda\u5f62\u5909\u63db</p>\n",18"<p>Multiply by values <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5024\u306b\u3088\u308b\u4e57\u7b97 <span translate=no>_^_0_^_</span></p>\n",19"<p>Number of dimensions in vectors in each head </p>\n": "<p>\u5404\u30d8\u30c3\u30c9\u306e\u30d9\u30af\u30c8\u30eb\u306e\u6b21\u5143\u6570</p>\n",20"<p>Number of features per head </p>\n": "<p>\u30d8\u30c3\u30c9\u3042\u305f\u308a\u306e\u6a5f\u80fd\u6570</p>\n",21"<p>Number of heads </p>\n": "<p>\u30d8\u30c3\u30c9\u6570</p>\n",22"<p>Output has shape <span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u51fa\u529b\u306e\u5f62\u72b6\u304c\u3042\u308b\u304b <span translate=no>_^_1_^_</span></p>\n",23"<p>Output layer </p>\n": "<p>\u51fa\u529b\u30ec\u30a4\u30e4\u30fc</p>\n",24"<p>Prepare <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> for attention computation. These will then have shape <span translate=no>_^_3_^_</span>. </p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u6ce8\u610f\u529b\u8a08\u7b97\u306e\u6e96\u5099\u3092\u3057\u3066<span translate=no>_^_3_^_</span>\u3053\u308c\u3067\u5f62\u304c\u3067\u304d\u3042\u304c\u308a\u307e\u3059\u3002</p>\n",25"<p>Same mask applied to all heads. 
</p>\n": "<p>\u3059\u3079\u3066\u306e\u982d\u306b\u540c\u3058\u30de\u30b9\u30af\u3092\u304b\u3051\u307e\u3057\u305f\u3002</p>\n",26"<p>Save attentions for any other calculations </p>\n": "<p>\u4ed6\u306e\u8a08\u7b97\u306b\u6ce8\u610f\u3092\u5411\u3051\u3066\u304a\u304f</p>\n",27"<p>Save attentions if debugging </p>\n": "<p>\u30c7\u30d0\u30c3\u30b0\u6642\u306e\u6ce8\u610f\u4e8b\u9805\u3092\u4fdd\u5b58</p>\n",28"<p>Scale scores <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30b9\u30b1\u30fc\u30eb\u30b9\u30b3\u30a2 <span translate=no>_^_0_^_</span></p>\n",29"<p>Scaling factor before the softmax </p>\n": "<p>\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9\u524d\u306e\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u30d5\u30a1\u30af\u30bf\u30fc</p>\n",30"<p>Softmax for attention along the time dimension of <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6642\u9593\u8ef8\u306b\u6cbf\u3063\u305f\u6ce8\u76ee\u306e\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9 <span translate=no>_^_0_^_</span></p>\n",31"<p>Split last dimension into heads </p>\n": "<p>\u6700\u5f8c\u306e\u30c7\u30a3\u30e1\u30f3\u30b7\u30e7\u30f3\u3092\u30d8\u30c3\u30c9\u306b\u5206\u5272</p>\n",32"<p>These transform the <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> and <span translate=no>_^_2_^_</span> vectors for multi-headed attention. </p>\n": "<p>\u3053\u308c\u3089\u306f<span translate=no>_^_0_^_</span>\u3001\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u306e\u30d9\u30af\u30c8\u30eb\u3092\u5909\u3048\u3066\u3001\u591a\u9762\u7684\u306a\u6ce8\u610f\u3092\u4fc3\u3057\u307e\u3059\u3002</p>\n",33"<p>We store attentions so that it can be used for logging, or other computations if needed </p>\n": "<p>\u5fc5\u8981\u306b\u5fdc\u3058\u3066\u30ed\u30ae\u30f3\u30b0\u3084\u305d\u306e\u4ed6\u306e\u8a08\u7b97\u306b\u4f7f\u7528\u3067\u304d\u308b\u3088\u3046\u306b\u3001\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u4fdd\u5b58\u3057\u307e\u3059</p>\n",34"<p>resulting mask has shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u751f\u6210\u3055\u308c\u308b\u30de\u30b9\u30af\u306b\u306f\u5f62\u72b6\u304c\u3042\u308a\u307e\u3059 <span translate=no>_^_0_^_</span></p>\n",35"<ul><li><span translate=no>_^_0_^_</span> is the number of heads. </li>\n<li><span translate=no>_^_1_^_</span> is the number of features in the <span translate=no>_^_2_^_</span>, <span translate=no>_^_3_^_</span> and <span translate=no>_^_4_^_</span> vectors.</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u306f\u982d\u306e\u6570\u3067\u3059\u3002</li>\n<li><span translate=no>_^_1_^_</span>\u306f<span translate=no>_^_2_^_</span>\u3001<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u304a\u3088\u3073\u30d9\u30af\u30c8\u30eb\u5185\u306e\u7279\u5fb4\u306e\u6570\u3067\u3059\u3002</li></ul>\n",36"Multi-Headed Attention (MHA)": "\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3 (MHA)",37"This implements the Multi-Headed Attention used in transformers using PyTorch with explanations.": "PyTorch\u3092\u4f7f\u3063\u305f\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u3067\u4f7f\u308f\u308c\u308b\u30de\u30eb\u30c1\u30d8\u30c3\u30c9\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u8aac\u660e\u4ed8\u304d\u3067\u5b9f\u88c5\u3057\u3066\u3044\u307e\u3059\u3002"38}3940