GitHub Repository: labmlai/annotated_deep_learning_paper_implementations
Path: blob/master/translate_cache/neox/model.ja.json
{
"<h1>GPT-NeoX Model</h1>\n<p>Here is the code for layers of GPT-NeoX model and the code to load 20B checkpoint.</p>\n<p>The method <span translate=no>_^_0_^_</span> in the layers load the checkpoints of that layer. The checkpoint loading helpers are on <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>GPT-NeoX \u30e2\u30c7\u30eb</h1>\n<p>\u3053\u308c\u306f\u3001GPT-NeoX\u30e2\u30c7\u30eb\u306e\u30ec\u30a4\u30e4\u30fc\u7528\u306e\u30b3\u30fc\u30c9\u306820B\u306e\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u30ed\u30fc\u30c9\u3059\u308b\u30b3\u30fc\u30c9\u3067\u3059\u3002</p>\n<p>\u5404\u30ec\u30a4\u30e4\u30fc\u306e\u30e1\u30bd\u30c3\u30c9 <span translate=no>_^_0_^_</span> \u306f\u3001\u305d\u306e\u30ec\u30a4\u30e4\u30fc\u306e\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u30ed\u30fc\u30c9\u3057\u307e\u3059\u3002\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u30ed\u30fc\u30c9\u30d8\u30eb\u30d1\u30fc\u306f <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a> \u306b\u3042\u308a\u307e\u3059</p>\n",
"<h2>Attention layer</h2>\n": "<h2>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc</h2>\n",
"<h2>Embedding layer</h2>\n<p>This is a standard embeddings layer with code to load the checkpoint.</p>\n": "<h2>\u57cb\u3081\u8fbc\u307f\u30ec\u30a4\u30e4\u30fc</h2>\n<p>\u3053\u308c\u306f\u3001\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u30ed\u30fc\u30c9\u3059\u308b\u30b3\u30fc\u30c9\u3092\u542b\u3080\u6a19\u6e96\u306e\u57cb\u3081\u8fbc\u307f\u30ec\u30a4\u30e4\u30fc\u3067\u3059\u3002</p>\n",
"<h2>Feedforward Network</h2>\n": "<h2>\u30d5\u30a3\u30fc\u30c9\u30d5\u30a9\u30ef\u30fc\u30c9\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</h2>\n",
"<h2>Final normalization layer</h2>\n": "<h2>\u6700\u7d42\u6b63\u898f\u5316\u30ec\u30a4\u30e4\u30fc</h2>\n",
"<h2>Rotary Positional Embeddings</h2>\n<p>GPT-NeoX uses <a href=\"https://arxiv.org/abs/2104.09864\">rotary positional embeddings (RoPE)</a>.</p>\n<p>We have an annotated implementation of RoPE <a href=\"https://nn.labml.ai/transformers/rope/index.html\">here</a> with more notes on the theory.</p>\n": "<h2>\u30ed\u30fc\u30bf\u30ea\u30fc\u30dd\u30b8\u30b7\u30e7\u30ca\u30eb\u30a8\u30f3\u30d9\u30c7\u30a3\u30f3\u30b0</h2>\n<p>GPT-NeoX \u306f<a href=\"https://arxiv.org/abs/2104.09864\">\u56de\u8ee2\u5f0f\u30dd\u30b8\u30b7\u30e7\u30ca\u30eb\u30a8\u30f3\u30d9\u30c7\u30a3\u30f3\u30b0</a>\uff08RoPE\uff09\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059\u3002</p>\n<p>RoPE \u306e\u6ce8\u91c8\u4ed8\u304d\u5b9f\u88c5\u306f<a href=\"https://nn.labml.ai/transformers/rope/index.html\">\u3053\u3061\u3089</a>\u306b\u3042\u308a\u3001\u7406\u8ad6\u306b\u95a2\u3059\u308b\u89e3\u8aac\u3082\u3042\u308a\u307e\u3059\u3002</p>\n",
"<h2>Transformer Layer</h2>\n": "<h2>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u30ec\u30a4\u30e4\u30fc</h2>\n",
"<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don&#x27;t need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span> is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span> is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span> is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span> are the set of layers to be used. All layers will be used if None. This is used to test smaller versions of the model with fewer layers </li>\n<li><span translate=no>_^_6_^_</span> specifies whether to clone the transformer layers (a bit faster) </li>\n<li><span translate=no>_^_7_^_</span> is the data type of the model </li>\n<li><span translate=no>_^_8_^_</span> is the device of the model </li>\n<li><span translate=no>_^_9_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_10_^_</span> is the threshold <span translate=no>_^_11_^_</span> used to separate outlier features </li>\n<li><span translate=no>_^_12_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<h3>\u30ec\u30a4\u30e4\u30fc\u3092\u4f5c\u6210\u3059\u308b\u305f\u3081\u306e\u30b8\u30a7\u30cd\u30ec\u30fc\u30bf\u30fc</h3>\n<p>\u30ec\u30a4\u30e4\u30fc\u306f\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3068\u540c\u3058\u9806\u5e8f\u3067\u751f\u6210\u3055\u308c\u307e\u3059\u3002</p>\n<p><span translate=no>_^_0_^_</span>\u30ec\u30a4\u30e4\u30fc\u304c\u4f7f\u7528\u3067\u304d\u306a\u3044\u5834\u5408\u306b\u8fd4\u3055\u308c\u307e\u3059\u3002\u30ec\u30a4\u30e4\u30fc\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u306fNeoX\u3068\u540c\u3058\u3082\u306e\u3092\u4f7f\u7528\u3057\u3001\u5b9f\u88c5\u306b\u306f\u5fc5\u8981\u306e\u306a\u3044\u5909\u63db\u30ec\u30a4\u30e4\u30fc\u304c2\u3064\u3042\u308a\u307e\u3059\u3002</p>\n<ul><li><span translate=no>_^_1_^_</span>\u30dc\u30ad\u30e3\u30d6\u30e9\u30ea\u5185\u306e\u30c8\u30fc\u30af\u30f3\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_2_^_</span>\u306f\u57cb\u3081\u8fbc\u307f\u5185\u306e\u30d5\u30a3\u30fc\u30c1\u30e3\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_3_^_</span>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u30ec\u30a4\u30e4\u30fc\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_4_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30fb\u30d8\u30c3\u30c9\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_5_^_</span>\u4f7f\u7528\u3059\u308b\u30ec\u30a4\u30e4\u30fc\u306e\u30bb\u30c3\u30c8\u3067\u3059\u3002None \u306e\u5834\u5408\u306f\u3059\u3079\u3066\u306e\u30ec\u30a4\u30e4\u30fc\u304c\u4f7f\u7528\u3055\u308c\u307e\u3059\u3002\u3053\u308c\u306f\u3001\u30ec\u30a4\u30e4\u30fc\u6570\u306e\u5c11\u306a\u3044\u30e2\u30c7\u30eb\u306e\u5c0f\u3055\u3044\u30d0\u30fc\u30b8\u30e7\u30f3\u3092\u30c6\u30b9\u30c8\u3059\u308b\u5834\u5408\u306b\u4f7f\u7528\u3057\u307e\u3059</li>\n<li><span translate=no>_^_6_^_</span>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u30ec\u30a4\u30e4\u30fc\u306e\u30af\u30ed\u30fc\u30f3\u3092\u4f5c\u6210\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059 (\u5c11\u3057\u901f\u304f\u306a\u308a\u307e\u3059)</li>\n<li><span translate=no>_^_7_^_</span>\u30e2\u30c7\u30eb\u306e\u30c7\u30fc\u30bf\u578b\u3067\u3059</li>\n<li><span translate=no>_^_8_^_</span>\u30e2\u30c7\u30eb\u306e\u30c7\u30d0\u30a4\u30b9\u3067\u3059</li>\n<li><span translate=no>_^_9_^_</span>int8 \u91cf\u5b50\u5316\u3092\u4f7f\u7528\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059</li>\n<li><span translate=no>_^_10_^_</span><span translate=no>_^_11_^_</span>\u5916\u308c\u5024\u306e\u7279\u5fb4\u3092\u5206\u96e2\u3059\u308b\u305f\u3081\u306e\u95be\u5024\u3067\u3059</li>\n<li><span translate=no>_^_12_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">\u30d5\u30e9\u30c3\u30b7\u30e5\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3</a>\u3092\u4f7f\u7528\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059</li></ul>\n",
"<h3>Generator to get layers</h3>\n": "<h3>\u30ec\u30a4\u30e4\u30fc\u3092\u53d6\u5f97\u3059\u308b\u305f\u3081\u306e\u30b8\u30a7\u30cd\u30ec\u30fc\u30bf\u30fc</h3>\n",
"<h3>Generator to load layers</h3>\n": "<h3>\u30ec\u30a4\u30e4\u30fc\u3092\u30ed\u30fc\u30c9\u3059\u308b\u30b8\u30a7\u30cd\u30ec\u30fc\u30bf\u30fc</h3>\n",
"<h3>Returns the total number of layers</h3>\n": "<h3>\u30ec\u30a4\u30e4\u30fc\u306e\u7dcf\u6570\u3092\u8fd4\u3057\u307e\u3059</h3>\n",
"<h3>Rotate the features</h3>\n<p><span translate=no>_^_0_^_</span></p>\n": "<h3>\u30d5\u30a3\u30fc\u30c1\u30e3\u3092\u56de\u8ee2\u3055\u305b\u308b</h3>\n<p><span translate=no>_^_0_^_</span></p>\n",
"<h4>Calculate the causal mask</h4>\n<ul><li><span translate=no>_^_0_^_</span> has shape <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n": "<h4>\u56e0\u679c\u30de\u30b9\u30af\u306e\u8a08\u7b97</h4>\n<ul><li><span translate=no>_^_0_^_</span> \u306e\u5f62\u72b6\u306f <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a> \u3067\u3059</li></ul>\n",
"<h4>Creates and caches a layer</h4>\n<p>Copying cached layers is faster than initializing new layers because it takes time to initialize parameters.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the name of the layer </li>\n<li><span translate=no>_^_1_^_</span> is the function to create the layer </li>\n<p><em>Returns</em> the created layer or a copy of the cached layer</p></ul>\n": "<h4>\u30ec\u30a4\u30e4\u30fc\u3092\u4f5c\u6210\u3057\u3066\u30ad\u30e3\u30c3\u30b7\u30e5\u3057\u307e\u3059</h4>\n<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u3055\u308c\u305f\u30ec\u30a4\u30e4\u30fc\u306e\u30b3\u30d4\u30fc\u306f\u3001\u30d1\u30e9\u30e1\u30fc\u30bf\u30fc\u306e\u521d\u671f\u5316\u306b\u6642\u9593\u304c\u304b\u304b\u308b\u305f\u3081\u3001\u65b0\u3057\u3044\u30ec\u30a4\u30e4\u30fc\u3092\u521d\u671f\u5316\u3059\u308b\u3088\u308a\u3082\u9ad8\u901f\u3067\u3059\u3002</p>\n<ul><li><span translate=no>_^_0_^_</span>\u30ec\u30a4\u30e4\u30fc\u306e\u540d\u524d\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u30ec\u30a4\u30e4\u30fc\u3092\u4f5c\u6210\u3059\u308b\u95a2\u6570\u3067\u3059</li>\n<p><em>\u4f5c\u6210\u3055\u308c\u305f\u30ec\u30a4\u30e4\u30fc\u307e\u305f\u306f\u30ad\u30e3\u30c3\u30b7\u30e5\u3055\u308c\u305f\u30ec\u30a4\u30e4\u30fc\u306e\u30b3\u30d4\u30fc\u3092\u8fd4\u3057\u307e\u3059</em></p></ul>\n",
"<h4>Prepares the layer for usage</h4>\n<p>We move the layer to the device and convert it to the correct data type</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<h4>\u30ec\u30a4\u30e4\u30fc\u3092\u4f7f\u7528\u3067\u304d\u308b\u3088\u3046\u306b\u6e96\u5099\u3057\u307e\u3059</h4>\n<p>\u30ec\u30a4\u30e4\u30fc\u3092\u30c7\u30d0\u30a4\u30b9\u306b\u79fb\u52d5\u3057\u3001\u6b63\u3057\u3044\u30c7\u30fc\u30bf\u578b\u306b\u5909\u63db\u3057\u307e\u3059\u3002</p>\n<ul><li><span translate=no>_^_0_^_</span>\u6e96\u5099\u3059\u308b\u30ec\u30a4\u30e4\u30fc\u3067\u3059</li>\n<p><em>\u6e96\u5099\u3057\u305f\u30ec\u30a4\u30e4\u30fc\u3092\u8fd4\u3057\u307e\u3059</em></p></ul>\n",
"<p> </p>\n": "<p></p>\n",
"<p> <a id=\"post_load_prepare\"></a></p>\n<h3>Layer transformations after loading the checkpoint</h3>\n<p>This function implements layer transformations after loading the checkpoint.</p>\n<p>Currently, it only applies the int8 quantization.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<li><span translate=no>_^_1_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_2_^_</span> is the device of the model </li>\n<li><span translate=no>_^_3_^_</span> is the threshold <span translate=no>_^_4_^_</span> used to separate outlier features </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<p><a id=\"post_load_prepare\"></a></p>\n<h3>\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u30ed\u30fc\u30c9\u3057\u305f\u5f8c\u306e\u30ec\u30a4\u30e4\u30fc\u5909\u63db</h3>\n<p>\u3053\u306e\u95a2\u6570\u306f\u3001\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u8aad\u307f\u8fbc\u3093\u3060\u5f8c\u306b\u30ec\u30a4\u30e4\u30fc\u5909\u63db\u3092\u5b9f\u88c5\u3057\u307e\u3059\u3002</p>\n<p>\u73fe\u5728\u3001\u9069\u7528\u3055\u308c\u308b\u306e\u306f int8 \u91cf\u5b50\u5316\u306e\u307f\u3067\u3059\u3002</p>\n<ul><li><span translate=no>_^_0_^_</span>\u6e96\u5099\u3059\u308b\u30ec\u30a4\u30e4\u30fc\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>int8 \u91cf\u5b50\u5316\u3092\u4f7f\u7528\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059</li>\n<li><span translate=no>_^_2_^_</span>\u30e2\u30c7\u30eb\u306e\u30c7\u30d0\u30a4\u30b9\u3067\u3059</li>\n<li><span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u5916\u308c\u5024\u306e\u7279\u5fb4\u3092\u5206\u96e2\u3059\u308b\u305f\u3081\u306e\u95be\u5024\u3067\u3059</li>\n<p><em>\u6e96\u5099\u3057\u305f\u30ec\u30a4\u30e4\u30fc\u3092\u8fd4\u3057\u307e\u3059</em></p></ul>\n",
"<p> <span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p> Code to load the checkpoint</p>\n": "<p>\u30c1\u30a7\u30c3\u30af\u30dd\u30a4\u30f3\u30c8\u3092\u30ed\u30fc\u30c9\u3059\u308b\u30b3\u30fc\u30c9</p>\n",
"<p> Readout layer</p>\n": "<p>\u8aad\u307f\u51fa\u3057\u5c64</p>\n",
"<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> </p>\n": "<p><a href=\"https://github.com/HazyResearch/flash-attention\">\u30d5\u30e9\u30c3\u30b7\u30e5\u30fb\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3</a></p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Add RoPE embeddings </p>\n": "<p>RoPE \u57cb\u3081\u8fbc\u307f\u3092\u8ffd\u52a0</p>\n",
"<p>Add head dimension </p>\n": "<p>\u30d8\u30c3\u30c9\u306e\u6b21\u5143\u3092\u8ffd\u52a0</p>\n",
"<p>Add them and the residual connection </p>\n": "<p>\u305d\u308c\u3089\u3068\u6b8b\u5dee\u63a5\u7d9a\u3092\u52a0\u7b97\u3057\u307e\u3059</p>\n",
"<p>Apply mask </p>\n": "<p>\u30de\u30b9\u30af\u3092\u9069\u7528</p>\n",
"<p>Attention layer </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Attention output transform </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u51fa\u529b\u5909\u63db</p>\n",
"<p>Attention query, key and value transform </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30af\u30a8\u30ea\u3001\u30ad\u30fc\u3001\u5024\u306e\u5909\u63db</p>\n",
"<p>Attention scaling factor </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u30d5\u30a1\u30af\u30bf\u30fc</p>\n",
"<p>Attention softmax </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9</p>\n",
"<p>Attention softmax module </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30bd\u30d5\u30c8\u30de\u30c3\u30af\u30b9\u30e2\u30b8\u30e5\u30fc\u30eb</p>\n",
"<p>Base for <span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> \u306e\u30d9\u30fc\u30b9</p>\n",
"<p>Cache <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> \u3068 <span translate=no>_^_1_^_</span> \u3092\u30ad\u30e3\u30c3\u30b7\u30e5</p>\n",
"<p>Cache them </p>\n": "<p>\u305d\u308c\u3089\u3092\u30ad\u30e3\u30c3\u30b7\u30e5\u3059\u308b</p>\n",
"<p>Calculate <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> in fp32 </p>\n": "<p><span translate=no>_^_0_^_</span> \u3068 <span translate=no>_^_1_^_</span> \u3092 fp32 \u3067\u8a08\u7b97</p>\n",
"<p>Concatenate so that for row <span translate=no>_^_0_^_</span> we have</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u884c <span translate=no>_^_0_^_</span> \u304c\u6b21\u306e\u3088\u3046\u306b\u306a\u308b\u3088\u3046\u306b\u9023\u7d50\u3057\u307e\u3059</p>\n<p><span translate=no>_^_1_^_</span></p>\n",
"<p>Concatenate the past </p>\n": "<p>\u904e\u53bb\u3092\u9023\u7d50\u3059\u308b</p>\n",
"<p>Concatenate with features that didn&#x27;t get RoPE embeddings </p>\n": "<p>RoPE \u57cb\u3081\u8fbc\u307f\u3092\u9069\u7528\u3057\u306a\u304b\u3063\u305f\u30d5\u30a3\u30fc\u30c1\u30e3\u3068\u9023\u7d50</p>\n",
"<p>Contraction linear layer </p>\n": "<p>\u53ce\u7e2e\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Convert the linear layers </p>\n": "<p>\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc\u306e\u5909\u63db</p>\n",
"<p>Convert to fp32 if the current dtype is fp16 </p>\n": "<p>\u73fe\u5728\u306e dtype \u304c fp16 \u306e\u5834\u5408\u306f fp32 \u306b\u5909\u63db</p>\n",
"<p>Create mask </p>\n": "<p>\u30de\u30b9\u30af\u4f5c\u6210</p>\n",
"<p>Disable auto-casting to fp16 for attention computation </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u8a08\u7b97\u306e fp16 \u3078\u306e\u81ea\u52d5\u30ad\u30e3\u30b9\u30c8\u3092\u7121\u52b9\u306b\u3059\u308b</p>\n",
"<p>Do not cast for bfloat </p>\n": "<p>bfloat \u306e\u5834\u5408\u306f\u30ad\u30e3\u30b9\u30c8\u3057\u306a\u3044</p>\n",
"<p>Embedding layer </p>\n": "<p>\u57cb\u3081\u8fbc\u307f\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Expansion linear layer </p>\n": "<p>\u62e1\u5f35\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>FFN first transform </p>\n": "<p>FFN \u306e\u6700\u521d\u306e\u5909\u63db</p>\n",
"<p>FFN layer </p>\n": "<p>FFN \u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>FFN second transform </p>\n": "<p>FFN \u306e 2 \u756a\u76ee\u306e\u5909\u63db</p>\n",
"<p>Final linear layer </p>\n": "<p>\u6700\u7d42\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Final normalization layer </p>\n": "<p>\u6700\u7d42\u6b63\u898f\u5316\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>GELU activation </p>\n": "<p>GELU \u30a2\u30af\u30c6\u30a3\u30d9\u30fc\u30b7\u30e7\u30f3</p>\n",
"<p>Get attention weighted values </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u52a0\u91cd\u5024\u3092\u53d6\u5f97</p>\n",
"<p>Get causal mask </p>\n": "<p>\u56e0\u679c\u30de\u30b9\u30af\u3092\u53d6\u5f97</p>\n",
"<p>Get default values if not specified </p>\n": "<p>\u6307\u5b9a\u3057\u306a\u3044\u5834\u5408\u306f\u30c7\u30d5\u30a9\u30eb\u30c8\u5024\u3092\u53d6\u5f97</p>\n",
"<p>Get position indexes <span translate=no>_^_0_^_</span> </p>\n": "<p>\u4f4d\u7f6e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3092\u53d6\u5f97 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get query, key and value embeddings (all concatenated). The last dimension size will change from n_hidden -&gt; <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30af\u30a8\u30ea\u3001\u30ad\u30fc\u3001\u5024\u306e\u57cb\u3081\u8fbc\u307f (\u3059\u3079\u3066\u9023\u7d50) \u3092\u53d6\u5f97\u3057\u307e\u3059\u3002\u6700\u5f8c\u306e\u30c7\u30a3\u30e1\u30f3\u30b7\u30e7\u30f3\u30b5\u30a4\u30ba\u306f n_hidden \u304b\u3089 <span translate=no>_^_0_^_</span> \u306b\u5909\u66f4\u3055\u308c\u307e\u3059</p>\n",
"<p>Get the actual sequence length </p>\n": "<p>\u5b9f\u969b\u306e\u30b7\u30fc\u30b1\u30f3\u30b9\u9577\u3092\u53d6\u5f97</p>\n",
"<p>Get the past keys and values. These will have shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u904e\u53bb\u306e\u30ad\u30fc\u3068\u5024\u3092\u53d6\u5f97\u3057\u307e\u3059\u3002\u3053\u308c\u3089\u306e\u5f62\u72b6\u306f <span translate=no>_^_0_^_</span> \u306b\u306a\u308a\u307e\u3059</p>\n",
"<p>Get the sin and cos values from the cache </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u304b\u3089 sin \u3068 cos \u306e\u5024\u3092\u53d6\u5f97</p>\n",
"<p>Get the state id&#x27;s. We use to retrieve previous states and store the next states </p>\n": "<p>\u30b9\u30c6\u30fc\u30c8 ID \u3092\u53d6\u5f97\u3057\u307e\u3059\u3002\u524d\u306e\u30b9\u30c6\u30fc\u30c8\u3092\u53d6\u5f97\u3057\u305f\u308a\u3001\u6b21\u306e\u30b9\u30c6\u30fc\u30c8\u3092\u4fdd\u5b58\u3057\u305f\u308a\u3059\u308b\u306e\u306b\u4f7f\u3044\u307e\u3059\u3002</p>\n",
"<p>If there&#x27;s cache </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u304c\u3042\u308b\u5834\u5408</p>\n",
"<p>If we are caching the states of previous tokens </p>\n": "<p>\u4ee5\u524d\u306e\u30c8\u30fc\u30af\u30f3\u306e\u72b6\u614b\u3092\u30ad\u30e3\u30c3\u30b7\u30e5\u3059\u308b\u5834\u5408</p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> \u3092\u521d\u671f\u5316</p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> cache </p>\n": "<p><span translate=no>_^_0_^_</span> \u3068 <span translate=no>_^_1_^_</span> \u306e\u30ad\u30e3\u30c3\u30b7\u30e5\u3092\u521d\u671f\u5316</p>\n",
"<p>Layer norm before FFN </p>\n": "<p>FFN \u524d\u306e\u30ec\u30a4\u30e4\u30fc\u30ce\u30eb\u30e0</p>\n",
"<p>Layer norm before attention </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u524d\u306e\u30ec\u30a4\u30e4\u30fc\u30ce\u30eb\u30e0</p>\n",
"<p>Layer normalization before FFN </p>\n": "<p>FFN \u524d\u306e\u30ec\u30a4\u30e4\u30fc\u6b63\u898f\u5316</p>\n",
"<p>Layer normalization before attention </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u524d\u306e\u30ec\u30a4\u30e4\u30fc\u6b63\u898f\u5316</p>\n",
"<p>Linear layer for query, key and value </p>\n": "<p>\u30af\u30a8\u30ea\u3001\u30ad\u30fc\u3001\u5024\u306e\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>NeoX runs attention and feedforward network in parallel </p>\n": "<p>NeoX\u306f\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3068\u30d5\u30a3\u30fc\u30c9\u30d5\u30a9\u30ef\u30fc\u30c9\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4e26\u884c\u3057\u3066\u5b9f\u884c\u3057\u307e\u3059</p>\n",
"<p>No cache - simply add RoPE embeddings </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u306a\u3057-RoPE \u57cb\u3081\u8fbc\u307f\u3092\u8ffd\u52a0\u3059\u308b\u3060\u3051</p>\n",
"<p>Number of features for RoPE </p>\n": "<p>RoPE \u306e\u6a5f\u80fd\u306e\u6570</p>\n",
"<p>Number of features per head </p>\n": "<p>\u30d8\u30c3\u30c9\u3042\u305f\u308a\u306e\u6a5f\u80fd\u6570</p>\n",
"<p>Offset of the current embeddings </p>\n": "<p>\u73fe\u5728\u306e\u57cb\u3081\u8fbc\u307f\u306e\u30aa\u30d5\u30bb\u30c3\u30c8</p>\n",
"<p>Only convert the linear layers in the transformer layers </p>\n": "<p>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u30ec\u30a4\u30e4\u30fc\u306e\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc\u306e\u307f\u3092\u5909\u63db\u3057\u307e\u3059</p>\n",
"<p>Otherwise, use normal attention </p>\n": "<p>\u305d\u308c\u4ee5\u5916\u306e\u5834\u5408\u306f\u3001\u901a\u5e38\u306e\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u4f7f\u7528\u3057\u307e\u3059</p>\n",
"<p>Query and key lengths </p>\n": "<p>\u30af\u30a8\u30ea\u3068\u30ad\u30fc\u306e\u9577\u3055</p>\n",
"<p>Readout layer </p>\n": "<p>\u8aad\u307f\u51fa\u3057\u5c64</p>\n",
"<p>Reshape from <span translate=no>_^_0_^_</span><a href=\"batch_size, seq_len, n_hidden\">batch_size, seq_len, n_hidden</a>` </p>\n": "<p><span translate=no>_^_0_^_</span><a href=\"batch_size, seq_len, n_hidden\">batch_size, seq_len, n_hidden</a>` \u304b\u3089\u5f62\u72b6\u3092\u5909\u66f4</p>\n",
"<p>Residual connection </p>\n": "<p>\u6b8b\u5dee\u63a5\u7d9a</p>\n",
"<p>Return from cache </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u304b\u3089\u8fd4\u3059</p>\n",
"<p>RoPE embedding module </p>\n": "<p>RoPE \u57cb\u3081\u8fbc\u307f\u30e2\u30b8\u30e5\u30fc\u30eb</p>\n",
"<p>RoPE embeddings</p>\n<span translate=no>_^_0_^_</span><p>for <span translate=no>_^_1_^_</span> </p>\n": "<p>RoPE \u57cb\u3081\u8fbc\u307f</p>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> \u306e\u5834\u5408</p>\n",
"<p>Save the current state </p>\n": "<p>\u73fe\u5728\u306e\u72b6\u614b\u3092\u4fdd\u5b58\u3059\u308b</p>\n",
"<p>Scale attention </p>\n": "<p>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0</p>\n",
"<p>Skip if not using int8 quantization </p>\n": "<p>int8 \u91cf\u5b50\u5316\u3092\u4f7f\u7528\u3057\u306a\u3044\u5834\u5408\u306f\u30b9\u30ad\u30c3\u30d7</p>\n",
"<p>Split into heads by changing the shape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5f62\u72b6\u3092 <span translate=no>_^_0_^_</span> \u306b\u5909\u66f4\u3057\u3066\u30d8\u30c3\u30c9\u306b\u5206\u5272\u3057\u307e\u3059</p>\n",
"<p>Split into query, key and value each of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u305d\u308c\u305e\u308c\u5f62\u72b6 <span translate=no>_^_0_^_</span> \u306e\u30af\u30a8\u30ea\u3001\u30ad\u30fc\u3001\u5024\u306b\u5206\u5272</p>\n",
"<p>Split the features. We apply RoPE to only <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u30d5\u30a3\u30fc\u30c1\u30e3\u3092\u5206\u5272\u3057\u307e\u3059\u3002RoPE \u306f <span translate=no>_^_0_^_</span> \u30d5\u30a3\u30fc\u30c1\u30e3\u306b\u306e\u307f\u9069\u7528\u3055\u308c\u307e\u3059</p>\n",
"<p>Stack them into shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u305d\u308c\u3089\u3092\u5f62\u72b6 <span translate=no>_^_0_^_</span> \u306b\u7a4d\u307f\u91cd\u306d\u307e\u3059</p>\n",
"<p>The output is of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u51fa\u529b\u306e\u5f62\u72b6\u306f <span translate=no>_^_0_^_</span> \u3067\u3059</p>\n",
"<p>To cache causal mask </p>\n": "<p>\u56e0\u679c\u30de\u30b9\u30af\u3092\u30ad\u30e3\u30c3\u30b7\u30e5\u3059\u308b\u306b\u306f</p>\n",
"<p>To store <span translate=no>_^_0_^_</span> for the features </p>\n": "<p><span translate=no>_^_0_^_</span>\u6a5f\u80fd\u7528\u306b\u4fdd\u5b58\u3059\u308b\u306b\u306f</p>\n",
"<p>Transformer layer </p>\n": "<p>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Transformer layers </p>\n": "<p>\u30c8\u30e9\u30f3\u30b9\u30d5\u30a9\u30fc\u30de\u30fc\u5c64</p>\n",
"<p>Use <span translate=no>_^_0_^_</span> defined in <a href=\"./utils/llm_int8.html\">utilities</a>. </p>\n": "<p><a href=\"./utils/llm_int8.html\">\u30e6\u30fc\u30c6\u30a3\u30ea\u30c6\u30a3</a>\u3067\u5b9a\u7fa9\u3055\u308c\u3066\u3044\u308b <span translate=no>_^_0_^_</span> \u3092\u4f7f\u7528\u3057\u307e\u3059\u3002</p>\n",
"<p>Use flash attention </p>\n": "<p>\u30d5\u30e9\u30c3\u30b7\u30e5\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u3092\u4f7f\u3046</p>\n",
"<ul><li><span translate=no>_^_0_^_</span> are the embeddings of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u306f\u5f62\u72b6 <span translate=no>_^_1_^_</span> \u306e\u57cb\u3081\u8fbc\u307f\u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> are the token ids of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u306f\u5f62\u72b6 <span translate=no>_^_1_^_</span> \u306e\u30c8\u30fc\u30af\u30f3ID\u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> is the starting position of <span translate=no>_^_3_^_</span>. This is <span translate=no>_^_4_^_</span> when we have cached the keys and queries of previous positions</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u306e\u5f62\u72b6\u306f <span translate=no>_^_1_^_</span> \u3067\u3059</li>\n<li><span translate=no>_^_2_^_</span> \u306f <span translate=no>_^_3_^_</span> \u306e\u958b\u59cb\u4f4d\u7f6e\u3067\u3059\u3002\u4ee5\u524d\u306e\u30dd\u30b8\u30b7\u30e7\u30f3\u306e\u30ad\u30fc\u3068\u30af\u30a8\u30ea\u3092\u30ad\u30e3\u30c3\u30b7\u30e5\u3057\u3066\u3044\u308b\u5834\u5408\u3001\u3053\u308c\u306f <span translate=no>_^_4_^_</span> \u306b\u306a\u308a\u307e\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u306e\u5f62\u72b6\u306f <span translate=no>_^_1_^_</span> \u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the number of heads </li>\n<li><span translate=no>_^_2_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>Our implementation doesn&#x27;t include dropout</em>.</p>\n": "<ul><li><span translate=no>_^_0_^_</span>\u306f\u57cb\u3081\u8fbc\u307f\u30b5\u30a4\u30ba</li>\n<li><span translate=no>_^_1_^_</span>\u306f\u30d8\u30c3\u30c9\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_2_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">\u30d5\u30e9\u30c3\u30b7\u30e5\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3</a>\u3092\u4f7f\u7528\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059</li></ul>\n<p><em>\u3053\u306e\u5b9f\u88c5\u306b\u306f\u30c9\u30ed\u30c3\u30d7\u30a2\u30a6\u30c8\u306f\u542b\u307e\u308c\u3066\u3044\u307e\u305b\u3093</em>\u3002</p>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the size of the vocabulary</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u306f\u57cb\u3081\u8fbc\u307f\u30b5\u30a4\u30ba</li>\n<li><span translate=no>_^_1_^_</span>\u30dc\u30ad\u30e3\u30d6\u30e9\u30ea\u30fc\u306e\u5927\u304d\u3055\u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u306f\u57cb\u3081\u8fbc\u307f\u30b5\u30a4\u30ba</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the number of features for RoPE embeddings </li>\n<li><span translate=no>_^_1_^_</span> is the base for <span translate=no>_^_2_^_</span>, which defaults to <span translate=no>_^_3_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u306f RoPE \u57cb\u3081\u8fbc\u307f\u306e\u30d5\u30a3\u30fc\u30c1\u30e3\u306e\u6570\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span> \u306f <span translate=no>_^_2_^_</span> \u306e\u57fa\u5e95\u3067\u3001\u30c7\u30d5\u30a9\u30eb\u30c8\u306f <span translate=no>_^_3_^_</span> \u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the size of the vocabulary </li>\n<li><span translate=no>_^_1_^_</span> is the size of the embeddings</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u30dc\u30ad\u30e3\u30d6\u30e9\u30ea\u30fc\u306e\u5927\u304d\u3055\u3067\u3059</li>\n<li><span translate=no>_^_1_^_</span>\u306f\u57cb\u3081\u8fbc\u307f\u306e\u30b5\u30a4\u30ba\u3067\u3059</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> the number of features in embeddings </li>\n<li><span translate=no>_^_1_^_</span> the number of attention heads </li>\n<li><span translate=no>_^_2_^_</span> percentage of features to add RoPE embeddings </li>\n<li><span translate=no>_^_3_^_</span> masking fill value for attention matrix </li>\n<li><span translate=no>_^_4_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u57cb\u3081\u8fbc\u307f\u306b\u542b\u307e\u308c\u308b\u30d5\u30a3\u30fc\u30c1\u30e3\u306e\u6570</li>\n<li><span translate=no>_^_1_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30fb\u30d8\u30c3\u30c9\u306e\u6570</li>\n<li><span translate=no>_^_2_^_</span>RoPE \u57cb\u3081\u8fbc\u307f\u3092\u8ffd\u52a0\u3059\u308b\u30d5\u30a3\u30fc\u30c1\u30e3\u306e\u5272\u5408</li>\n<li><span translate=no>_^_3_^_</span>\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3\u30fb\u30de\u30c8\u30ea\u30c3\u30af\u30b9\u306e\u30de\u30b9\u30ad\u30f3\u30b0\u30fb\u30d5\u30a3\u30eb\u5024</li>\n<li><span translate=no>_^_4_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">\u30d5\u30e9\u30c3\u30b7\u30e5\u30a2\u30c6\u30f3\u30b7\u30e7\u30f3</a>\u3092\u4f7f\u7528\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u6307\u5b9a\u3057\u307e\u3059</li></ul>\n",
"GPT-NeoX Model Definition": "GPT-NeoX \u30e2\u30c7\u30eb\u5b9a\u7fa9",
"This is the model definition of GPT-NeoX.": "\u3053\u308c\u306f GPT-NeoX \u306e\u30e2\u30c7\u30eb\u5b9a\u7fa9\u3067\u3059\u3002"
}
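
The file above is the raw translation cache: each key is an English HTML snippet from the annotated GPT-NeoX source, each value is its Japanese rendering, and `_^_n_^_` placeholders inside `<span translate=no>` wrappers protect code fragments, equations, and links from translation. As a rough, hypothetical sketch (not code from the repository) of how a cache in this format could be consumed:

```python
import json
import re


def translate(cache: dict, english_html: str, fragments: list) -> str:
    """Look up a snippet and restore its protected fragments.

    `fragments` holds the untranslated pieces in placeholder order;
    falls back to the English text when the snippet is not cached.
    """
    translated = cache.get(english_html, english_html)
    # Substitute each `_^_n_^_` placeholder with its original fragment.
    return re.sub(r"_\^_(\d+)_\^_",
                  lambda m: fragments[int(m.group(1))],
                  translated)


with open("model.ja.json", encoding="utf-8") as f:
    cache = json.load(f)

# The "\n" in the Python literal matches the newline that the JSON key
# encodes as \n.
print(translate(cache, "<h2>Attention layer</h2>\n", []))
```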