Path: blob/master/translate_cache/neox/model.zh.json
{1"<h1>GPT-NeoX Model</h1>\n<p>Here is the code for layers of GPT-NeoX model and the code to load 20B checkpoint.</p>\n<p>The method <span translate=no>_^_0_^_</span> in the layers load the checkpoints of that layer. The checkpoint loading helpers are on <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>GPT-NEOX \u578b\u53f7</h1>\n<p>\u4ee5\u4e0b\u662f GPT-NEOX \u6a21\u578b\u5c42\u7684\u4ee3\u7801\u548c\u52a0\u8f7d 20B \u68c0\u67e5\u70b9\u7684\u4ee3\u7801\u3002</p>\n<p>\u56fe\u5c42<span translate=no>_^_0_^_</span>\u4e2d\u7684\u65b9\u6cd5\u52a0\u8f7d\u8be5\u5c42\u7684\u68c0\u67e5\u70b9\u3002\u68c0\u67e5\u70b9\u52a0\u8f7d\u52a9\u624b\u5df2\u542f\u7528 <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n",2"<h2>Attention layer</h2>\n": "<h2>\u6ce8\u610f\u5c42</h2>\n",3"<h2>Embedding layer</h2>\n<p>This is a standard embeddings layer with code to load the checkpoint.</p>\n": "<h2>\u5d4c\u5165\u5c42</h2>\n<p>\u8fd9\u662f\u4e00\u4e2a\u6807\u51c6\u7684\u5d4c\u5165\u5c42\uff0c\u5176\u4e2d\u5305\u542b\u7528\u4e8e\u52a0\u8f7d\u68c0\u67e5\u70b9\u7684\u4ee3\u7801\u3002</p>\n",4"<h2>Feedforward Network</h2>\n": "<h2>\u524d\u9988\u7f51\u7edc</h2>\n",5"<h2>Final normalization layer</h2>\n": "<h2>\u6700\u7ec8\u5f52\u4e00\u5316\u5c42</h2>\n",6"<h2>Rotary Positional Embeddings</h2>\n<p>GPT-NeoX uses <a href=\"https://arxiv.org/abs/2104.09864\">rotary positional embeddings (RoPE)</a>.</p>\n<p>WE have annotated implementation of RoPE <a href=\"https://nn.labml.ai/transformers/rope/index.html\">here</a> with more notes the theory.</p>\n": "<h2>\u65cb\u8f6c\u4f4d\u7f6e\u5d4c\u5165</h2>\n<p>GPT-NEOX \u4f7f\u7528<a href=\"https://arxiv.org/abs/2104.09864\">\u65cb\u8f6c\u4f4d\u7f6e\u5d4c\u5165\uff08RoP\uff09</a>\u3002</p>\n<p>\u6211\u4eec<a href=\"https://nn.labml.ai/transformers/rope/index.html\">\u5728\u8fd9\u91cc</a>\u6ce8\u91ca\u4e86 RoPe \u7684\u5b9e\u73b0\uff0c\u5e76\u9644\u4e0a\u4e86\u66f4\u591a\u5173\u4e8e\u7406\u8bba\u7684\u6ce8\u91ca\u3002</p>\n",7"<h2>Transformer Layer</h2>\n": "<h2>\u53d8\u538b\u5668\u5c42</h2>\n",8"<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don't need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span> is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span> is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span> is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span> are the set of layers to be used. All layers will be used if None. 
"<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don't need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span> is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span> is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span> is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span> are the set of layers to be used. All layers will be used if None. This is used to test smaller versions of the model with fewer layers </li>\n<li><span translate=no>_^_6_^_</span> specifies whether to clone the transformer layers (a bit faster) </li>\n<li><span translate=no>_^_7_^_</span> is the data type of the model </li>\n<li><span translate=no>_^_8_^_</span> is the device of the model </li>\n<li><span translate=no>_^_9_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_10_^_</span> is the threshold <span translate=no>_^_11_^_</span> used to separate outlier features </li>\n<li><span translate=no>_^_12_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<h3>用于创建层的生成器</h3>\n<p>各层按与检查点相同的顺序生成。</p>\n<p>当某一层不可用时，它给出 <span translate=no>_^_0_^_</span>；我们使用与 NeoX 相同的层索引，其中有两个变换层在我们的实现中并不需要。</p>\n<ul><li><span translate=no>_^_1_^_</span> 是词汇表中的词元数量</li>\n<li><span translate=no>_^_2_^_</span> 是嵌入中的特征数量</li>\n<li><span translate=no>_^_3_^_</span> 是 Transformer 层数</li>\n<li><span translate=no>_^_4_^_</span> 是注意力头的数量</li>\n<li><span translate=no>_^_5_^_</span> 是要使用的层的集合。若为 None，则使用所有层。这用于测试层数较少的较小版本模型</li>\n<li><span translate=no>_^_6_^_</span> 指定是否克隆 Transformer 层（稍快一些）</li>\n<li><span translate=no>_^_7_^_</span> 是模型的数据类型</li>\n<li><span translate=no>_^_8_^_</span> 是模型所在的设备</li>\n<li><span translate=no>_^_9_^_</span> 指定是否使用 int8 量化</li>\n<li><span translate=no>_^_10_^_</span> 是用于分离离群值特征的阈值 <span translate=no>_^_11_^_</span></li>\n<li><span translate=no>_^_12_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n",
"<h3>Generator to get layers</h3>\n": "<h3>用于获取层的生成器</h3>\n",
"<h3>Generator to load layers</h3>\n": "<h3>用于加载层的生成器</h3>\n",
"<h3>Returns the total number of layers</h3>\n": "<h3>返回总层数</h3>\n",
"<h3>Rotate the features</h3>\n<p><span translate=no>_^_0_^_</span></p>\n": "<h3>旋转特征</h3>\n<p><span translate=no>_^_0_^_</span></p>\n",
"<h4>Calculate the causal mask</h4>\n<ul><li><span translate=no>_^_0_^_</span> has shape <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n": "<h4>计算因果掩码</h4>\n<ul><li><span translate=no>_^_0_^_</span> 的形状为 <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n",
"<h4>Creates and caches a layer</h4>\n<p>Copying cached layers is faster than initializing new layers because it takes time to initialize parameters.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the name of the layer </li>\n<li><span translate=no>_^_1_^_</span> is the function to create the layer </li>\n<p><em>Returns</em> the created layer or a copy of the cached layer</p></ul>\n": "<h4>创建并缓存层</h4>\n<p>复制缓存的层比初始化新层更快，因为初始化参数需要时间。</p>\n<ul><li><span translate=no>_^_0_^_</span> 是层的名称</li>\n<li><span translate=no>_^_1_^_</span> 是创建该层的函数</li>\n<p><em>返回</em>创建的层或缓存层的副本</p></ul>\n",
"<h4>Prepares the layer for usage</h4>\n<p>We move the layer to the device and convert it to the correct data type</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<h4>准备层以供使用</h4>\n<p>我们将层移动到设备上并转换为正确的数据类型</p>\n<ul><li><span translate=no>_^_0_^_</span> 是要准备的层</li>\n<p><em>返回</em>准备好的层</p></ul>\n",
"<p> </p>\n": "<p></p>\n",
"<p> <a id=\"post_load_prepare\"></a></p>\n<h3>Layer transformations after loading the checkpoint</h3>\n<p>This function implements layer transformations after loading the checkpoint.</p>\n<p>Currently, it only applies the int8 quantization.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<li><span translate=no>_^_1_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_2_^_</span> is the device of the model </li>\n<li><span translate=no>_^_3_^_</span> is the threshold <span translate=no>_^_4_^_</span> used to separate outlier features </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<p><a id=\"post_load_prepare\"></a></p>\n<h3>加载检查点后的层变换</h3>\n<p>此函数实现加载检查点后的层变换。</p>\n<p>目前，它仅应用 int8 量化。</p>\n<ul><li><span translate=no>_^_0_^_</span> 是要准备的层</li>\n<li><span translate=no>_^_1_^_</span> 指定是否使用 int8 量化</li>\n<li><span translate=no>_^_2_^_</span> 是模型所在的设备</li>\n<li><span translate=no>_^_3_^_</span> 是用于分离离群值特征的阈值 <span translate=no>_^_4_^_</span></li>\n<p><em>返回</em>准备好的层</p></ul>\n",
"<p> <span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p> Code to load the checkpoint</p>\n": "<p>加载检查点的代码</p>\n",
"<p> Readout layer</p>\n": "<p>读出层</p>\n",
"<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> </p>\n": "<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Add RoPE embeddings </p>\n": "<p>添加 RoPE 嵌入</p>\n",
"<p>\u6dfb\u52a0\u7ef3\u7d22\u5d4c\u5165</p>\n",24"<p>Add head dimension </p>\n": "<p>\u6dfb\u52a0\u5934\u90e8\u5c3a\u5bf8</p>\n",25"<p>Add them and the residual connection </p>\n": "<p>\u6dfb\u52a0\u5b83\u4eec\u548c\u5269\u4f59\u7684\u8fde\u63a5</p>\n",26"<p>Apply mask </p>\n": "<p>\u6d82\u62b9\u9762\u819c</p>\n",27"<p>Attention layer </p>\n": "<p>\u6ce8\u610f\u5c42</p>\n",28"<p>Attention output transform </p>\n": "<p>\u6ce8\u610f\u529b\u8f93\u51fa\u53d8\u6362</p>\n",29"<p>Attention query, key and value transform </p>\n": "<p>\u6ce8\u610f\u529b\u67e5\u8be2\u3001\u5173\u952e\u548c\u4ef7\u503c\u8f6c\u6362</p>\n",30"<p>Attention scaling factor </p>\n": "<p>\u6ce8\u610f\u529b\u7f29\u653e\u7cfb\u6570</p>\n",31"<p>Attention softmax </p>\n": "<p>\u6ce8\u610f softmax</p>\n",32"<p>Attention softmax module </p>\n": "<p>\u6ce8\u610f softmax \u6a21\u5757</p>\n",33"<p>Base for <span translate=no>_^_0_^_</span> </p>\n": "<p>\u57fa\u5730<span translate=no>_^_0_^_</span></p>\n",34"<p>Cache <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7f13\u5b58<span translate=no>_^_0_^_</span>\u548c<span translate=no>_^_1_^_</span></p>\n",35"<p>Cache them </p>\n": "<p>\u7f13\u5b58\u5b83\u4eec</p>\n",36"<p>Calculate <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> in fp32 </p>\n": "<p><span translate=no>_^_1_^_</span>\u5728 fp32 \u4e2d\u8ba1\u7b97<span translate=no>_^_0_^_</span>\u548c</p>\n",37"<p>Concatenate so that for row <span translate=no>_^_0_^_</span> we have</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u8fde\u63a5\u8fd9\u6837<span translate=no>_^_0_^_</span>\u6211\u4eec\u5c31\u6709 row</p>\n<p><span translate=no>_^_1_^_</span></p>\n",38"<p>Concatenate the past </p>\n": "<p>\u4e32\u8054\u8fc7\u53bb</p>\n",39"<p>Concatenate with features that didn't get RoPE embeddings </p>\n": "<p>\u8fde\u63a5\u672a\u83b7\u5f97 RoPe \u5d4c\u5165\u7684\u529f\u80fd</p>\n",40"<p>Contraction linear layer </p>\n": "<p>\u6536\u7f29\u7ebf\u6027\u5c42</p>\n",41"<p>Convert the linear layers </p>\n": "<p>\u8f6c\u6362\u7ebf\u6027\u56fe\u5c42</p>\n",42"<p>Convert to fp32 if the current dtype is fp16 </p>\n": "<p>\u5982\u679c\u5f53\u524d\u6570\u636e\u7c7b\u578b\u4e3a fp16\uff0c\u5219\u8f6c\u6362\u4e3a fp32</p>\n",43"<p>Create mask </p>\n": "<p>\u521b\u5efa\u906e\u7f69</p>\n",44"<p>Disable auto-casting to fp16 for attention computation </p>\n": "<p>\u7981\u7528\u81ea\u52a8\u6295\u5c04\u5230 fp16 \u4ee5\u8fdb\u884c\u6ce8\u610f\u529b\u8ba1\u7b97</p>\n",45"<p>Do not cast for bfloat </p>\n": "<p>\u4e0d\u8981\u4e3a bfloat \u8fdb\u884c\u6295\u5c04</p>\n",46"<p>Embedding layer </p>\n": "<p>\u5d4c\u5165\u5c42</p>\n",47"<p>Expansion linear layer </p>\n": "<p>\u6269\u5c55\u7ebf\u6027\u5c42</p>\n",48"<p>FFN first transform </p>\n": "<p>FFN \u9996\u6b21\u6539\u9020</p>\n",49"<p>FFN layer </p>\n": "<p>FFN \u5c42</p>\n",50"<p>FFN second transform </p>\n": "<p>FFN \u7b2c\u4e8c\u6b21\u53d8\u6362</p>\n",51"<p>Final linear layer </p>\n": "<p>\u6700\u540e\u7684\u7ebf\u6027\u5c42</p>\n",52"<p>Final normalization layer </p>\n": "<p>\u6700\u7ec8\u5f52\u4e00\u5316\u5c42</p>\n",53"<p>GELU activation </p>\n": "<p>GELU \u6fc0\u6d3b</p>\n",54"<p>Get attention weighted values </p>\n": "<p>\u83b7\u53d6\u6ce8\u610f\u529b\u52a0\u6743\u503c</p>\n",55"<p>Get causal mask </p>\n": "<p>\u83b7\u5f97\u56e0\u679c\u53e3\u7f69</p>\n",56"<p>Get default values if not specified </p>\n": "<p>\u5982\u679c\u672a\u6307\u5b9a\uff0c\u5219\u83b7\u53d6\u9ed8\u8ba4\u503c</p>\n",57"<p>Get position indexes <span 
"<p>Get position indexes <span translate=no>_^_0_^_</span> </p>\n": "<p>获取位置索引 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get query, key and value embeddings (all concatenated). The last dimension size will change from n_hidden -> <span translate=no>_^_0_^_</span> </p>\n": "<p>获取查询、键和值的嵌入（全部拼接在一起）。最后一个维度的大小将从 n_hidden 变为 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get the actual sequence length </p>\n": "<p>获取实际序列长度</p>\n",
"<p>Get the past keys and values. These will have shape <span translate=no>_^_0_^_</span> </p>\n": "<p>获取过去的键和值。它们的形状为 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get the sin and cos values from the cache </p>\n": "<p>从缓存中获取 sin 和 cos 值</p>\n",
"<p>Get the state id's. We use to retrieve previous states and store the next states </p>\n": "<p>获取状态 ID。我们用它来检索之前的状态并存储下一个状态</p>\n",
"<p>If there's cache </p>\n": "<p>如果有缓存</p>\n",
"<p>If we are caching the states of previous tokens </p>\n": "<p>如果我们正在缓存之前词元的状态</p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> </p>\n": "<p>初始化 <span translate=no>_^_0_^_</span></p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> cache </p>\n": "<p>初始化 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span> 缓存</p>\n",
"<p>Layer norm before FFN </p>\n": "<p>FFN 之前的层归一化</p>\n",
"<p>Layer norm before attention </p>\n": "<p>注意力之前的层归一化</p>\n",
"<p>Layer normalization before FFN </p>\n": "<p>FFN 之前的层归一化</p>\n",
"<p>Layer normalization before attention </p>\n": "<p>注意力之前的层归一化</p>\n",
"<p>Linear layer for query, key and value </p>\n": "<p>用于查询、键和值的线性层</p>\n",
"<p>NeoX runs attention and feedforward network in parallel </p>\n": "<p>NeoX 并行运行注意力和前馈网络</p>\n",
"<p>No cache - simply add RoPE embeddings </p>\n": "<p>没有缓存，直接添加 RoPE 嵌入</p>\n",
"<p>Number of features for RoPE </p>\n": "<p>RoPE 的特征数量</p>\n",
"<p>Number of features per head </p>\n": "<p>每个头的特征数量</p>\n",
"<p>Offset of the current embeddings </p>\n": "<p>当前嵌入的偏移量</p>\n",
"<p>Only convert the linear layers in the transformer layers </p>\n": "<p>仅转换 Transformer 层中的线性层</p>\n",
"<p>Otherwise, use normal attention </p>\n": "<p>否则，使用普通注意力</p>\n",
"<p>Query and key lengths </p>\n": "<p>查询和键的长度</p>\n",
"<p>Readout layer </p>\n": "<p>读出层</p>\n",
"<p>\u4ece<span translate=no>_^_0_^_</span> <a href=\"batch_size, seq_len, n_hidden\">batch_size\u3001seq_len\u3001n_hidden \u8fdb\u884c\u91cd\u5851</a> `</p>\n",82"<p>Residual connection </p>\n": "<p>\u5269\u4f59\u8fde\u63a5</p>\n",83"<p>Return from cache </p>\n": "<p>\u4ece\u7f13\u5b58\u4e2d\u8fd4\u56de</p>\n",84"<p>RoPE embedding module </p>\n": "<p>\u7ef3\u7d22\u5d4c\u5165\u6a21\u5757</p>\n",85"<p>RoPE embeddings</p>\n<span translate=no>_^_0_^_</span><p>for <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7ef3\u7d22\u5d4c\u5165</p>\n<span translate=no>_^_0_^_</span><p>\u5bf9\u4e8e<span translate=no>_^_1_^_</span></p>\n",86"<p>Save the current state </p>\n": "<p>\u4fdd\u5b58\u5f53\u524d\u72b6\u6001</p>\n",87"<p>Scale attention </p>\n": "<p>\u7f29\u653e\u6ce8\u610f\u529b</p>\n",88"<p>Skip if not using int8 quantization </p>\n": "<p>\u5982\u679c\u4e0d\u4f7f\u7528 int8 \u91cf\u5316\u5219\u8df3\u8fc7</p>\n",89"<p>Split into heads by changing the shape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u901a\u8fc7\u5c06\u5f62\u72b6\u6539\u4e3a\u5206\u6210\u5934\u90e8<span translate=no>_^_0_^_</span></p>\n",90"<p>Split into query, key and value each of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5206\u4e3a\u67e5\u8be2\u3001\u952e\u548c\u503c\u5404\u5f62\u72b6<span translate=no>_^_0_^_</span></p>\n",91"<p>Split the features. We apply RoPE to only <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u62c6\u5206\u8981\u7d20\u3002\u6211\u4eec\u4ec5\u5c06 RoPe \u5e94\u7528\u4e8e\u8981<span translate=no>_^_0_^_</span>\u7d20</p>\n",92"<p>Stack them into shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5c06\u5b83\u4eec\u5806\u53e0\u6210\u5f62\u72b6<span translate=no>_^_0_^_</span></p>\n",93"<p>The output is of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u8f93\u51fa\u7684\u5f62\u72b6\u662f\u8fd9\u6837\u7684<span translate=no>_^_0_^_</span></p>\n",94"<p>To cache causal mask </p>\n": "<p>\u7f13\u5b58\u56e0\u679c\u63a9\u7801</p>\n",95"<p>To store <span translate=no>_^_0_^_</span> for the features </p>\n": "<p>\u4e3a\u8981\u7d20\u5b58\u50a8<span translate=no>_^_0_^_</span></p>\n",96"<p>Transformer layer </p>\n": "<p>\u53d8\u538b\u5668\u5c42</p>\n",97"<p>Transformer layers </p>\n": "<p>\u53d8\u538b\u5668\u5c42</p>\n",98"<p>Use <span translate=no>_^_0_^_</span> defined in <a href=\"./utils/llm_int8.html\">utilities</a>. </p>\n": "<p>\u4f7f\u7528\u5728<a href=\"./utils/llm_int8.html\">\u5b9e\u7528\u7a0b\u5e8f</a>\u4e2d<span translate=no>_^_0_^_</span>\u5b9a\u4e49\u3002</p>\n",99"<p>Use flash attention </p>\n": "<p>\u4f7f\u7528\u95ea\u5149\u706f\u6ce8\u610f\u529b</p>\n",100"<ul><li><span translate=no>_^_0_^_</span> are the embeddings of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u5f62\u72b6\u7684\u5d4c\u5165<span translate=no>_^_1_^_</span></li></ul>\n",101"<ul><li><span translate=no>_^_0_^_</span> are the token ids of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u5f62\u72b6\u7684\u4ee4\u724c ID<span translate=no>_^_1_^_</span></li></ul>\n",102"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> is the starting position of <span translate=no>_^_3_^_</span>. 
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> is the starting position of <span translate=no>_^_3_^_</span>. This is <span translate=no>_^_4_^_</span> when we have cached the keys and queries of previous positions</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 的形状为 <span translate=no>_^_1_^_</span></li>\n<li><span translate=no>_^_2_^_</span> 是 <span translate=no>_^_3_^_</span> 的起始位置。当我们缓存了先前位置的键和查询时，它为 <span translate=no>_^_4_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 的形状为 <span translate=no>_^_1_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the number of heads </li>\n<li><span translate=no>_^_2_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>Out implementation doesn't include dropout</em>.</p>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li>\n<li><span translate=no>_^_1_^_</span> 是头的数量</li>\n<li><span translate=no>_^_2_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>我们的实现不包含 dropout</em>。</p>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the size of the vocabulary</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li>\n<li><span translate=no>_^_1_^_</span> 是词汇表的大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the number of features for RoPE embeddings </li>\n<li><span translate=no>_^_1_^_</span> is the base for <span translate=no>_^_2_^_</span>, which defaults to <span translate=no>_^_3_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是 RoPE 嵌入的特征数量</li>\n<li><span translate=no>_^_1_^_</span> 是 <span translate=no>_^_2_^_</span> 的底数，默认为 <span translate=no>_^_3_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the size of the vocabulary </li>\n<li><span translate=no>_^_1_^_</span> is the size of the embeddings</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是词汇表的大小</li>\n<li><span translate=no>_^_1_^_</span> 是嵌入的大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> the number of features in embeddings </li>\n<li><span translate=no>_^_1_^_</span> the number of attention heads </li>\n<li><span translate=no>_^_2_^_</span> percentage of features to add RoPE embeddings </li>\n<li><span translate=no>_^_3_^_</span> masking fill value for attention matrix </li>\n<li><span translate=no>_^_4_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 嵌入中的特征数量</li>\n<li><span translate=no>_^_1_^_</span> 注意力头的数量</li>\n<li><span translate=no>_^_2_^_</span> 要添加 RoPE 嵌入的特征百分比</li>\n<li><span translate=no>_^_3_^_</span> 注意力矩阵的掩码填充值</li>\n<li><span translate=no>_^_4_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n",
"GPT-NeoX Model Definition": "GPT-NeoX 模型定义",
"This is the model definition of GPT-NeoX.": "这是 GPT-NeoX 的模型定义。"
}