Path: blob/master/translate_cache/neox/model.zh.json
{1"<h1>GPT-NeoX Model</h1>\n<p>Here is the code for layers of GPT-NeoX model and the code to load 20B checkpoint.</p>\n<p>The method <span translate=no>_^_0_^_</span> in the layers load the checkpoints of that layer. The checkpoint loading helpers are on <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>GPT-NEOX \u578b\u53f7</h1>\n<p>\u4ee5\u4e0b\u662f GPT-NEOX \u6a21\u578b\u5c42\u7684\u4ee3\u7801\u548c\u52a0\u8f7d 20B \u68c0\u67e5\u70b9\u7684\u4ee3\u7801\u3002</p>\n<p>\u56fe\u5c42<span translate=no>_^_0_^_</span>\u4e2d\u7684\u65b9\u6cd5\u52a0\u8f7d\u8be5\u5c42\u7684\u68c0\u67e5\u70b9\u3002\u68c0\u67e5\u70b9\u52a0\u8f7d\u52a9\u624b\u5df2\u542f\u7528 <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n",2"<h2>Attention layer</h2>\n": "<h2>\u6ce8\u610f\u5c42</h2>\n",3"<h2>Embedding layer</h2>\n<p>This is a standard embeddings layer with code to load the checkpoint.</p>\n": "<h2>\u5d4c\u5165\u5c42</h2>\n<p>\u8fd9\u662f\u4e00\u4e2a\u6807\u51c6\u7684\u5d4c\u5165\u5c42\uff0c\u5176\u4e2d\u5305\u542b\u7528\u4e8e\u52a0\u8f7d\u68c0\u67e5\u70b9\u7684\u4ee3\u7801\u3002</p>\n",4"<h2>Feedforward Network</h2>\n": "<h2>\u524d\u9988\u7f51\u7edc</h2>\n",5"<h2>Final normalization layer</h2>\n": "<h2>\u6700\u7ec8\u5f52\u4e00\u5316\u5c42</h2>\n",6"<h2>Rotary Positional Embeddings</h2>\n<p>GPT-NeoX uses <a href=\"https://arxiv.org/abs/2104.09864\">rotary positional embeddings (RoPE)</a>.</p>\n<p>WE have annotated implementation of RoPE <a href=\"https://nn.labml.ai/transformers/rope/index.html\">here</a> with more notes the theory.</p>\n": "<h2>\u65cb\u8f6c\u4f4d\u7f6e\u5d4c\u5165</h2>\n<p>GPT-NEOX \u4f7f\u7528<a href=\"https://arxiv.org/abs/2104.09864\">\u65cb\u8f6c\u4f4d\u7f6e\u5d4c\u5165\uff08RoP\uff09</a>\u3002</p>\n<p>\u6211\u4eec<a href=\"https://nn.labml.ai/transformers/rope/index.html\">\u5728\u8fd9\u91cc</a>\u6ce8\u91ca\u4e86 RoPe \u7684\u5b9e\u73b0\uff0c\u5e76\u9644\u4e0a\u4e86\u66f4\u591a\u5173\u4e8e\u7406\u8bba\u7684\u6ce8\u91ca\u3002</p>\n",7"<h2>Transformer Layer</h2>\n": "<h2>\u53d8\u538b\u5668\u5c42</h2>\n",8"<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don't need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span> is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span> is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span> is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span> are the set of layers to be used. All layers will be used if None. 
"<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don't need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span> is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span> is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span> is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span> is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span> are the set of layers to be used. All layers will be used if None. This is used to test smaller versions of the model with fewer layers </li>\n<li><span translate=no>_^_6_^_</span> specifies whether to clone the transformer layers (a bit faster) </li>\n<li><span translate=no>_^_7_^_</span> is the data type of the model </li>\n<li><span translate=no>_^_8_^_</span> is the device of the model </li>\n<li><span translate=no>_^_9_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_10_^_</span> is the threshold <span translate=no>_^_11_^_</span> used to separate outlier features </li>\n<li><span translate=no>_^_12_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<h3>用于创建层的生成器</h3>\n<p>各层按与检查点相同的顺序生成。</p>\n<p>当某一层不可用时，它给出 <span translate=no>_^_0_^_</span>；我们使用与 NeoX 相同的层索引，其中有两个变换层在我们的实现中并不需要。</p>\n<ul><li><span translate=no>_^_1_^_</span> 是词汇表中的词元数量</li>\n<li><span translate=no>_^_2_^_</span> 是嵌入中的特征数量</li>\n<li><span translate=no>_^_3_^_</span> 是 Transformer 层数</li>\n<li><span translate=no>_^_4_^_</span> 是注意力头的数量</li>\n<li><span translate=no>_^_5_^_</span> 是要使用的层的集合。若为 None，则使用所有层。这用于测试层数较少的较小版本模型</li>\n<li><span translate=no>_^_6_^_</span> 指定是否克隆 Transformer 层（稍快一些）</li>\n<li><span translate=no>_^_7_^_</span> 是模型的数据类型</li>\n<li><span translate=no>_^_8_^_</span> 是模型所在的设备</li>\n<li><span translate=no>_^_9_^_</span> 指定是否使用 int8 量化</li>\n<li><span translate=no>_^_10_^_</span> 是用于分离离群值特征的阈值 <span translate=no>_^_11_^_</span></li>\n<li><span translate=no>_^_12_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n",
"<h3>Generator to get layers</h3>\n": "<h3>用于获取层的生成器</h3>\n",
"<h3>Generator to load layers</h3>\n": "<h3>用于加载层的生成器</h3>\n",
"<h3>Returns the total number of layers</h3>\n": "<h3>返回总层数</h3>\n",
"<h3>Rotate the features</h3>\n<p><span translate=no>_^_0_^_</span></p>\n": "<h3>旋转特征</h3>\n<p><span translate=no>_^_0_^_</span></p>\n",
"<h4>Calculate the causal mask</h4>\n<ul><li><span translate=no>_^_0_^_</span> has shape <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n": "<h4>计算因果掩码</h4>\n<ul><li><span translate=no>_^_0_^_</span> 的形状为 <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n",
"<h4>Creates and caches a layer</h4>\n<p>Copying cached layers is faster than initializing new layers because it takes time to initialize parameters.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the name of the layer </li>\n<li><span translate=no>_^_1_^_</span> is the function to create the layer </li>\n<p><em>Returns</em> the created layer or a copy of the cached layer</p></ul>\n": "<h4>创建并缓存层</h4>\n<p>复制缓存的层比初始化新层更快，因为初始化参数需要时间。</p>\n<ul><li><span translate=no>_^_0_^_</span> 是层的名称</li>\n<li><span translate=no>_^_1_^_</span> 是创建该层的函数</li>\n<p><em>返回</em>创建的层或缓存层的副本</p></ul>\n",
"<h4>Prepares the layer for usage</h4>\n<p>We move the layer to the device and convert it to the correct data type</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<h4>准备层以供使用</h4>\n<p>我们将层移动到设备上并转换为正确的数据类型</p>\n<ul><li><span translate=no>_^_0_^_</span> 是要准备的层</li>\n<p><em>返回</em>准备好的层</p></ul>\n",
"<p> </p>\n": "<p></p>\n",
"<p> <a id=\"post_load_prepare\"></a></p>\n<h3>Layer transformations after loading the checkpoint</h3>\n<p>This function implements layer transformations after loading the checkpoint.</p>\n<p>Currently, it only applies the int8 quantization.</p>\n<ul><li><span translate=no>_^_0_^_</span> is the layer to prepare </li>\n<li><span translate=no>_^_1_^_</span> specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_2_^_</span> is the device of the model </li>\n<li><span translate=no>_^_3_^_</span> is the threshold <span translate=no>_^_4_^_</span> used to separate outlier features </li>\n<p><em>Returns</em> the prepared layer</p></ul>\n": "<p><a id=\"post_load_prepare\"></a></p>\n<h3>加载检查点后的层变换</h3>\n<p>此函数实现加载检查点后的层变换。</p>\n<p>目前，它仅应用 int8 量化。</p>\n<ul><li><span translate=no>_^_0_^_</span> 是要准备的层</li>\n<li><span translate=no>_^_1_^_</span> 指定是否使用 int8 量化</li>\n<li><span translate=no>_^_2_^_</span> 是模型所在的设备</li>\n<li><span translate=no>_^_3_^_</span> 是用于分离离群值特征的阈值 <span translate=no>_^_4_^_</span></li>\n<p><em>返回</em>准备好的层</p></ul>\n",
"<p> <span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p> Code to load the checkpoint</p>\n": "<p>加载检查点的代码</p>\n",
"<p> Readout layer</p>\n": "<p>读出层</p>\n",
"<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> </p>\n": "<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Add RoPE embeddings </p>\n": "<p>添加 RoPE 嵌入</p>\n",
"<p>\u6dfb\u52a0\u7ef3\u7d22\u5d4c\u5165</p>\n",24"<p>Add head dimension </p>\n": "<p>\u6dfb\u52a0\u5934\u90e8\u5c3a\u5bf8</p>\n",25"<p>Add them and the residual connection </p>\n": "<p>\u6dfb\u52a0\u5b83\u4eec\u548c\u5269\u4f59\u7684\u8fde\u63a5</p>\n",26"<p>Apply mask </p>\n": "<p>\u6d82\u62b9\u9762\u819c</p>\n",27"<p>Attention layer </p>\n": "<p>\u6ce8\u610f\u5c42</p>\n",28"<p>Attention output transform </p>\n": "<p>\u6ce8\u610f\u529b\u8f93\u51fa\u53d8\u6362</p>\n",29"<p>Attention query, key and value transform </p>\n": "<p>\u6ce8\u610f\u529b\u67e5\u8be2\u3001\u5173\u952e\u548c\u4ef7\u503c\u8f6c\u6362</p>\n",30"<p>Attention scaling factor </p>\n": "<p>\u6ce8\u610f\u529b\u7f29\u653e\u7cfb\u6570</p>\n",31"<p>Attention softmax </p>\n": "<p>\u6ce8\u610f softmax</p>\n",32"<p>Attention softmax module </p>\n": "<p>\u6ce8\u610f softmax \u6a21\u5757</p>\n",33"<p>Base for <span translate=no>_^_0_^_</span> </p>\n": "<p>\u57fa\u5730<span translate=no>_^_0_^_</span></p>\n",34"<p>Cache <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7f13\u5b58<span translate=no>_^_0_^_</span>\u548c<span translate=no>_^_1_^_</span></p>\n",35"<p>Cache them </p>\n": "<p>\u7f13\u5b58\u5b83\u4eec</p>\n",36"<p>Calculate <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> in fp32 </p>\n": "<p><span translate=no>_^_1_^_</span>\u5728 fp32 \u4e2d\u8ba1\u7b97<span translate=no>_^_0_^_</span>\u548c</p>\n",37"<p>Concatenate so that for row <span translate=no>_^_0_^_</span> we have</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u8fde\u63a5\u8fd9\u6837<span translate=no>_^_0_^_</span>\u6211\u4eec\u5c31\u6709 row</p>\n<p><span translate=no>_^_1_^_</span></p>\n",38"<p>Concatenate the past </p>\n": "<p>\u4e32\u8054\u8fc7\u53bb</p>\n",39"<p>Concatenate with features that didn't get RoPE embeddings </p>\n": "<p>\u8fde\u63a5\u672a\u83b7\u5f97 RoPe \u5d4c\u5165\u7684\u529f\u80fd</p>\n",40"<p>Contraction linear layer </p>\n": "<p>\u6536\u7f29\u7ebf\u6027\u5c42</p>\n",41"<p>Convert the linear layers </p>\n": "<p>\u8f6c\u6362\u7ebf\u6027\u56fe\u5c42</p>\n",42"<p>Convert to fp32 if the current dtype is fp16 </p>\n": "<p>\u5982\u679c\u5f53\u524d\u6570\u636e\u7c7b\u578b\u4e3a fp16\uff0c\u5219\u8f6c\u6362\u4e3a fp32</p>\n",43"<p>Create mask </p>\n": "<p>\u521b\u5efa\u906e\u7f69</p>\n",44"<p>Disable auto-casting to fp16 for attention computation </p>\n": "<p>\u7981\u7528\u81ea\u52a8\u6295\u5c04\u5230 fp16 \u4ee5\u8fdb\u884c\u6ce8\u610f\u529b\u8ba1\u7b97</p>\n",45"<p>Do not cast for bfloat </p>\n": "<p>\u4e0d\u8981\u4e3a bfloat \u8fdb\u884c\u6295\u5c04</p>\n",46"<p>Embedding layer </p>\n": "<p>\u5d4c\u5165\u5c42</p>\n",47"<p>Expansion linear layer </p>\n": "<p>\u6269\u5c55\u7ebf\u6027\u5c42</p>\n",48"<p>FFN first transform </p>\n": "<p>FFN \u9996\u6b21\u6539\u9020</p>\n",49"<p>FFN layer </p>\n": "<p>FFN \u5c42</p>\n",50"<p>FFN second transform </p>\n": "<p>FFN \u7b2c\u4e8c\u6b21\u53d8\u6362</p>\n",51"<p>Final linear layer </p>\n": "<p>\u6700\u540e\u7684\u7ebf\u6027\u5c42</p>\n",52"<p>Final normalization layer </p>\n": "<p>\u6700\u7ec8\u5f52\u4e00\u5316\u5c42</p>\n",53"<p>GELU activation </p>\n": "<p>GELU \u6fc0\u6d3b</p>\n",54"<p>Get attention weighted values </p>\n": "<p>\u83b7\u53d6\u6ce8\u610f\u529b\u52a0\u6743\u503c</p>\n",55"<p>Get causal mask </p>\n": "<p>\u83b7\u5f97\u56e0\u679c\u53e3\u7f69</p>\n",56"<p>Get default values if not specified </p>\n": "<p>\u5982\u679c\u672a\u6307\u5b9a\uff0c\u5219\u83b7\u53d6\u9ed8\u8ba4\u503c</p>\n",57"<p>Get position indexes <span 
"<p>Get position indexes <span translate=no>_^_0_^_</span> </p>\n": "<p>获取位置索引 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get query, key and value embeddings (all concatenated). The last dimension size will change from n_hidden -> <span translate=no>_^_0_^_</span> </p>\n": "<p>获取查询、键和值的嵌入（全部拼接在一起）。最后一个维度的大小将从 n_hidden 变为 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get the actual sequence length </p>\n": "<p>获取实际序列长度</p>\n",
"<p>Get the past keys and values. These will have shape <span translate=no>_^_0_^_</span> </p>\n": "<p>获取过去的键和值。它们的形状为 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get the sin and cos values from the cache </p>\n": "<p>从缓存中获取 sin 和 cos 值</p>\n",
"<p>Get the state id's. We use to retrieve previous states and store the next states </p>\n": "<p>获取状态 ID。我们用它来检索之前的状态并存储下一个状态</p>\n",
"<p>If there's cache </p>\n": "<p>如果有缓存</p>\n",
"<p>If we are caching the states of previous tokens </p>\n": "<p>如果我们正在缓存之前词元的状态</p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> </p>\n": "<p>初始化 <span translate=no>_^_0_^_</span></p>\n",
"<p>Initialize <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> cache </p>\n": "<p>初始化 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span> 缓存</p>\n",
"<p>Layer norm before FFN </p>\n": "<p>FFN 之前的层归一化</p>\n",
"<p>Layer norm before attention </p>\n": "<p>注意力之前的层归一化</p>\n",
"<p>Layer normalization before FFN </p>\n": "<p>FFN 之前的层归一化</p>\n",
"<p>Layer normalization before attention </p>\n": "<p>注意力之前的层归一化</p>\n",
"<p>Linear layer for query, key and value </p>\n": "<p>用于查询、键和值的线性层</p>\n",
"<p>NeoX runs attention and feedforward network in parallel </p>\n": "<p>NeoX 并行运行注意力和前馈网络</p>\n",
"<p>No cache - simply add RoPE embeddings </p>\n": "<p>没有缓存，直接添加 RoPE 嵌入</p>\n",
"<p>Number of features for RoPE </p>\n": "<p>RoPE 的特征数量</p>\n",
"<p>Number of features per head </p>\n": "<p>每个头的特征数量</p>\n",
"<p>Offset of the current embeddings </p>\n": "<p>当前嵌入的偏移量</p>\n",
"<p>Only convert the linear layers in the transformer layers </p>\n": "<p>仅转换 Transformer 层中的线性层</p>\n",
"<p>Otherwise, use normal attention </p>\n": "<p>否则，使用普通注意力</p>\n",
"<p>Query and key lengths </p>\n": "<p>查询和键的长度</p>\n",
"<p>Readout layer </p>\n": "<p>读出层</p>\n",
"<p>\u4ece<span translate=no>_^_0_^_</span> <a href=\"batch_size, seq_len, n_hidden\">batch_size\u3001seq_len\u3001n_hidden \u8fdb\u884c\u91cd\u5851</a> `</p>\n",82"<p>Residual connection </p>\n": "<p>\u5269\u4f59\u8fde\u63a5</p>\n",83"<p>Return from cache </p>\n": "<p>\u4ece\u7f13\u5b58\u4e2d\u8fd4\u56de</p>\n",84"<p>RoPE embedding module </p>\n": "<p>\u7ef3\u7d22\u5d4c\u5165\u6a21\u5757</p>\n",85"<p>RoPE embeddings</p>\n<span translate=no>_^_0_^_</span><p>for <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7ef3\u7d22\u5d4c\u5165</p>\n<span translate=no>_^_0_^_</span><p>\u5bf9\u4e8e<span translate=no>_^_1_^_</span></p>\n",86"<p>Save the current state </p>\n": "<p>\u4fdd\u5b58\u5f53\u524d\u72b6\u6001</p>\n",87"<p>Scale attention </p>\n": "<p>\u7f29\u653e\u6ce8\u610f\u529b</p>\n",88"<p>Skip if not using int8 quantization </p>\n": "<p>\u5982\u679c\u4e0d\u4f7f\u7528 int8 \u91cf\u5316\u5219\u8df3\u8fc7</p>\n",89"<p>Split into heads by changing the shape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u901a\u8fc7\u5c06\u5f62\u72b6\u6539\u4e3a\u5206\u6210\u5934\u90e8<span translate=no>_^_0_^_</span></p>\n",90"<p>Split into query, key and value each of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5206\u4e3a\u67e5\u8be2\u3001\u952e\u548c\u503c\u5404\u5f62\u72b6<span translate=no>_^_0_^_</span></p>\n",91"<p>Split the features. We apply RoPE to only <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u62c6\u5206\u8981\u7d20\u3002\u6211\u4eec\u4ec5\u5c06 RoPe \u5e94\u7528\u4e8e\u8981<span translate=no>_^_0_^_</span>\u7d20</p>\n",92"<p>Stack them into shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5c06\u5b83\u4eec\u5806\u53e0\u6210\u5f62\u72b6<span translate=no>_^_0_^_</span></p>\n",93"<p>The output is of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u8f93\u51fa\u7684\u5f62\u72b6\u662f\u8fd9\u6837\u7684<span translate=no>_^_0_^_</span></p>\n",94"<p>To cache causal mask </p>\n": "<p>\u7f13\u5b58\u56e0\u679c\u63a9\u7801</p>\n",95"<p>To store <span translate=no>_^_0_^_</span> for the features </p>\n": "<p>\u4e3a\u8981\u7d20\u5b58\u50a8<span translate=no>_^_0_^_</span></p>\n",96"<p>Transformer layer </p>\n": "<p>\u53d8\u538b\u5668\u5c42</p>\n",97"<p>Transformer layers </p>\n": "<p>\u53d8\u538b\u5668\u5c42</p>\n",98"<p>Use <span translate=no>_^_0_^_</span> defined in <a href=\"./utils/llm_int8.html\">utilities</a>. </p>\n": "<p>\u4f7f\u7528\u5728<a href=\"./utils/llm_int8.html\">\u5b9e\u7528\u7a0b\u5e8f</a>\u4e2d<span translate=no>_^_0_^_</span>\u5b9a\u4e49\u3002</p>\n",99"<p>Use flash attention </p>\n": "<p>\u4f7f\u7528\u95ea\u5149\u706f\u6ce8\u610f\u529b</p>\n",100"<ul><li><span translate=no>_^_0_^_</span> are the embeddings of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u5f62\u72b6\u7684\u5d4c\u5165<span translate=no>_^_1_^_</span></li></ul>\n",101"<ul><li><span translate=no>_^_0_^_</span> are the token ids of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u5f62\u72b6\u7684\u4ee4\u724c ID<span translate=no>_^_1_^_</span></li></ul>\n",102"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> is the starting position of <span translate=no>_^_3_^_</span>. 
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> is the starting position of <span translate=no>_^_3_^_</span>. This is <span translate=no>_^_4_^_</span> when we have cached the keys and queries of previous positions</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 的形状为 <span translate=no>_^_1_^_</span></li>\n<li><span translate=no>_^_2_^_</span> 是 <span translate=no>_^_3_^_</span> 的起始位置。当我们缓存了先前位置的键和查询时，它为 <span translate=no>_^_4_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 的形状为 <span translate=no>_^_1_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the number of heads </li>\n<li><span translate=no>_^_2_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>Out implementation doesn't include dropout</em>.</p>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li>\n<li><span translate=no>_^_1_^_</span> 是头的数量</li>\n<li><span translate=no>_^_2_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>我们的实现不包含 dropout</em>。</p>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size </li>\n<li><span translate=no>_^_1_^_</span> is the size of the vocabulary</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li>\n<li><span translate=no>_^_1_^_</span> 是词汇表的大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the embedding size</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是嵌入大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the number of features for RoPE embeddings </li>\n<li><span translate=no>_^_1_^_</span> is the base for <span translate=no>_^_2_^_</span>, which defaults to <span translate=no>_^_3_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是 RoPE 嵌入的特征数量</li>\n<li><span translate=no>_^_1_^_</span> 是 <span translate=no>_^_2_^_</span> 的底数，默认为 <span translate=no>_^_3_^_</span></li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> is the size of the vocabulary </li>\n<li><span translate=no>_^_1_^_</span> is the size of the embeddings</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 是词汇表的大小</li>\n<li><span translate=no>_^_1_^_</span> 是嵌入的大小</li></ul>\n",
"<ul><li><span translate=no>_^_0_^_</span> the number of features in embeddings </li>\n<li><span translate=no>_^_1_^_</span> the number of attention heads </li>\n<li><span translate=no>_^_2_^_</span> percentage of features to add RoPE embeddings </li>\n<li><span translate=no>_^_3_^_</span> masking fill value for attention matrix </li>\n<li><span translate=no>_^_4_^_</span> specifies whether to use <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> 嵌入中的特征数量</li>\n<li><span translate=no>_^_1_^_</span> 注意力头的数量</li>\n<li><span translate=no>_^_2_^_</span> 要添加 RoPE 嵌入的特征百分比</li>\n<li><span translate=no>_^_3_^_</span> 注意力矩阵的掩码填充值</li>\n<li><span translate=no>_^_4_^_</span> 指定是否使用 <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n",
"GPT-NeoX Model Definition": "GPT-NeoX 模型定义",
"This is the model definition of GPT-NeoX.": "这是 GPT-NeoX 的模型定义。"
}