GitHub Repository: labmlai/annotated_deep_learning_paper_implementations
Path: blob/master/translate_cache/neox/model.si.json
{
 "<h1>GPT-NeoX Model</h1>\n<p>Here is the code for layers of GPT-NeoX model and the code to load 20B checkpoint.</p>\n<p>The method <span translate=no>_^_0_^_</span> in the layers load the checkpoints of that layer. The checkpoint loading helpers are on <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>\u0da2\u0dd3\u0db4\u0dd3\u0da7\u0dd3-\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba</h1>\n<p>\u0da2\u0dd3\u0db4\u0dd3\u0da7\u0dd3-\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba\u0dda \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0db3\u0dc4\u0dcf \u0d9a\u0dda\u0dad\u0dba \u0dc3\u0dc4 20B \u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd \u0db4\u0dd6\u0dbb\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d9a\u0dda\u0dad\u0dba \u0db8\u0dd9\u0db1\u0dca\u0db1. </p>\n<p>\u0dc3\u0dca\u0dae\u0dbb <span translate=no>_^_0_^_</span> \u0dc0\u0dbd \u0d87\u0dad\u0dd2 \u0d9a\u0dca\u0dbb\u0db8\u0dba \u0d91\u0db8 \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0dda \u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd\u0dc0\u0dbd\u0dca \u0db4\u0dd6\u0dbb\u0dab\u0dba \u0d9a\u0dbb\u0dba\u0dd2. \u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd\u0dc0\u0dbd\u0dca \u0db4\u0dd0\u0da7\u0dc0\u0dd3\u0db8\u0dda \u0dc3\u0dc4\u0dcf\u0dba\u0d9a\u0dba\u0db1\u0dca \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0dc0\u0dda <a href=\"checkpoint.html\"><span translate=no>_^_1_^_</span></a></p>\n",
 "<h2>Attention layer</h2>\n": "<h2>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dc3\u0dca\u0dae\u0dbb\u0dba</h2>\n",
 "<h2>Embedding layer</h2>\n<p>This is a standard embeddings layer with code to load the checkpoint.</p>\n": "<h2>\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dc3\u0dca\u0dae\u0dbb\u0dba</h2>\n<p>\u0db8\u0dd9\u0dba\u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd\u0da7 \u0db4\u0dd0\u0da7\u0dc0\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d9a\u0dda\u0dad\u0dba \u0dc3\u0dc4\u0dd2\u0dad \u0dc3\u0db8\u0dca\u0db8\u0dad \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0d9a\u0dd2. </p>\n",
 "<h2>Feedforward Network</h2>\n": "<h2>\u0db4\u0dca\u0dbb\u0dad\u0dd2\u0db4\u0ddd\u0dc2\u0dab\u0da2\u0dcf\u0dbd\u0dba</h2>\n",
 "<h2>Final normalization layer</h2>\n": "<h2>\u0d85\u0dc0\u0dc3\u0dcf\u0db1\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab \u0dc3\u0dca\u0dad\u0dbb\u0dba</h2>\n",
 "<h2>Rotary Positional Embeddings</h2>\n<p>GPT-NeoX uses <a href=\"https://arxiv.org/abs/2104.09864\">rotary positional embeddings (RoPE)</a>.</p>\n<p>WE have annotated implementation of RoPE <a href=\"https://nn.labml.ai/transformers/rope/index.html\">here</a> with more notes the theory.</p>\n": "<h2>\u0dbb\u0ddc\u0da7\u0dbb\u0dd2\u0dc3\u0dca\u0dae\u0dcf\u0db1\u0dd3\u0dba \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca</h2>\n<p>\u0da2\u0dd3\u0db4\u0dd3\u0da7\u0dd3-\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca <a href=\"https://arxiv.org/abs/2104.09864\">\u0db7\u0dca\u0dbb\u0db8\u0dab \u0dc3\u0dca\u0dae\u0dcf\u0db1\u0dd3\u0dba \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca (\u0d9a\u0db9\u0dba)</a>\u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0dba\u0dd2. </p>\n<p>\u0d85\u0db4\u0dd2\u0db1\u0dca\u0dba\u0dcf\u0dba \u0dc0\u0dd0\u0da9\u0dd2 \u0dc3\u0da7\u0dc4\u0db1\u0dca \u0dc3\u0db8\u0d9f <a href=\"https://nn.labml.ai/transformers/rope/index.html\">\u0db8\u0dd9\u0dc4\u0dd2</a> \u0d9a\u0db9\u0dba \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc0\u0dd2\u0dc3\u0dca\u0dad\u0dbb \u0d9a\u0dbb \u0d87\u0dad. </p>\n",
 "<h2>Transformer Layer</h2>\n": "<h2>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca\u0dc3\u0dca\u0dae\u0dbb\u0dba</h2>\n",
 "<h3>Generator to create layers</h3>\n<p>The layers are generated in the same order as checkpoints.</p>\n<p>It gives <span translate=no>_^_0_^_</span> when a layer is not available; we use the layer indices as NeoX and there are two transformation layers we don&#x27;t need in our implementation.</p>\n<ul><li><span translate=no>_^_1_^_</span>  is the number of tokens in the vocabulary </li>\n<li><span translate=no>_^_2_^_</span>  is the number of features in the embeddings </li>\n<li><span translate=no>_^_3_^_</span>  is the number of transformer layers </li>\n<li><span translate=no>_^_4_^_</span>  is the number of attention heads </li>\n<li><span translate=no>_^_5_^_</span>  are the set of layers to be used. All layers will be used if None.  This is used to test smaller versions of the model with fewer layers </li>\n<li><span translate=no>_^_6_^_</span>  specifies whether to clone the transformer layers (a bit faster) </li>\n<li><span translate=no>_^_7_^_</span>  is the data type of the model </li>\n<li><span translate=no>_^_8_^_</span>  is the device of the model </li>\n<li><span translate=no>_^_9_^_</span>  specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_10_^_</span>  is the threshold <span translate=no>_^_11_^_</span> used to separate outlier features </li>\n<li><span translate=no>_^_12_^_</span>  specifies whether to use  <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<h3>\u0dc3\u0dca\u0dae\u0dbb \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d8b\u0dad\u0dca\u0db4\u0dcf\u0daf\u0d9a \u0dba\u0db1\u0dca\u0dad\u0dca\u0dbb\u0dba</h3>\n<p>\u0dc3\u0dca\u0dae\u0dbb \u0da2\u0db1\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dd4 \u0dbd\u0db6\u0db1\u0dca\u0db1\u0dda \u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd\u0dc0\u0dbd\u0dca \u0db8\u0dd9\u0db1\u0dca \u0d91\u0d9a\u0db8 \u0d85\u0db1\u0dd4\u0db4\u0dd2\u0dc5\u0dd2\u0dc0\u0dd9\u0dbd\u0d9a\u0dd2.</p>\n<p>\u0dc3\u0dca\u0dae\u0dbb\u0dba\u0d9a\u0dca \u0db1\u0ddc\u0db8\u0dd0\u0dad\u0dd2<span translate=no>_^_0_^_</span> \u0dc0\u0dd2\u0da7 \u0d91\u0dba \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2; \u0d85\u0db4\u0dd2 \u0dc3\u0dca\u0dae\u0dbb \u0daf\u0dbb\u0dca\u0dc1\u0d9a \u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca \u0dbd\u0dd9\u0dc3 \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1 \u0d85\u0dad\u0dbb \u0d85\u0db4\u0d9c\u0dda \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda\u0daf\u0dd3 \u0d85\u0db4\u0da7 \u0d85\u0dc0\u0dc1\u0dca\u0dba \u0db1\u0ddc\u0dc0\u0db1 \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1 \u0dc3\u0dca\u0dae\u0dbb \u0daf\u0dd9\u0d9a\u0d9a\u0dca \u0dad\u0dd2\u0db6\u0dda.</p>\n<ul><li><span translate=no>_^_1_^_</span>\u0dba\u0db1\u0dd4 \u0dc0\u0da0\u0db1 \u0db8\u0dcf\u0dbd\u0dcf\u0dc0\u0dda \u0da7\u0ddd\u0d9a\u0db1 \u0d9c\u0dab\u0db1</li>\n<li><span translate=no>_^_2_^_</span>\u0db8\u0dd9\u0db8 \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0daf\u0dd3 \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0dc3\u0d82\u0d9b\u0dca\u0dba\u0dcf\u0dc0</li>\n<li><span translate=no>_^_3_^_</span>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca \u0dc3\u0dca\u0dae\u0dbb \u0d9c\u0dab\u0db1 \u0dc0\u0dda</li>\n<li><span translate=no>_^_4_^_</span>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0dba\u0ddc\u0db8\u0dd4 \u0db4\u0dca\u0dbb\u0db0\u0dcf\u0db1\u0dd3\u0db1\u0dca \u0dc3\u0d82\u0d9b\u0dca\u0dba\u0dcf\u0dc0 \u0dc0\u0dda</li>\n<li><span translate=no>_^_5_^_</span>\u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4 \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0db8\u0dd6\u0dc4\u0dba\u0dba\u0dd2. \u0d9a\u0dd2\u0dc3\u0dd2\u0dc0\u0d9a\u0dca \u0db1\u0ddc\u0db8\u0dd0\u0dad\u0dd2 \u0db1\u0db8\u0dca \u0dc3\u0dd2\u0dba\u0dbd\u0dd4\u0db8 \u0dc3\u0dca\u0dae\u0dbb \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1\u0dd4 \u0d87\u0dad. \u0d85\u0da9\u0dd4 \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0dc4\u0dd2\u0dad \u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba\u0dda \u0d9a\u0dd4\u0da9\u0dcf \u0d85\u0db1\u0dd4\u0dc0\u0dcf\u0daf\u0dba\u0db1\u0dca \u0db4\u0dbb\u0dd3\u0d9a\u0dca\u0dc2\u0dcf \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0db8\u0dd9\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0dba\u0dd2</li>\n<li><span translate=no>_^_6_^_</span>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca \u0dc3\u0dca\u0dae\u0dbb \u0d9a\u0dca\u0dbd\u0ddd\u0db1 \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2 (\u0da7\u0dd2\u0d9a\u0d9a\u0dca \u0dc0\u0dda\u0d9c\u0dc0\u0dad\u0dca)</li>\n<li><span translate=no>_^_7_^_</span>\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba\u0dda \u0daf\u0dad\u0dca\u0dad \u0dc0\u0dbb\u0dca\u0d9c\u0dba\u0dba\u0dd2</li>\n<li><span translate=no>_^_8_^_</span>\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba\u0dda \u0d8b\u0db4\u0dcf\u0d82\u0d9c\u0dba \u0dc0\u0dda</li>\n<li><span translate=no>_^_9_^_</span>INT8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0d9a\u0dbb\u0dab\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2</li>\n<li><span translate=no>_^_10_^_</span>\u0db4\u0dd2\u0da7\u0dad \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0dc0\u0dd9\u0db1\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8<span translate=no>_^_11_^_</span> \u0dc3\u0db3\u0dc4\u0dcf \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1 \u0d91\u0dc5\u0dd2\u0db4\u0dad\u0dca\u0dad \u0dc0\u0dda</li>\n<li><span translate=no>_^_12_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2</li></ul>\n",
 "<h3>Generator to get layers</h3>\n": "<h3>\u0dc3\u0dca\u0dae\u0dbb\u0dbd\u0db6\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d8b\u0dad\u0dca\u0db4\u0dcf\u0daf\u0d9a \u0dba\u0db1\u0dca\u0dad\u0dca\u0dbb\u0dba</h3>\n",
 "<h3>Generator to load layers</h3>\n": "<h3>\u0dc3\u0dca\u0dae\u0dbb\u0db4\u0dd0\u0da7\u0dc0\u0dd3\u0db8\u0da7 \u0d8b\u0dad\u0dca\u0db4\u0dcf\u0daf\u0d9a \u0dba\u0db1\u0dca\u0dad\u0dca\u0dbb\u0dba</h3>\n",
 "<h3>Returns the total number of layers</h3>\n": "<h3>\u0db8\u0dd4\u0dc5\u0dd4\u0dc3\u0dca\u0dae\u0dbb \u0d9c\u0dab\u0db1 \u0db1\u0dd0\u0dc0\u0dad \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2</h3>\n",
 "<h3>Rotate the features</h3>\n<p><span translate=no>_^_0_^_</span></p>\n": "<h3>\u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c\u0d9a\u0dbb\u0d9a\u0dc0\u0db1\u0dca\u0db1</h3>\n<p><span translate=no>_^_0_^_</span></p>\n",
 "<h4>Calculate the causal mask</h4>\n<ul><li><span translate=no>_^_0_^_</span> has shape <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">batch_size, query_seq_len, key_seq_len, n_heads</a></li></ul>\n": "<h4>\u0dc4\u0dda\u0dad\u0dd4\u0d86\u0dc0\u0dbb\u0dab \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1</h4>\n<ul><li><span translate=no>_^_0_^_</span> \u0dc4\u0dd0\u0da9\u0dba\u0dda <a href=\"batch_size, query_seq_len, key_seq_len, n_heads\">\u0d9a\u0dcf\u0dab\u0dca\u0da9_\u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba, \u0dc0\u0dd2\u0db8\u0dbb\u0dca_\u0dc3\u0dd9\u0d9a\u0dca_\u0dbd\u0db1\u0dca, \u0dba\u0dad\u0dd4\u0dbb\u0dd4_\u0dc3\u0dda\u0d9a\u0dca_\u0dbd\u0db1\u0dca, \u0d91\u0db1\u0dca_\u0dc4\u0dd9\u0da9\u0dca\u0dc3\u0dca</a>\u0d87\u0dad</li></ul>\n",
 "<h4>Creates and caches a layer</h4>\n<p>Copying cached layers is faster than initializing new layers because it takes time to initialize parameters.</p>\n<ul><li><span translate=no>_^_0_^_</span>  is the name of the layer </li>\n<li><span translate=no>_^_1_^_</span>  is the function to create the layer </li>\n<p><em>Returns</em>  the created layer or a copy of the cached layer</p></ul>\n": "<h4>\u0dc3\u0dca\u0dad\u0dbb\u0dba\u0d9a\u0dca\u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab\u0dba \u0d9a\u0dbb \u0dc4\u0dd0\u0da0\u0dca \u0d9a\u0dbb\u0dba\u0dd2</h4>\n<p>\u0db4\u0dbb\u0dcf\u0db8\u0dd2\u0dad\u0dd3\u0db1\u0dca\u0d86\u0dbb\u0db8\u0dca\u0db7 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d9a\u0dcf\u0dbd\u0dba \u0d9c\u0dad\u0dc0\u0db1 \u0db1\u0dd2\u0dc3\u0dcf \u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2 \u0dc3\u0dca\u0dae\u0dbb \u0db4\u0dd2\u0da7\u0db4\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0db1\u0dc0 \u0dc3\u0dca\u0dae\u0dbb \u0d86\u0dbb\u0db8\u0dca\u0db7 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0dc0\u0da9\u0dcf \u0dc0\u0dda\u0d9c\u0dc0\u0dad\u0dca \u0dc0\u0dda. </p>\n<ul><li><span translate=no>_^_0_^_</span> \u0dba\u0db1\u0dd4 \u0dc3\u0dca\u0dad\u0dbb\u0dba\u0dda \u0db1\u0db8\u0dba\u0dd2 </li>\n<li><span translate=no>_^_1_^_</span> \u0dc3\u0dca\u0dad\u0dbb\u0dba \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d9a\u0dcf\u0dbb\u0dca\u0dba\u0dba\u0dba\u0dd2 </li>\n<p>\u0dc3\u0dcf\u0daf\u0db1\u0dbd\u0daf \u0dc3\u0dca\u0dad\u0dbb\u0dba \u0dc4\u0ddd \u0d9a\u0dd0\u0da0\u0dca \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0dda \u0db4\u0dd2\u0da7\u0db4\u0dad\u0d9a\u0dca<em>\u0d86\u0db4\u0dc3\u0dd4 \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2</em> </p></ul>\n",
 "<h4>Prepares the layer for usage</h4>\n<p>We move the layer to the device and convert it to the correct data type</p>\n<ul><li><span translate=no>_^_0_^_</span>  is the layer to prepare </li>\n<p><em>Returns</em>  the prepared layer</p></ul>\n": "<h4>\u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dba\u0dc3\u0db3\u0dc4\u0dcf \u0dc3\u0dca\u0dad\u0dbb\u0dba \u0dc3\u0d9a\u0dc3\u0dca \u0d9a\u0dbb\u0dba\u0dd2</h4>\n<p>\u0d85\u0db4\u0dd2\u0dc3\u0dca\u0dad\u0dbb\u0dba \u0d8b\u0db4\u0dcf\u0d82\u0d9c\u0dba \u0dc0\u0dd9\u0dad \u0d9c\u0dd9\u0db1 \u0d9c\u0ddc\u0dc3\u0dca \u0db1\u0dd2\u0dc0\u0dd0\u0dbb\u0daf\u0dd2 \u0daf\u0dad\u0dca\u0dad \u0dc0\u0dbb\u0dca\u0d9c\u0dba\u0da7 \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba \u0d9a\u0dbb\u0db8\u0dd4</p>\n<ul><li><span translate=no>_^_0_^_</span> \u0dc3\u0d9a\u0dc3\u0dca \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4 \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0dba\u0dd2 </li>\n<p>\u0dc3\u0d9a\u0dc3\u0dca\u0d9a\u0dc5 \u0dc3\u0dca\u0dad\u0dbb\u0dba<em>\u0d86\u0db4\u0dc3\u0dd4 \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2</em> </p></ul>\n",
 "<p> </p>\n": "<p> </p>\n",
 "<p> <a id=\"post_load_prepare\"></a></p>\n<h3>Layer transformations after loading the checkpoint</h3>\n<p>This function implements layer transformations after loading the checkpoint.</p>\n<p>Currently, it only applies the int8 quantization.</p>\n<ul><li><span translate=no>_^_0_^_</span>  is the layer to prepare </li>\n<li><span translate=no>_^_1_^_</span>  specifies whether to use int8 quantization </li>\n<li><span translate=no>_^_2_^_</span>  is the device of the model </li>\n<li><span translate=no>_^_3_^_</span>  is the threshold <span translate=no>_^_4_^_</span> used to separate outlier features </li>\n<p><em>Returns</em>  the prepared layer</p></ul>\n": "<p> <a id=\"post_load_prepare\"></a></p>\n<h3>\u0db4\u0dd2\u0dbb\u0dd2\u0d9a\u0dca\u0dc3\u0dd4\u0db8\u0dca\u0dc3\u0dca\u0dae\u0dcf\u0db1\u0dba \u0db4\u0dd0\u0da7\u0dc0\u0dd3\u0db8\u0dd9\u0db1\u0dca \u0db4\u0dc3\u0dd4 \u0dc3\u0dca\u0dae\u0dbb \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba\u0db1\u0dca</h3>\n<p>\u0db8\u0dd9\u0db8\u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0db4\u0dd2\u0dbb\u0dd2\u0d9a\u0dca\u0dc3\u0dd4\u0db8\u0dca \u0dc3\u0dca\u0dae\u0dcf\u0db1\u0dba \u0db4\u0dd0\u0da7\u0dc0\u0dd3\u0db8\u0dd9\u0db1\u0dca \u0db4\u0dc3\u0dd4 \u0dc3\u0dca\u0dae\u0dbb \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba\u0db1\u0dca \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dbb\u0dba\u0dd2. </p>\n<p>\u0daf\u0dd0\u0db1\u0da7\u0d91\u0dba \u0d85\u0daf\u0dcf\u0dc5 \u0dc0\u0db1\u0dca\u0db1\u0dda INT8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0d9a\u0dbb\u0dab\u0dba \u0db4\u0db8\u0dab\u0dd2. </p>\n<ul><li><span translate=no>_^_0_^_</span> \u0dc3\u0d9a\u0dc3\u0dca \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4 \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0dba\u0dd2 </li>\n<li><span translate=no>_^_1_^_</span> INT8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0d9a\u0dbb\u0dab\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2 </li>\n<li><span translate=no>_^_2_^_</span> \u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba\u0dda \u0d8b\u0db4\u0dcf\u0d82\u0d9c\u0dba \u0dc0\u0dda </li>\n<li><span translate=no>_^_3_^_</span> \u0dba\u0db1\u0dd4 \u0db4\u0dd2\u0da7\u0dad \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0dc0\u0dd9\u0db1\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 <span translate=no>_^_4_^_</span> \u0dc3\u0db3\u0dc4\u0dcf \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1 \u0d91\u0dc5\u0dd2\u0db4\u0dad\u0dca\u0dad </li>\n<p>\u0dc3\u0d9a\u0dc3\u0dca\u0d9a\u0dc5 \u0dc3\u0dca\u0dad\u0dbb\u0dba<em>\u0d86\u0db4\u0dc3\u0dd4 \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2</em> </p></ul>\n",
 "<p> <span translate=no>_^_0_^_</span> </p>\n": "<p> <span translate=no>_^_0_^_</span> </p>\n",
 "<p> Code to load the checkpoint</p>\n": "<p> \u0db8\u0dd4\u0dbb\u0db4\u0ddc\u0dbd\u0db4\u0dd6\u0dbb\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d9a\u0dda\u0dad\u0dba</p>\n",
 "<p> Readout layer</p>\n": "<p> \u0d9a\u0dd2\u0dba\u0dc0\u0dd3\u0db8\u0dda\u0dc3\u0dca\u0dae\u0dbb\u0dba</p>\n",
 "<p><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> </p>\n": "<p><a href=\"https://github.com/HazyResearch/flash-attention\">\u0dc6\u0dca\u0dbd\u0dd1\u0dc2\u0dca \u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba</a></p>\n",
 "<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> </p>\n",
 "<p>Add RoPE embeddings </p>\n": "<p>\u0d9a\u0db9\u0dba\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0d91\u0d9a\u0dad\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Add head dimension </p>\n": "<p>\u0dc4\u0dd2\u0dc3\u0db8\u0dcf\u0db1\u0dba\u0d9a\u0dca \u0d91\u0d9a\u0dca \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Add them and the residual connection </p>\n": "<p>\u0d92\u0dc0\u0dcf\u0dc3\u0dc4 \u0d85\u0dc0\u0dc1\u0dda\u0dc2 \u0dc3\u0db8\u0dca\u0db6\u0db1\u0dca\u0db0\u0dad\u0dcf\u0dc0\u0dba \u0d91\u0d9a\u0dad\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Apply mask </p>\n": "<p>\u0dc0\u0dd9\u0dc3\u0dca\u0dba\u0ddc\u0daf\u0db1\u0dca\u0db1 </p>\n",
 "<p>Attention layer </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Attention output transform </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dba\u0ddc\u0db8\u0dd4 \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0daf\u0dcf\u0db1\u0dba \u0db4\u0dbb\u0dd2\u0dab\u0dcf\u0db8\u0db1\u0dba </p>\n",
 "<p>Attention query, key and value transform </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8, \u0dba\u0dad\u0dd4\u0dbb \u0dc3\u0dc4 \u0d85\u0d9c\u0dba \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 </p>\n",
 "<p>Attention scaling factor </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0db4\u0dbb\u0dd2\u0db8\u0dcf\u0dab \u0dc3\u0dcf\u0db0\u0d9a\u0dba </p>\n",
 "<p>Attention softmax </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dc3\u0ddc\u0dc6\u0dca\u0da7\u0dca\u0db8\u0dd0\u0d9a\u0dca\u0dc3\u0dca </p>\n",
 "<p>Attention softmax module </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0dba\u0ddc\u0db8\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1 \u0dc3\u0ddc\u0dc6\u0dca\u0da7\u0dca\u0db8\u0dd0\u0d9a\u0dca\u0dc3\u0dca \u0db8\u0ddc\u0da9\u0dd2\u0dba\u0dd4\u0dbd\u0dba </p>\n",
 "<p>Base for <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0dc3\u0db3\u0dc4\u0dcf\u0db8\u0dd6\u0dbd\u0dd2\u0d9a <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Cache <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba <span translate=no>_^_0_^_</span> \u0dc3\u0dc4 <span translate=no>_^_1_^_</span> </p>\n",
 "<p>Cache them </p>\n": "<p>\u0d92\u0dc0\u0dcf\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba </p>\n",
 "<p>Calculate <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> in fp32 </p>\n": "<p>\u0d9c\u0dab\u0db1\u0dba\u0d9a\u0dbb\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> \u0dc3\u0dc4 <span translate=no>_^_1_^_</span> fp32 </p>\n",
 "<p>Concatenate so that for row <span translate=no>_^_0_^_</span> we have</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u0db4\u0dda\u0dc5\u0dd2\u0dba\u0dc3\u0db3\u0dc4\u0dcf <span translate=no>_^_0_^_</span> \u0d85\u0db4\u0da7 \u0d87\u0dad\u0dd2 \u0db4\u0dbb\u0dd2\u0daf\u0dd2 \u0dc3\u0d82\u0dba\u0dd4\u0d9a\u0dca\u0dad \u0d9a\u0dbb\u0db1\u0dca\u0db1</p>\n<p><span translate=no>_^_1_^_</span> </p>\n",
 "<p>Concatenate the past </p>\n": "<p>\u0d85\u0dad\u0dd3\u0dad\u0dba\u0dc3\u0d82\u0dba\u0dd4\u0d9a\u0dca\u0dad \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Concatenate with features that didn&#x27;t get RoPE embeddings </p>\n": "<p>\u0d9a\u0db9\u0dba\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dbd\u0db6\u0dcf \u0db1\u0ddc\u0d9c\u0dad\u0dca \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0dc3\u0db8\u0d9f \u0dc3\u0d82\u0dba\u0dd4\u0d9a\u0dca\u0dad \u0dc0\u0db1\u0dca\u0db1 </p>\n",
 "<p>Contraction linear layer </p>\n": "<p>\u0dc3\u0d82\u0d9a\u0ddd\u0da0\u0db1\u0dba\u0dbb\u0dda\u0d9b\u0dd3\u0dba \u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Convert the linear layers </p>\n": "<p>\u0dbb\u0dda\u0d9b\u0dd3\u0dba\u0dc3\u0dca\u0dae\u0dbb \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Convert to fp32 if the current dtype is fp16 </p>\n": "<p>\u0dc0\u0dad\u0dca\u0db8\u0db1\u0dcadtype fp16 \u0db1\u0db8\u0dca fp32 \u0db6\u0dc0\u0da7 \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Create mask </p>\n": "<p>\u0dc0\u0dd9\u0dc3\u0dca\u0db8\u0dd4\u0dc4\u0dd4\u0dab \u0dc3\u0dcf\u0daf\u0db1\u0dca\u0db1 </p>\n",
 "<p>Disable auto-casting to fp16 for attention computation </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf fp16 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0dc3\u0dca\u0dc0\u0dba\u0d82\u0d9a\u0dca\u0dbb\u0dd3\u0dba-\u0dc0\u0dcf\u0dad\u0dca\u0dad\u0dd4 \u0d85\u0d9a\u0dca\u0dbb\u0dd3\u0dba </p>\n",
 "<p>Do not cast for bfloat </p>\n": "<p>bfloat\u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dcf\u0dad\u0dca\u0dad\u0dd4 \u0db1\u0ddc\u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Embedding layer </p>\n": "<p>\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Expansion linear layer </p>\n": "<p>\u0db4\u0dd4\u0dc5\u0dd4\u0dbd\u0dca\u0dbb\u0dda\u0d9b\u0dd3\u0dba \u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>FFN first transform </p>\n": "<p>FFN\u0db4\u0dc5\u0db8\u0dd4 \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba </p>\n",
 "<p>FFN layer </p>\n": "<p>FFN\u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>FFN second transform </p>\n": "<p>FFN\u0daf\u0dd9\u0dc0\u0db1 \u0db4\u0dbb\u0dd2\u0dab\u0dcf\u0db8\u0db1\u0dba </p>\n",
 "<p>Final linear layer </p>\n": "<p>\u0d85\u0dc0\u0dc3\u0dcf\u0db1\u0dbb\u0dda\u0d9b\u0dd3\u0dba \u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Final normalization layer </p>\n": "<p>\u0d85\u0dc0\u0dc3\u0dcf\u0db1\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab \u0dc3\u0dca\u0dad\u0dbb\u0dba </p>\n",
 "<p>GELU activation </p>\n": "<p>GELU\u0dc3\u0d9a\u0dca\u0dbb\u0dd2\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 </p>\n",
 "<p>Get attention weighted values </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0db6\u0dbb \u0dad\u0dd0\u0db6\u0dd6 \u0d85\u0d9c\u0dba\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 </p>\n",
 "<p>Get causal mask </p>\n": "<p>\u0dc4\u0dda\u0dad\u0dd4\u0dc0\u0dd9\u0dc3\u0dca\u0db8\u0dd4\u0dc4\u0dd4\u0dab \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 </p>\n",
 "<p>Get default values if not specified </p>\n": "<p>\u0db1\u0dd2\u0dba\u0db8\u0d9a\u0dbb \u0db1\u0ddc\u0db8\u0dd0\u0dad\u0dd2 \u0db1\u0db8\u0dca \u0db4\u0dd9\u0dbb\u0db1\u0dd2\u0db8\u0dd2 \u0d85\u0d9c\u0dba\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 </p>\n",
 "<p>Get position indexes <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0dc3\u0dca\u0dae\u0dcf\u0db1\u0daf\u0dbb\u0dca\u0dc1\u0d9a \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Get query, key and value embeddings (all concatenated). The last dimension size will change from n_hidden -&gt; <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8, \u0dba\u0dad\u0dd4\u0dbb \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8\u0dca \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 (\u0dc3\u0dd2\u0dba\u0dbd\u0dca\u0dbd \u0dc3\u0d82\u0dba\u0dd4\u0d9a\u0dca\u0dad \u0d9a\u0dbb \u0d87\u0dad). \u0db4\u0dc3\u0dd4\u0d9c\u0dd2\u0dba \u0db8\u0dcf\u0db1\u0dba\u0d9a\u0dca \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba n_hidden -> \u0dc3\u0dd2\u0da7 \u0dc0\u0dd9\u0db1\u0dc3\u0dca \u0dc0\u0db1\u0dd4 \u0d87\u0dad <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Get the actual sequence length </p>\n": "<p>\u0dc3\u0dad\u0dca\u0dba\u0d85\u0db1\u0dd4\u0d9a\u0dca\u0dbb\u0db8\u0dba\u0dda \u0daf\u0dd2\u0d9c \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 </p>\n",
 "<p>Get the past keys and values. These will have shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0d85\u0dad\u0dd3\u0dad\u0dba\u0dad\u0dd4\u0dbb\u0dd4 \u0dc3\u0dc4 \u0d85\u0d9c\u0dba\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1. \u0db8\u0dda\u0dc0\u0dcf\u0da7 \u0dc4\u0dd0\u0da9\u0dba \u0d87\u0dad <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Get the sin and cos values from the cache </p>\n": "<p>\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba\u0dc3\u0dd2\u0da7 \u0db4\u0dcf\u0db4\u0dba \u0dc3\u0dc4 \u0d9a\u0ddd\u0dc3\u0dca \u0d85\u0d9c\u0dba\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 </p>\n",
 "<p>Get the state id&#x27;s. We use to retrieve previous states and store the next states </p>\n": "<p>\u0dbb\u0dcf\u0da2\u0dca\u0dbaid \u0d9c\u0dda \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1. \u0d85\u0db4\u0dd2 \u0db4\u0dd9\u0dbb \u0dbb\u0dcf\u0da2\u0dca\u0dba\u0dba\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8\u0da7 \u0dc4\u0dcf \u0d89\u0daf\u0dd2\u0dbb\u0dd2 \u0dbb\u0dcf\u0da2\u0dca\u0dba\u0dba\u0db1\u0dca \u0d9c\u0db6\u0da9\u0dcf \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf </p>\n",
 "<p>If there&#x27;s cache </p>\n": "<p>\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba\u0dad\u0dd2\u0db6\u0dda \u0db1\u0db8\u0dca </p>\n",
 "<p>If we are caching the states of previous tokens </p>\n": "<p>\u0d85\u0db4\u0dd2\u0db4\u0dd9\u0dbb \u0da7\u0ddd\u0d9a\u0db1 \u0dc0\u0dbd \u0dad\u0dad\u0dca\u0dc0\u0dba\u0db1\u0dca \u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2 \u0d9a\u0dbb\u0db1\u0dca\u0db1\u0dda \u0db1\u0db8\u0dca </p>\n",
 "<p>Initialize <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0d86\u0dbb\u0db8\u0dca\u0db7\u0d9a\u0dbb\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Initialize <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> cache </p>\n": "<p>\u0d86\u0dbb\u0db8\u0dca\u0db7\u0d9a\u0dbb\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> \u0dc3\u0dc4 <span translate=no>_^_1_^_</span> \u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba </p>\n",
 "<p>Layer norm before FFN </p>\n": "<p>FFN\u0da7 \u0db4\u0dd9\u0dbb \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0db8\u0dca\u0db8\u0dad\u0dba </p>\n",
 "<p>Layer norm before attention </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0da7\u0db4\u0dd9\u0dbb \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0db8\u0dca\u0db8\u0dad\u0dba </p>\n",
 "<p>Layer normalization before FFN </p>\n": "<p>FFN\u0da7 \u0db4\u0dd9\u0dbb \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba </p>\n",
 "<p>Layer normalization before attention </p>\n": "<p>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba\u0da7\u0db4\u0dd9\u0dbb \u0dc3\u0dca\u0dae\u0dbb \u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba </p>\n",
 "<p>Linear layer for query, key and value </p>\n": "<p>\u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8, \u0dba\u0dad\u0dd4\u0dbb \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0dbb\u0dda\u0d9b\u0dd3\u0dba \u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>NeoX runs attention and feedforward network in parallel </p>\n": "<p>\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca\u0dc3\u0db8\u0dcf\u0db1\u0dca\u0dad\u0dbb\u0dc0 \u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0dc3\u0dc4 \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0db4\u0ddd\u0dc2\u0dab \u0da2\u0dcf\u0dbd\u0dba \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dbb\u0dba\u0dd2 </p>\n",
 "<p>No cache - simply add RoPE embeddings </p>\n": "<p>\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba\u0d9a\u0dca\u0db1\u0dd0\u0dad - \u0d9a\u0db9\u0dba \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0d91\u0d9a\u0dad\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Number of features for RoPE </p>\n": "<p>\u0d9a\u0db9\u0dba\u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0d9c\u0dab\u0db1 </p>\n",
 "<p>Number of features per head </p>\n": "<p>\u0dc4\u0dd2\u0dc3\u0d9a\u0da7\u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0d9c\u0dab\u0db1 </p>\n",
 "<p>Offset of the current embeddings </p>\n": "<p>\u0dc0\u0dad\u0dca\u0db8\u0db1\u0dca\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dc0\u0dbd \u0d95\u0dc6\u0dca\u0dc3\u0dd9\u0da7\u0dca </p>\n",
 "<p>Only convert the linear layers in the transformer layers </p>\n": "<p>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca\u0dc3\u0dca\u0dae\u0dbb \u0dc0\u0dbd \u0dbb\u0dda\u0d9b\u0dd3\u0dba \u0dc3\u0dca\u0dae\u0dbb \u0db4\u0db8\u0dab\u0d9a\u0dca \u0db4\u0dbb\u0dd2\u0dc0\u0dbb\u0dca\u0dad\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Otherwise, use normal attention </p>\n": "<p>\u0d91\u0dc3\u0dda \u0db1\u0ddc\u0db8\u0dd0\u0dad\u0dd2 \u0db1\u0db8\u0dca, \u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba \u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1\u0dca\u0db1</p>\n",
 "<p>Query and key lengths </p>\n": "<p>\u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8\u0dc3\u0dc4 \u0dba\u0dad\u0dd4\u0dbb\u0dd4 \u0daf\u0dd2\u0d9c </p>\n",
 "<p>Readout layer </p>\n": "<p>\u0d9a\u0dd2\u0dba\u0dc0\u0dd3\u0db8\u0dda\u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Reshape from <span translate=no>_^_0_^_</span><a href=\"batch_size, seq_len, n_hidden\">batch_size, seq_len, n_hidden</a>` </p>\n": "<p><span translate=no>_^_0_^_</span><a href=\"batch_size, seq_len, n_hidden\">Batch_size, seq_len, n_hidden</a>`\u0dc0\u0dd9\u0dad\u0dd2\u0db1\u0dca \u0db1\u0dd0\u0dc0\u0dad \u0dc3\u0d9a\u0dc3\u0dca \u0d9a\u0dbb\u0db1\u0dca\u0db1 </p>\n",
 "<p>Residual connection </p>\n": "<p>\u0d85\u0dc0\u0dc1\u0dda\u0dc2\u0dc3\u0db8\u0dca\u0db6\u0db1\u0dca\u0db0\u0dad\u0dcf\u0dc0\u0dba </p>\n",
 "<p>Return from cache </p>\n": "<p>\u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2\u0dba\u0dc3\u0dd2\u0da7 \u0d86\u0db4\u0dc3\u0dd4 </p>\n",
 "<p>RoPE embedding module </p>\n": "<p>\u0d9a\u0db9\u0dba\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8 \u0db8\u0ddc\u0da9\u0dd2\u0dba\u0dd4\u0dbd\u0dba </p>\n",
 "<p>RoPE embeddings</p>\n<span translate=no>_^_0_^_</span><p>for <span translate=no>_^_1_^_</span> </p>\n": "<p>\u0d9a\u0db9\u0dba\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca</p>\n<span translate=no>_^_0_^_</span><p>\u0dc3\u0db3\u0dc4\u0dcf <span translate=no>_^_1_^_</span> </p>\n",
 "<p>Save the current state </p>\n": "<p>\u0dc0\u0dad\u0dca\u0db8\u0db1\u0dca\u0dad\u0dad\u0dca\u0dc0\u0dba \u0dc3\u0dd4\u0dbb\u0d9a\u0dd2\u0db1\u0dca\u0db1 </p>\n",
 "<p>Scale attention </p>\n": "<p>\u0db4\u0dbb\u0dd2\u0db8\u0dcf\u0dab\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba </p>\n",
 "<p>Skip if not using int8 quantization </p>\n": "<p>INT8\u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0d9a\u0dbb\u0dab\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0db1\u0ddc\u0d9a\u0dbb\u0db1\u0dca\u0db1\u0dda \u0db1\u0db8\u0dca \u0db8\u0d9f \u0dc4\u0dbb\u0dd2\u0db1\u0dca\u0db1 </p>\n",
 "<p>Split into heads by changing the shape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0dc4\u0dd0\u0da9\u0dba\u0dc0\u0dd9\u0db1\u0dc3\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dd9\u0db1\u0dca \u0dc4\u0dd2\u0dc3\u0dca \u0dc0\u0dbd\u0da7 \u0db6\u0dd9\u0daf\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Split into query, key and value each of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8\u0da7\u0db6\u0dd9\u0daf\u0db1\u0dca\u0db1, \u0dba\u0dad\u0dd4\u0dbb \u0dc3\u0dc4 \u0dc4\u0dd0\u0da9\u0dba \u0d91\u0d9a\u0dca \u0d91\u0d9a\u0dca \u0d85\u0d9c\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> </p>\n",
 "<p>Split the features. We apply RoPE to only <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c\u0db6\u0dd9\u0daf\u0db1\u0dca\u0db1. <span translate=no>_^_0_^_</span> \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0dc3\u0db3\u0dc4\u0dcf \u0db4\u0db8\u0dab\u0d9a\u0dca \u0d85\u0db4\u0dd2 \u0d9a\u0db9\u0dba \u0dba\u0ddc\u0daf\u0db1\u0dca\u0db1\u0dd9\u0db8\u0dd4 </p>\n",
 "<p>Stack them into shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0d92\u0dc0\u0dcf \u0dc4\u0dd0\u0da9\u0dba\u0da7 \u0d9c\u0ddc\u0da9\u0d9c\u0dc3\u0db1\u0dca\u0db1<span translate=no>_^_0_^_</span></p>\n",
 "<p>The output is of shape <span translate=no>_^_0_^_</span> </p>\n": "<p>\u0db4\u0dca\u0dbb\u0dad\u0dd2\u0daf\u0dcf\u0db1\u0dba \u0dc4\u0dd0\u0da9\u0dba\u0dd9\u0db1\u0dca \u0dba\u0dd4\u0d9a\u0dca\u0dad \u0dc0\u0dda<span translate=no>_^_0_^_</span></p>\n",
 "<p>To cache causal mask </p>\n": "<p>\u0dc4\u0dda\u0dad\u0dd4\u0d86\u0dc0\u0dbb\u0dab \u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 </p>\n",
 "<p>To store <span translate=no>_^_0_^_</span> for the features </p>\n": "<p>\u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c <span translate=no>_^_0_^_</span> \u0dc3\u0db3\u0dc4\u0dcf \u0d9c\u0db6\u0da9\u0dcf \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 </p>\n",
 "<p>Transformer layer </p>\n": "<p>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca\u0dc3\u0dca\u0dae\u0dbb\u0dba </p>\n",
 "<p>Transformer layers </p>\n": "<p>\u0da7\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dc3\u0dca\u0dc6\u0ddd\u0db8\u0dbb\u0dca\u0dc3\u0dca\u0dae\u0dbb </p>\n",
 "<p>Use <span translate=no>_^_0_^_</span> defined in <a href=\"./utils/llm_int8.html\">utilities</a>. </p>\n": "<p><a href=\"./utils/llm_int8.html\">\u0d8b\u0db4\u0dba\u0ddd\u0d9c\u0dd2\u0dad\u0dcf</a>\u0dc0\u0dbd <span translate=no>_^_0_^_</span> \u0d85\u0dbb\u0dca\u0dae \u0daf\u0d9a\u0dca\u0dc0\u0dcf \u0d87\u0dad\u0dd2 \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1\u0dca\u0db1. </p>\n",
 "<p>Use flash attention </p>\n": "<p>\u0dc6\u0dca\u0dbd\u0dd1\u0dc2\u0dca \u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1\u0dca\u0db1</p>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  are the embeddings of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0dc4\u0dd0\u0da9\u0dba\u0dda \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dc0\u0dda <span translate=no>_^_1_^_</span></li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  are the token ids of shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0dc4\u0dd0\u0da9\u0dba\u0dda \u0da7\u0ddd\u0d9a\u0db1\u0dca \u0dc4\u0dd0\u0db3\u0dd4\u0db1\u0dd4\u0db8\u0dca \u0dc0\u0dda <span translate=no>_^_1_^_</span></li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  has shape <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span>  is the starting position of <span translate=no>_^_3_^_</span>. This is <span translate=no>_^_4_^_</span> when we have cached the keys and queries of previous positions</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0dc4\u0dd0\u0da9\u0dba \u0d87\u0dad <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> \u0dba\u0db1\u0dd4 \u0d86\u0dbb\u0db8\u0dca\u0db7\u0d9a \u0dc3\u0dca\u0dae\u0dcf\u0db1\u0dba\u0dba\u0dd2 <span translate=no>_^_3_^_</span>. \u0db4\u0dd9\u0dbb \u0dad\u0db1\u0dad\u0dd4\u0dbb\u0dd4 \u0dc0\u0dbd \u0dba\u0dad\u0dd4\u0dbb\u0dd4 \u0dc3\u0dc4 \u0dc0\u0dd2\u0db8\u0dc3\u0dd4\u0db8\u0dca \u0d85\u0db4 \u0dc4\u0dd0\u0db9\u0dd2\u0dbd\u0dd2 \u0d9a\u0dc5 <span translate=no>_^_4_^_</span> \u0dc0\u0dd2\u0da7 \u0db8\u0dd9\u0dba \u0dc3\u0dd2\u0daf\u0dd4 \u0dc0\u0dda</li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  has shape <span translate=no>_^_1_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0dc4\u0dd0\u0da9\u0dba \u0d87\u0dad <span translate=no>_^_1_^_</span></li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  is the embedding size </li>\n<li><span translate=no>_^_1_^_</span>  is the number of heads </li>\n<li><span translate=no>_^_2_^_</span>  specifies whether to use  <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n<p><em>Out implementation doesn&#x27;t include dropout</em>.</p>\n": "<ul><li><span translate=no>_^_0_^_</span>\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba \u0dc0\u0dda</li>\n<li><span translate=no>_^_1_^_</span>\u0dc4\u0dd2\u0dc3\u0dca \u0dc3\u0d82\u0d9b\u0dca\u0dba\u0dcf\u0dc0 \u0dc0\u0dda</li>\n<li><span translate=no>_^_2_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2</li></ul>\n<p><em>\u0db4\u0dd2\u0da7\u0dad \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0d85\u0dad\u0dc4\u0dd0\u0dbb \u0daf\u0dd0\u0db8\u0dd3\u0db8 \u0d87\u0dad\u0dd4\u0dc5\u0dad\u0dca \u0db1\u0ddc\u0dc0\u0dda</em>.</p>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  is the embedding size </li>\n<li><span translate=no>_^_1_^_</span>  is the size of the vocabulary</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba \u0dc0\u0dda </li>\n<li><span translate=no>_^_1_^_</span> \u0dc0\u0da0\u0db1 \u0db8\u0dcf\u0dbd\u0dcf\u0dc0\u0dda \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba \u0dc0\u0dda</li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  is the embedding size</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8 \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba \u0dc0\u0dda</li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  is the number of features for RoPE embeddings </li>\n<li><span translate=no>_^_1_^_</span>  is the base for <span translate=no>_^_2_^_</span>, which defaults to <span translate=no>_^_3_^_</span></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0d9a\u0db9\u0dba \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0d9c\u0dab\u0db1 </li>\n<li><span translate=no>_^_1_^_</span> \u0dc3\u0db3\u0dc4\u0dcf \u0db4\u0daf\u0db1\u0db8 \u0dc0\u0dda <span translate=no>_^_2_^_</span>, \u0d91\u0dba \u0db4\u0dd0\u0dc4\u0dd0\u0dbb \u0dc4\u0dbb\u0dd2\u0db1 <span translate=no>_^_3_^_</span></li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  is the size of the vocabulary </li>\n<li><span translate=no>_^_1_^_</span>  is the size of the embeddings</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span> \u0dc0\u0da0\u0db1 \u0db8\u0dcf\u0dbd\u0dcf\u0dc0\u0dda \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba \u0dc0\u0dda </li>\n<li><span translate=no>_^_1_^_</span> \u0db8\u0dd9\u0db8 \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0db4\u0dca\u0dbb\u0db8\u0dcf\u0dab\u0dba</li></ul>\n",
 "<ul><li><span translate=no>_^_0_^_</span>  the number of features in embeddings </li>\n<li><span translate=no>_^_1_^_</span>  the number of attention heads </li>\n<li><span translate=no>_^_2_^_</span>  percentage of features to add RoPE embeddings </li>\n<li><span translate=no>_^_3_^_</span>  masking fill value for attention matrix </li>\n<li><span translate=no>_^_4_^_</span>  specifies whether to use  <a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a></li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0dc0\u0dbd \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0d9c\u0dab\u0db1</li>\n<li><span translate=no>_^_1_^_</span>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0dba\u0ddc\u0db8\u0dd4 \u0db4\u0dca\u0dbb\u0db0\u0dcf\u0db1\u0dd3\u0db1\u0dca \u0dc3\u0d82\u0d9b\u0dca\u0dba\u0dcf\u0dc0</li>\n<li><span translate=no>_^_2_^_</span>\u0d9a\u0db9\u0dba \u0d9a\u0dcf\u0dc0\u0dd0\u0daf\u0dca\u0daf\u0dd3\u0db8\u0dca \u0d91\u0d9a\u0dad\u0dd4 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0dc1\u0dad\u0dba</li>\n<li><span translate=no>_^_3_^_</span>\u0d85\u0dc0\u0db0\u0dcf\u0db1\u0dba \u0dba\u0ddc\u0db8\u0dd4 \u0db1\u0dca\u0dba\u0dcf\u0dc3\u0dba \u0dc3\u0db3\u0dc4\u0dcf \u0d86\u0dc0\u0dbb\u0dab \u0db4\u0dd2\u0dbb\u0dc0\u0dd4\u0db8\u0dca \u0d85\u0d9c\u0dba</li>\n<li><span translate=no>_^_4_^_</span><a href=\"https://github.com/HazyResearch/flash-attention\">FlashAttention</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dc5 \u0dba\u0dd4\u0dad\u0dd4\u0daf \u0dba\u0db1\u0dca\u0db1 \u0db1\u0dd2\u0dba\u0db8 \u0d9a\u0dbb\u0dba\u0dd2</li></ul>\n",
 "GPT-NeoX Model Definition": "\u0da2\u0dd3\u0db4\u0dd3\u0da7\u0dd3-\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca \u0d86\u0daf\u0dbb\u0dca\u0dc1 \u0d85\u0dbb\u0dca\u0dae \u0daf\u0dd0\u0d9a\u0dca\u0dc0\u0dd3\u0db8",
 "This is the model definition of GPT-NeoX.": "\u0da2\u0dd3\u0db4\u0dd3\u0da7\u0dd3-\u0db1\u0dd2\u0dba\u0ddd\u0d9a\u0dca\u0dc3\u0dca \u0dc4\u0dd2 \u0d86\u0daf\u0dbb\u0dca\u0dc1 \u0d85\u0dbb\u0dca\u0dae \u0daf\u0dd0\u0d9a\u0dca\u0dc0\u0dd3\u0db8 \u0db8\u0dd9\u0dba\u0dba\u0dd2."
}