Path: blob/master/translate_cache/rl/ppo/experiment.zh.json
{
    "<h1>PPO Experiment with Atari Breakout</h1>\n<p>This experiment trains Proximal Policy Optimization (PPO) agent Atari Breakout game on OpenAI Gym. It runs the <a href=\"../game.html\">game environments on multiple processes</a> to sample efficiently.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>使用 Atari Breakout 的 PPO 实验</h1>\n<p>该实验在 OpenAI Gym 的 Atari Breakout 游戏上训练近端策略优化（PPO）智能体。它在<a href=\"../game.html\">多个进程上运行游戏环境</a>以高效采样。</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
    "<h2>Model</h2>\n": "<h2>模型</h2>\n",
    "<h2>Run it</h2>\n": "<h2>运行它</h2>\n",
    "<h2>Trainer</h2>\n": "<h2>训练器</h2>\n",
    "<h3>Calculate total loss</h3>\n": "<h3>计算总损失</h3>\n",
    "<h3>Destroy</h3>\n<p>Stop the workers</p>\n": "<h3>销毁</h3>\n<p>停止工作进程</p>\n",
    "<h3>Run training loop</h3>\n": "<h3>运行训练循环</h3>\n",
    "<h3>Sample data with current policy</h3>\n": "<h3>使用当前策略采样数据</h3>\n",
    "<h3>Train the model based on samples</h3>\n": "<h3>根据样本训练模型</h3>\n",
    "<h4>Configurations</h4>\n": "<h4>配置</h4>\n",
    "<h4>Initialize</h4>\n": "<h4>初始化</h4>\n",
    "<h4>Normalize advantage function</h4>\n": "<h4>归一化优势函数</h4>\n",
    "<p> </p>\n": "<p></p>\n",
    "<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
    "<p><span translate=no>_^_0_^_</span> keeps track of the last observation from each worker, which is the input for the model to sample the next action </p>\n": "<p><span translate=no>_^_0_^_</span> 跟踪来自每个 worker 的最后一个观测值，它是模型采样下一个动作的输入</p>\n",
    "<p><span translate=no>_^_0_^_</span> returns sampled from <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> 是从 <span translate=no>_^_1_^_</span> 中采样得到的回报</p>\n",
    "<p><span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> are actions sampled from <span translate=no>_^_2_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>、<span translate=no>_^_1_^_</span> 是从 <span translate=no>_^_2_^_</span> 中采样的动作</p>\n",
    "<p><span translate=no>_^_0_^_</span>, where <span translate=no>_^_1_^_</span> is advantages sampled from <span translate=no>_^_2_^_</span>. Refer to sampling function in <a href=\"#main\">Main class</a> below for the calculation of <span translate=no>_^_3_^_</span>. \n</p>\n": "<p><span translate=no>_^_0_^_</span>，其中 <span translate=no>_^_1_^_</span> 是从 <span translate=no>_^_2_^_</span> 中采样的优势。<span translate=no>_^_3_^_</span> 的计算请参阅下面 <a href=\"#main\">Main 类</a>中的采样函数。</p>\n",
    "<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs 512 features </p>\n": "<p>一个全连接层接收来自第三个卷积层的展平后的帧，并输出 512 个特征</p>\n",
    "<p>A fully connected layer to get logits for <span translate=no>_^_0_^_</span> </p>\n": "<p>一个全连接层，用于获取 <span translate=no>_^_0_^_</span> 的 logits</p>\n",
    "<p>A fully connected layer to get value function </p>\n": "<p>一个全连接层，用于获取价值函数</p>\n",
    "<p>Add a new line to the screen periodically </p>\n": "<p>定期在屏幕上添加新行</p>\n",
    "<p>Add to tracker </p>\n": "<p>添加到追踪器</p>\n",
    "<p>Calculate Entropy Bonus</p>\n<p><span translate=no>_^_0_^_</span> </p>\n": "<p>计算熵加成</p>\n<p><span translate=no>_^_0_^_</span></p>\n",
    "<p>Calculate gradients </p>\n": "<p>计算梯度</p>\n",
    "<p>Calculate policy loss </p>\n": "<p>计算策略损失</p>\n",
    "<p>Calculate value function loss </p>\n": "<p>计算价值函数损失</p>\n",
    "<p>Clip gradients </p>\n": "<p>裁剪梯度</p>\n",
    "<p>Clipping range </p>\n": "<p>裁剪范围</p>\n",
    "<p>Configurations </p>\n": "<p>配置</p>\n",
    "<p>Create the experiment </p>\n": "<p>创建实验</p>\n",
    "<p>Entropy bonus coefficient </p>\n": "<p>熵加成系数</p>\n",
    "<p>GAE with <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>使用 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span> 的 GAE</p>\n",
    "<p>Get value of after the final step </p>\n": "<p>获取最后一步之后的值</p>\n",
    "<p>Initialize the trainer </p>\n": "<p>初始化训练器</p>\n",
    "<p>It learns faster with a higher number of epochs, but becomes a little unstable; that is, the average episode reward does not monotonically increase over time. May be reducing the clipping range might solve it. \n</p>\n": "<p>训练轮数越多，学习越快，但会变得有些不稳定；也就是说，平均回合奖励不会随时间单调增加。减小裁剪范围或许能解决这个问题。</p>\n",
    "<p>Learning rate </p>\n": "<p>学习率</p>\n",
    "<p>Number of mini batches </p>\n": "<p>小批量数量</p>\n",
    "<p>Number of steps to run on each process for a single update </p>\n": "<p>单次更新中每个进程要运行的步数</p>\n",
    "<p>Number of updates </p>\n": "<p>更新次数</p>\n",
    "<p>Number of worker processes </p>\n": "<p>工作进程数</p>\n",
    "<p>PPO Loss </p>\n": "<p>PPO 损失</p>\n",
    "<p>Run and monitor the experiment </p>\n": "<p>运行并监控实验</p>\n",
    "<p>Sampled observations are fed into the model to get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span>; we are treating observations as state </p>\n": "<p>采样的观测值被输入到模型中以获取 <span translate=no>_^_0_^_</span> 和 <span translate=no>_^_1_^_</span>；我们将观测值视为状态</p>\n",
    "<p>Save tracked indicators. </p>\n": "<p>保存跟踪的指标。</p>\n",
    "<p>Scale observations from <span translate=no>_^_0_^_</span> to <span translate=no>_^_1_^_</span> </p>\n": "<p>将观测值从 <span translate=no>_^_0_^_</span> 缩放到 <span translate=no>_^_1_^_</span></p>\n",
    "<p>Select device </p>\n": "<p>选择设备</p>\n",
    "<p>Set learning rate </p>\n": "<p>设置学习率</p>\n",
    "<p>Stop the workers </p>\n": "<p>停止工作进程</p>\n",
    "<p>The first convolution layer takes a 84x84 frame and produces a 20x20 frame </p>\n": "<p>第一个卷积层接收 84x84 的帧并生成 20x20 的帧</p>\n",
    "<p>The second convolution layer takes a 20x20 frame and produces a 9x9 frame </p>\n": "<p>第二个卷积层接收 20x20 的帧并生成 9x9 的帧</p>\n",
    "<p>The third convolution layer takes a 9x9 frame and produces a 7x7 frame </p>\n": "<p>第三个卷积层接收 9x9 的帧并生成 7x7 的帧</p>\n",
    "<p>Update parameters based on gradients </p>\n": "<p>根据梯度更新参数</p>\n",
    "<p>Value Loss </p>\n": "<p>价值损失</p>\n",
    "<p>Value loss coefficient </p>\n": "<p>价值损失系数</p>\n",
    "<p>You can change this while the experiment is running. ⚙️ Learning rate. </p>\n": "<p>你可以在实验运行时更改此设置。⚙️ 学习率。</p>\n",
    "<p>Zero out the previously calculated gradients </p>\n": "<p>将先前计算的梯度归零</p>\n",
    "<p>calculate advantages </p>\n": "<p>计算优势</p>\n",
    "<p>collect episode info, which is available if an episode finished; this includes total reward and length of the episode - look at <span translate=no>_^_0_^_</span> to see how it works. \n</p>\n": "<p>收集回合信息，该信息在回合结束后可用；包括回合的总奖励和长度。查看 <span translate=no>_^_0_^_</span> 了解它是如何工作的。</p>\n",
    "<p>create workers </p>\n": "<p>创建工作进程</p>\n",
    "<p>for each mini batch </p>\n": "<p>对每个小批量</p>\n",
    "<p>for monitoring </p>\n": "<p>用于监控</p>\n",
    "<p>get mini batch </p>\n": "<p>获取小批量</p>\n",
    "<p>get results after executing the actions </p>\n": "<p>执行动作后获取结果</p>\n",
    "<p>initialize tensors for observations </p>\n": "<p>初始化观测值的张量</p>\n",
    "<p>last 100 episode information </p>\n": "<p>最近 100 个回合的信息</p>\n",
    "<p>model </p>\n": "<p>模型</p>\n",
    "<p>number of epochs to train the model with sampled data </p>\n": "<p>使用采样数据训练模型的轮数</p>\n",
    "<p>number of mini batches </p>\n": "<p>小批量数量</p>\n",
    "<p>number of steps to run on each process for a single update </p>\n": "<p>单次更新中每个进程要运行的步数</p>\n",
    "<p>number of updates </p>\n": "<p>更新次数</p>\n",
    "<p>number of worker processes </p>\n": "<p>工作进程的数量</p>\n",
    "<p>optimizer </p>\n": "<p>优化器</p>\n",
    "<p>run sampled actions on each worker </p>\n": "<p>在每个 worker 上运行采样的动作</p>\n",
    "<p>sample <span translate=no>_^_0_^_</span> from each worker </p>\n": "<p>从每个 worker 采样 <span translate=no>_^_0_^_</span></p>\n",
    "<p>sample actions from <span translate=no>_^_0_^_</span> for each worker; this returns arrays of size <span translate=no>_^_1_^_</span> </p>\n": "<p>为每个 worker 从 <span translate=no>_^_0_^_</span> 中采样动作；这会返回大小为 <span translate=no>_^_1_^_</span> 的数组</p>\n",
    "<p>sample with current policy </p>\n": "<p>使用当前策略采样</p>\n",
    "<p>samples are currently in <span translate=no>_^_0_^_</span> table, we should flatten it for training </p>\n": "<p>样本目前存放在 <span translate=no>_^_0_^_</span> 表中，我们应将其展平以便训练</p>\n",
    "<p>shuffle for each epoch </p>\n": "<p>每个 epoch 都打乱顺序</p>\n",
    "<p>size of a mini batch </p>\n": "<p>小批量的大小</p>\n",
    "<p>total number of samples for a single update </p>\n": "<p>单次更新的样本总数</p>\n",
    "<p>train </p>\n": "<p>训练</p>\n",
    "<p>train the model </p>\n": "<p>训练模型</p>\n",
    "<p>⚙️ Clip range. </p>\n": "<p>⚙️ 裁剪范围。</p>\n",
    "<p>⚙️ Entropy bonus coefficient. You can change this while the experiment is running. </p>\n": "<p>⚙️ 熵加成系数。你可以在实验运行时更改此设置。</p>\n",
    "<p>⚙️ Number of epochs to train the model with sampled data. You can change this while the experiment is running. \n</p>\n": "<p>⚙️ 使用采样数据训练模型的轮数。你可以在实验运行时更改此设置。</p>\n",
    "<p>⚙️ Value loss coefficient. You can change this while the experiment is running. </p>\n": "<p>⚙️ 价值损失系数。你可以在实验运行时更改此设置。</p>\n",
    "Annotated implementation to train a PPO agent on Atari Breakout game.": "带注释的实现，用于在 Atari Breakout 游戏上训练 PPO 智能体。",
    "PPO Experiment with Atari Breakout": "使用 Atari Breakout 的 PPO 实验"
}