Path: blob/master/translate_cache/rl/dqn/experiment.zh.json
{1"<h1>DQN Experiment with Atari Breakout</h1>\n<p>This experiment trains a Deep Q Network (DQN) to play Atari Breakout game on OpenAI Gym. It runs the <a href=\"../game.html\">game environments on multiple processes</a> to sample efficiently.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u4f7f\u7528 Atari Breakout \u8fdb\u884c DQN \u5b9e\u9a8c</h1>\n<p>\u8be5\u5b9e\u9a8c\u8bad\u7ec3 Deep Q Network (DQN) \u5728 OpenAI Gym \u4e0a\u73a9 Atari Breakout \u6e38\u620f\u3002\u5b83\u5728<a href=\"../game.html\">\u591a\u4e2a\u8fdb\u7a0b\u4e0a\u8fd0\u884c\u6e38\u620f\u73af\u5883</a>\u4ee5\u9ad8\u6548\u91c7\u6837\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",2"<h2>Run it</h2>\n": "<h2>\u8fd0\u884c\u5b83</h2>\n",3"<h2>Trainer</h2>\n": "<h2>\u8bad\u7ec3\u5e08</h2>\n",4"<h3>Destroy</h3>\n<p>Stop the workers</p>\n": "<h3>\u6467\u6bc1</h3>\n<p>\u963b\u6b62\u5de5\u4eba</p>\n",5"<h3>Run training loop</h3>\n": "<h3>\u8dd1\u6b65\u8bad\u7ec3\u5faa\u73af</h3>\n",6"<h3>Sample data</h3>\n": "<h3>\u6837\u672c\u6570\u636e</h3>\n",7"<h3>Train the model</h3>\n": "<h3>\u8bad\u7ec3\u6a21\u578b</h3>\n",8"<h4><span translate=no>_^_0_^_</span>-greedy Sampling</h4>\n<p>When sampling actions we use a <span translate=no>_^_1_^_</span>-greedy strategy, where we take a greedy action with probabiliy <span translate=no>_^_2_^_</span> and take a random action with probability <span translate=no>_^_3_^_</span>. We refer to <span translate=no>_^_4_^_</span> as <span translate=no>_^_5_^_</span>.</p>\n": "<h4><span translate=no>_^_0_^_</span>-\u8d2a\u5a6a\u91c7\u6837</h4>\n\u5728\u5bf9@@ <p>\u52a8\u4f5c\u8fdb\u884c\u62bd\u6837\u65f6\uff0c\u6211\u4eec\u4f7f\u7528<span translate=no>_^_1_^_</span>-greedy\u7b56\u7565\uff0c\u5176\u4e2d\u6211\u4eec\u91c7\u53d6\u6982\u7387\u7684\u8d2a\u5a6a\u52a8\u4f5c\uff0c<span translate=no>_^_2_^_</span>\u5e76\u968f\u673a\u91c7\u53d6\u6982\u7387\u52a8\u4f5c<span translate=no>_^_3_^_</span>\u3002\u6211\u4eec\u79f0\u4e4b<span translate=no>_^_4_^_</span>\u4e3a<span translate=no>_^_5_^_</span>\u3002</p>\n",9"<p><span translate=no>_^_0_^_</span> for prioritized replay </p>\n": "<p><span translate=no>_^_0_^_</span>\u7528\u4e8e\u4f18\u5148\u91cd\u64ad</p>\n",10"<p><span translate=no>_^_0_^_</span> for replay buffer as a function of updates </p>\n": "<p><span translate=no>_^_0_^_</span>\u4f5c\u4e3a\u66f4\u65b0\u51fd\u6570\u7684\u91cd\u64ad\u7f13\u51b2\u533a</p>\n",11"<p><span translate=no>_^_0_^_</span>, exploration fraction </p>\n": "<p><span translate=no>_^_0_^_</span>\uff0c\u52d8\u63a2\u5206\u6570</p>\n",12"<p>Add a new line to the screen periodically </p>\n": "<p>\u5b9a\u671f\u5728\u5c4f\u5e55\u4e0a\u6dfb\u52a0\u65b0\u884c</p>\n",13"<p>Add transition to replay buffer </p>\n": "<p>\u5c06\u8fc7\u6e21\u6dfb\u52a0\u5230\u91cd\u64ad\u7f13\u51b2\u533a</p>\n",14"<p>Calculate gradients </p>\n": "<p>\u8ba1\u7b97\u68af\u5ea6</p>\n",15"<p>Calculate priorities for replay buffer <span translate=no>_^_0_^_</span> </p>\n": "<p>\u8ba1\u7b97\u91cd\u64ad\u7f13\u51b2\u533a\u7684\u4f18\u5148\u7ea7<span translate=no>_^_0_^_</span></p>\n",16"<p>Clip gradients </p>\n": "<p>\u526a\u8f91\u6e10\u53d8</p>\n",17"<p>Collect information from each worker </p>\n": 
"<p>\u6536\u96c6\u6bcf\u4f4d\u5458\u5de5\u7684\u4fe1\u606f</p>\n",18"<p>Compute Temporal Difference (TD) errors, <span translate=no>_^_0_^_</span>, and the loss, <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u8ba1\u7b97\u65f6\u5dee (TD) \u8bef\u5dee\u548c\u635f\u5931<span translate=no>_^_1_^_</span>\u3002<span translate=no>_^_0_^_</span></p>\n",19"<p>Configurations </p>\n": "<p>\u914d\u7f6e</p>\n",20"<p>Copy to target network initially </p>\n": "<p>\u6700\u521d\u590d\u5236\u5230\u76ee\u6807\u7f51\u7edc</p>\n",21"<p>Create the experiment </p>\n": "<p>\u521b\u5efa\u5b9e\u9a8c</p>\n",22"<p>Get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5f97\u5230<span translate=no>_^_0_^_</span></p>\n",23"<p>Get Q_values for the current observation </p>\n": "<p>\u83b7\u53d6\u5f53\u524d\u89c2\u6d4b\u503c\u7684 Q_Values</p>\n",24"<p>Get results after executing the actions </p>\n": "<p>\u6267\u884c\u64cd\u4f5c\u540e\u83b7\u53d6\u7ed3\u679c</p>\n",25"<p>Get the Q-values of the next state for <a href=\"index.html\">Double Q-learning</a>. Gradients shouldn't propagate for these </p>\n": "<p>\u83b7\u53d6 \u201c<a href=\"index.html\">\u53cc Q \u5b66\u4e60\u201d \u7684\u4e0b\u4e00\u4e2a\u72b6\u6001\u7684 Q</a> \u503c\u3002\u68af\u5ea6\u4e0d\u5e94\u8be5\u4e3a\u8fd9\u4e9b\u4f20\u64ad</p>\n",26"<p>Get the predicted Q-value </p>\n": "<p>\u83b7\u53d6\u9884\u6d4b\u7684 Q \u503c</p>\n",27"<p>Initialize the trainer </p>\n": "<p>\u521d\u59cb\u5316\u8bad\u7ec3\u5668</p>\n",28"<p>Last 100 episode information </p>\n": "<p>\u6700\u8fd1 100 \u96c6\u4fe1\u606f</p>\n",29"<p>Learning rate. </p>\n": "<p>\u5b66\u4e60\u7387\u3002</p>\n",30"<p>Mini batch size </p>\n": "<p>\u5c0f\u6279\u91cf</p>\n",31"<p>Model for sampling and training </p>\n": "<p>\u91c7\u6837\u548c\u8bad\u7ec3\u6a21\u578b</p>\n",32"<p>Number of epochs to train the model with sampled data. </p>\n": "\u4f7f\u7528@@ <p>\u91c7\u6837\u6570\u636e\u8bad\u7ec3\u6a21\u578b\u7684\u5468\u671f\u6570\u3002</p>\n",33"<p>Number of steps to run on each process for a single update </p>\n": "<p>\u5355\u6b21\u66f4\u65b0\u7684\u6bcf\u4e2a\u8fdb\u7a0b\u8981\u8fd0\u884c\u7684\u6b65\u9aa4\u6570</p>\n",34"<p>Number of updates </p>\n": "<p>\u66f4\u65b0\u6b21\u6570</p>\n",35"<p>Number of worker processes </p>\n": "<p>\u5de5\u4f5c\u8fdb\u7a0b\u6570</p>\n",36"<p>Periodically update target network </p>\n": "<p>\u5b9a\u671f\u66f4\u65b0\u76ee\u6807\u7f51\u7edc</p>\n",37"<p>Pick the action based on <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6839\u636e\u4ee5\u4e0b\u5185\u5bb9\u9009\u62e9\u64cd\u4f5c<span translate=no>_^_0_^_</span></p>\n",38"<p>Replay buffer with <span translate=no>_^_0_^_</span>. Capacity of the replay buffer must be a power of 2. </p>\n": "\u4f7f\u7528@@ <p>\u91cd\u64ad\u7f13\u51b2\u533a<span translate=no>_^_0_^_</span>\u3002\u91cd\u64ad\u7f13\u51b2\u533a\u7684\u5bb9\u91cf\u5fc5\u987b\u662f 2 \u7684\u5e42\u3002</p>\n",39"<p>Run and monitor the experiment </p>\n": "<p>\u8fd0\u884c\u5e76\u76d1\u63a7\u5b9e\u9a8c</p>\n",40"<p>Run sampled actions on each worker </p>\n": "<p>\u5bf9\u6bcf\u4e2a\u5de5\u4f5c\u5668\u8fd0\u884c\u91c7\u6837\u64cd\u4f5c</p>\n",41"<p>Sample <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6837\u672c<span translate=no>_^_0_^_</span></p>\n",42"<p>Sample actions </p>\n": "<p>\u64cd\u4f5c\u793a\u4f8b</p>\n",43"<p>Sample from priority replay buffer </p>\n": "<p>\u6765\u81ea\u4f18\u5148\u7ea7\u91cd\u64ad\u7f13\u51b2\u533a\u7684\u6837\u672c</p>\n",44"<p>Sample the action with highest Q-value. This is the greedy action. 
</p>\n": "<p>\u91c7\u6837\u5177\u6709\u6700\u9ad8 Q \u503c\u7684\u52a8\u4f5c\u3002\u8fd9\u662f\u8d2a\u5a6a\u7684\u884c\u52a8\u3002</p>\n",45"<p>Sample with current policy </p>\n": "<p>\u5f53\u524d\u653f\u7b56\u7684\u793a\u4f8b</p>\n",46"<p>Sampling doesn't need gradients </p>\n": "<p>\u91c7\u6837\u4e0d\u9700\u8981\u6e10\u53d8</p>\n",47"<p>Save tracked indicators. </p>\n": "<p>\u4fdd\u5b58\u8ddf\u8e2a\u7684\u6307\u6807\u3002</p>\n",48"<p>Scale observations from <span translate=no>_^_0_^_</span> to <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5c06\u89c2\u6d4b\u503c\u4ece\u7f29\u653e<span translate=no>_^_0_^_</span>\u5230<span translate=no>_^_1_^_</span></p>\n",49"<p>Select device </p>\n": "<p>\u9009\u62e9\u8bbe\u5907</p>\n",50"<p>Set learning rate </p>\n": "<p>\u8bbe\u7f6e\u5b66\u4e60\u901f\u7387</p>\n",51"<p>Start training after the buffer is full </p>\n": "<p>\u7f13\u51b2\u533a\u6ee1\u540e\u5f00\u59cb\u8bad\u7ec3</p>\n",52"<p>Stop the workers </p>\n": "<p>\u963b\u6b62\u5de5\u4eba</p>\n",53"<p>Target model updating interval </p>\n": "<p>\u76ee\u6807\u6a21\u578b\u66f4\u65b0\u95f4\u9694</p>\n",54"<p>This doesn't need gradients </p>\n": "<p>\u8fd9\u4e0d\u9700\u8981\u6e10\u53d8</p>\n",55"<p>Train the model </p>\n": "<p>\u8bad\u7ec3\u6a21\u578b</p>\n",56"<p>Uniformly sample and action </p>\n": "<p>\u7edf\u4e00\u91c7\u6837\u548c\u884c\u52a8</p>\n",57"<p>Update parameters based on gradients </p>\n": "<p>\u6839\u636e\u6e10\u53d8\u66f4\u65b0\u53c2\u6570</p>\n",58"<p>Update replay buffer priorities </p>\n": "<p>\u66f4\u65b0\u91cd\u64ad\u7f13\u51b2\u533a\u4f18\u5148\u7ea7</p>\n",59"<p>Whether to chose greedy action or the random action </p>\n": "<p>\u9009\u62e9\u8d2a\u5a6a\u52a8\u4f5c\u8fd8\u662f\u968f\u673a\u52a8\u4f5c</p>\n",60"<p>Zero out the previously calculated gradients </p>\n": "<p>\u5c06\u5148\u524d\u8ba1\u7b97\u7684\u68af\u5ea6\u5f52\u96f6</p>\n",61"<p>create workers </p>\n": "<p>\u521b\u5efa\u5de5\u4f5c\u4eba\u5458</p>\n",62"<p>exploration as a function of updates </p>\n": "<p>\u4f5c\u4e3a\u66f4\u65b0\u51fd\u6570\u7684\u63a2\u7d22</p>\n",63"<p>get the initial observations </p>\n": "<p>\u83b7\u5f97\u521d\u6b65\u89c2\u6d4b\u503c</p>\n",64"<p>initialize tensors for observations </p>\n": "<p>\u521d\u59cb\u5316\u89c2\u6d4b\u503c\u7684\u5f20\u91cf</p>\n",65"<p>learning rate </p>\n": "<p>\u5b66\u4e60\u7387</p>\n",66"<p>loss function </p>\n": "<p>\u635f\u5931\u51fd\u6570</p>\n",67"<p>number of training iterations </p>\n": "<p>\u8bad\u7ec3\u8fed\u4ee3\u6b21\u6570</p>\n",68"<p>number of updates </p>\n": "<p>\u66f4\u65b0\u6b21\u6570</p>\n",69"<p>number of workers </p>\n": "<p>\u5de5\u4f5c\u4eba\u5458\u4eba\u6570</p>\n",70"<p>optimizer </p>\n": "<p>\u4f18\u5316\u8005</p>\n",71"<p>reset the workers </p>\n": "<p>\u91cd\u7f6e\u5de5\u4f5c\u4eba\u5458</p>\n",72"<p>size of mini batch for training </p>\n": "<p>\u7528\u4e8e\u8bad\u7ec3\u7684\u5fae\u578b\u6279\u6b21\u7684\u5927\u5c0f</p>\n",73"<p>steps sampled on each update </p>\n": "<p>\u6bcf\u6b21\u66f4\u65b0\u65f6\u91c7\u6837\u7684\u6b65\u9aa4</p>\n",74"<p>target model to get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u8981\u83b7\u53d6\u7684\u76ee\u6807\u6a21\u578b<span translate=no>_^_0_^_</span></p>\n",75"<p>update current observation </p>\n": "<p>\u66f4\u65b0\u5f53\u524d\u89c2\u6d4b\u503c</p>\n",76"<p>update episode information. collect episode info, which is available if an episode finished; this includes total reward and length of the episode - look at <span translate=no>_^_0_^_</span> to see how it works. 
</p>\n": "<p>\u66f4\u65b0\u5267\u96c6\u4fe1\u606f\u3002\u6536\u96c6\u5267\u96c6\u4fe1\u606f\uff0c\u5982\u679c\u5267\u96c6\u7ed3\u675f\u5219\u53ef\u7528\uff1b\u8fd9\u5305\u62ec\u603b\u5956\u52b1\u548c\u5267\u96c6\u65f6\u957f\u2014\u2014\u770b\u770b<span translate=no>_^_0_^_</span>\u5b83\u662f\u5982\u4f55\u8fd0\u4f5c\u7684\u3002</p>\n",77"<p>update target network every 250 update </p>\n": "<p>\u6bcf 250 \u6b21\u66f4\u65b0\u4e00\u6b21\u76ee\u6807\u7f51\u7edc</p>\n",78"DQN Experiment with Atari Breakout": "\u4f7f\u7528 Atari Breakout \u8fdb\u884c DQN \u5b9e",79"Implementation of DQN experiment with Atari Breakout": "\u4f7f\u7528 Atari Breakout \u5b9e\u65bd DQN \u5b9e\u9a8c"80}8182