Path: blob/master/translate_cache/rl/dqn/experiment.ja.json
{1"<h1>DQN Experiment with Atari Breakout</h1>\n<p>This experiment trains a Deep Q Network (DQN) to play Atari Breakout game on OpenAI Gym. It runs the <a href=\"../game.html\">game environments on multiple processes</a> to sample efficiently.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13</h1>\n<p>\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\uff08DQN\uff09\u306bOpenAI Gym\u3067\u30a2\u30bf\u30ea\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u30b2\u30fc\u30e0\u3092\u30d7\u30ec\u30a4\u3059\u308b\u3088\u3046\u306b\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3057\u307e\u3059\u3002<a href=\"../game.html\">\u30b2\u30fc\u30e0\u74b0\u5883\u3092\u8907\u6570\u306e\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3057\u3066\u52b9\u7387\u7684\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u307e\u3059</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",2"<h2>Run it</h2>\n": "<h2>\u5b9f\u884c\u3057\u3066\u304f\u3060\u3055\u3044</h2>\n",3"<h2>Trainer</h2>\n": "<h2>\u30c8\u30ec\u30fc\u30ca\u30fc</h2>\n",4"<h3>Destroy</h3>\n<p>Stop the workers</p>\n": "<h3>\u7834\u58ca</h3>\n<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",5"<h3>Run training loop</h3>\n": "<h3>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30eb\u30fc\u30d7\u3092\u5b9f\u884c</h3>\n",6"<h3>Sample data</h3>\n": "<h3>\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf</h3>\n",7"<h3>Train the model</h3>\n": "<h3>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</h3>\n",8"<h4><span translate=no>_^_0_^_</span>-greedy Sampling</h4>\n<p>When sampling actions we use a <span translate=no>_^_1_^_</span>-greedy strategy, where we take a greedy action with probabiliy <span translate=no>_^_2_^_</span> and take a random action with probability <span translate=no>_^_3_^_</span>. 
We refer to <span translate=no>_^_4_^_</span> as <span translate=no>_^_5_^_</span>.</p>\n": "<h4><span translate=no>_^_0_^_</span>-\u8caa\u6b32\u306a\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0</h4>\n<p>\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3059\u308b\u3068\u304d\u306f\u3001<span translate=no>_^_1_^_</span>-greedy \u30b9\u30c8\u30e9\u30c6\u30b8\u30fc\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3064\u307e\u308a\u3001<span translate=no>_^_2_^_</span>\u78ba\u7387\u306e\u3042\u308b\u8caa\u6b32\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u3001\u78ba\u7387\u306e\u3042\u308b\u30e9\u30f3\u30c0\u30e0\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u307e\u3059\u3002<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u3068\u547c\u3073\u307e\u3059<span translate=no>_^_5_^_</span>\u3002</p>\n",9"<p><span translate=no>_^_0_^_</span> for prioritized replay </p>\n": "<p><span translate=no>_^_0_^_</span>\u512a\u5148\u518d\u751f\u7528</p>\n",10"<p><span translate=no>_^_0_^_</span> for replay buffer as a function of updates </p>\n": "<p><span translate=no>_^_0_^_</span>\u66f4\u65b0\u6a5f\u80fd\u3068\u3057\u3066\u306e\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u7528</p>\n",11"<p><span translate=no>_^_0_^_</span>, exploration fraction </p>\n": "<p><span translate=no>_^_0_^_</span>\u3001\u63a2\u67fb\u30d5\u30e9\u30af\u30b7\u30e7\u30f3</p>\n",12"<p>Add a new line to the screen periodically </p>\n": "<p>\u753b\u9762\u306b\u5b9a\u671f\u7684\u306b\u65b0\u3057\u3044\u884c\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044</p>\n",13"<p>Add transition to replay buffer </p>\n": "<p>\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306b\u30c8\u30e9\u30f3\u30b8\u30b7\u30e7\u30f3\u3092\u8ffd\u52a0</p>\n",14"<p>Calculate gradients </p>\n": "<p>\u52fe\u914d\u306e\u8a08\u7b97</p>\n",15"<p>Calculate priorities for replay buffer <span translate=no>_^_0_^_</span> </p>\n": "<p>\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306e\u512a\u5148\u5ea6\u3092\u8a08\u7b97 <span translate=no>_^_0_^_</span></p>\n",16"<p>Clip gradients </p>\n": "<p>\u30af\u30ea\u30c3\u30d7\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3</p>\n",17"<p>Collect information from each worker </p>\n": "<p>\u5404\u4f5c\u696d\u8005\u304b\u3089\u60c5\u5831\u3092\u53ce\u96c6\u3059\u308b</p>\n",18"<p>Compute Temporal Difference (TD) errors, <span translate=no>_^_0_^_</span>, and the loss, <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u6642\u5dee (TD) \u8aa4\u5dee<span translate=no>_^_0_^_</span>\u3001\u304a\u3088\u3073\u640d\u5931\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",19"<p>Configurations </p>\n": "<p>\u30b3\u30f3\u30d5\u30a3\u30ae\u30e5\u30ec\u30fc\u30b7\u30e7\u30f3</p>\n",20"<p>Copy to target network initially </p>\n": "<p>\u6700\u521d\u306b\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u306b\u30b3\u30d4\u30fc</p>\n",21"<p>Create the experiment </p>\n": "<p>\u5b9f\u9a13\u3092\u4f5c\u6210</p>\n",22"<p>Get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u53d6\u5f97 <span translate=no>_^_0_^_</span></p>\n",23"<p>Get Q_values for the current observation </p>\n": "<p>\u73fe\u5728\u306e\u89b3\u6e2c\u5024\u306e Q_value \u3092\u53d6\u5f97</p>\n",24"<p>Get results after executing the actions </p>\n": "<p>\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u305f\u5f8c\u306b\u7d50\u679c\u3092\u53d6\u5f97</p>\n",25"<p>Get the Q-values of the next state for <a href=\"index.html\">Double Q-learning</a>. 
Gradients shouldn't propagate for these </p>\n": "<p><a href=\"index.html\">\u4e8c\u91cdQ\u5b66\u7fd2\u306e\u6b21\u306e\u72b6\u614b\u306eQ\u5024\u3092\u53d6\u5f97\u3057\u307e\u3059</a>\u3002\u3053\u308c\u3089\u306e\u5834\u5408\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u4f1d\u64ad\u3057\u306a\u3044\u306f\u305a\u3067\u3059</p>\n",26"<p>Get the predicted Q-value </p>\n": "<p>\u4e88\u6e2c\u3055\u308c\u305f Q \u5024\u306e\u53d6\u5f97</p>\n",27"<p>Initialize the trainer </p>\n": "<p>\u30c8\u30ec\u30fc\u30ca\u30fc\u3092\u521d\u671f\u5316</p>\n",28"<p>Last 100 episode information </p>\n": "<p>\u6700\u65b0100\u8a71\u306e\u60c5\u5831</p>\n",29"<p>Learning rate. </p>\n": "<p>\u5b66\u7fd2\u7387\u3002</p>\n",30"<p>Mini batch size </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u30b5\u30a4\u30ba</p>\n",31"<p>Model for sampling and training </p>\n": "<p>\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3068\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u7528\u306e\u30e2\u30c7\u30eb</p>\n",32"<p>Number of epochs to train the model with sampled data. </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf\u3092\u4f7f\u7528\u3057\u3066\u30e2\u30c7\u30eb\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u30a8\u30dd\u30c3\u30af\u306e\u6570\u3002</p>\n",33"<p>Number of steps to run on each process for a single update </p>\n": "<p>1 \u56de\u306e\u66f4\u65b0\u3067\u5404\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3059\u308b\u30b9\u30c6\u30c3\u30d7\u306e\u6570</p>\n",34"<p>Number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",35"<p>Number of worker processes </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9\u306e\u6570</p>\n",36"<p>Periodically update target network </p>\n": "<p>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u5b9a\u671f\u7684\u306b\u66f4\u65b0</p>\n",37"<p>Pick the action based on <span translate=no>_^_0_^_</span> </p>\n": "<p>\u4ee5\u4e0b\u306b\u57fa\u3065\u3044\u3066\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u9078\u629e\u3057\u3066\u304f\u3060\u3055\u3044 <span translate=no>_^_0_^_</span></p>\n",38"<p>Replay buffer with <span translate=no>_^_0_^_</span>. Capacity of the replay buffer must be a power of 2. </p>\n": "<p>\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u306f<span translate=no>_^_0_^_</span>.\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306e\u5bb9\u91cf\u306f 2 \u306e\u7d2f\u4e57\u3067\u306a\u3051\u308c\u3070\u306a\u308a\u307e\u305b\u3093</p>\u3002\n",39"<p>Run and monitor the experiment </p>\n": "<p>\u5b9f\u9a13\u306e\u5b9f\u884c\u3068\u76e3\u8996</p>\n",40"<p>Run sampled actions on each worker </p>\n": "<p>\u5404\u30ef\u30fc\u30ab\u30fc\u3067\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c</p>\n",41"<p>Sample <span translate=no>_^_0_^_</span> </p>\n": "<p>[\u30b5\u30f3\u30d7\u30eb] <span translate=no>_^_0_^_</span></p>\n",42"<p>Sample actions </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3</p>\n",43"<p>Sample from priority replay buffer </p>\n": "<p>\u30d7\u30e9\u30a4\u30aa\u30ea\u30c6\u30a3\u30fb\u30ea\u30d7\u30ec\u30a4\u30fb\u30d0\u30c3\u30d5\u30a1\u304b\u3089\u306e\u30b5\u30f3\u30d7\u30eb</p>\n",44"<p>Sample the action with highest Q-value. This is the greedy action. 
</p>\n": "<p>Q\u5024\u304c\u6700\u3082\u9ad8\u3044\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u307e\u3059\u3002\u3053\u308c\u306f\u8caa\u6b32\u306a\u884c\u52d5\u3067\u3059</p>\u3002\n",45"<p>Sample with current policy </p>\n": "<p>\u73fe\u5728\u306e\u30dd\u30ea\u30b7\u30fc\u3092\u542b\u3080\u30b5\u30f3\u30d7\u30eb</p>\n",46"<p>Sampling doesn't need gradients </p>\n": "<p>\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u306f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u5fc5\u8981\u3042\u308a\u307e\u305b\u3093</p>\n",47"<p>Save tracked indicators. </p>\n": "<p>\u8ffd\u8de1\u6307\u6a19\u3092\u4fdd\u5b58\u3057\u307e\u3059\u3002</p>\n",48"<p>Scale observations from <span translate=no>_^_0_^_</span> to <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u89b3\u6e2c\u5024\u3092\u304b\u3089\u306b\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0 <span translate=no>_^_1_^_</span></p>\n",49"<p>Select device </p>\n": "<p>\u30c7\u30d0\u30a4\u30b9\u3092\u9078\u629e</p>\n",50"<p>Set learning rate </p>\n": "<p>\u5b66\u7fd2\u7387\u3092\u8a2d\u5b9a</p>\n",51"<p>Start training after the buffer is full </p>\n": "<p>\u30d0\u30c3\u30d5\u30a1\u30fc\u304c\u3044\u3063\u3071\u3044\u306b\u306a\u3063\u305f\u3089\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3092\u958b\u59cb\u3059\u308b</p>\n",52"<p>Stop the workers </p>\n": "<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",53"<p>Target model updating interval </p>\n": "<p>\u5bfe\u8c61\u30e2\u30c7\u30eb\u306e\u66f4\u65b0\u9593\u9694</p>\n",54"<p>This doesn't need gradients </p>\n": "<p>\u3053\u308c\u306b\u306f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u5fc5\u8981\u3042\u308a\u307e\u305b\u3093</p>\n",55"<p>Train the model </p>\n": "<p>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</p>\n",56"<p>Uniformly sample and action </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u3068\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5747\u4e00\u306b</p>\n",57"<p>Update parameters based on gradients </p>\n": "<p>\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306b\u57fa\u3065\u3044\u3066\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u66f4\u65b0</p>\n",58"<p>Update replay buffer priorities </p>\n": "<p>\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u306e\u512a\u5148\u9806\u4f4d\u3092\u66f4\u65b0</p>\n",59"<p>Whether to chose greedy action or the random action </p>\n": "<p>\u6b32\u5f35\u308a\u30a2\u30af\u30b7\u30e7\u30f3\u3068\u30e9\u30f3\u30c0\u30e0\u30a2\u30af\u30b7\u30e7\u30f3\u306e\u3069\u3061\u3089\u3092\u9078\u3076\u304b</p>\n",60"<p>Zero out the previously calculated gradients </p>\n": "<p>\u4ee5\u524d\u306b\u8a08\u7b97\u3057\u305f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u3092\u30bc\u30ed\u306b\u3057\u307e\u3059</p>\n",61"<p>create workers </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u3092\u4f5c\u6210</p>\n",62"<p>exploration as a function of updates </p>\n": "<p>\u66f4\u65b0\u6a5f\u80fd\u3068\u3057\u3066\u306e\u63a2\u7d22</p>\n",63"<p>get the initial observations </p>\n": "<p>\u521d\u671f\u89b3\u6e2c\u5024\u3092\u53d6\u5f97</p>\n",64"<p>initialize tensors for observations </p>\n": "<p>\u89b3\u6e2c\u7528\u306e\u30c6\u30f3\u30bd\u30eb\u3092\u521d\u671f\u5316</p>\n",65"<p>learning rate </p>\n": "<p>\u5b66\u7fd2\u7387</p>\n",66"<p>loss function </p>\n": "<p>\u640d\u5931\u95a2\u6570</p>\n",67"<p>number of training iterations </p>\n": "<p>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u306e\u53cd\u5fa9\u56de\u6570</p>\n",68"<p>number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",69"<p>number of workers </p>\n": 
"<p>\u52b4\u50cd\u8005\u306e\u6570</p>\n",70"<p>optimizer </p>\n": "<p>\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc</p>\n",71"<p>reset the workers </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u3092\u30ea\u30bb\u30c3\u30c8</p>\n",72"<p>size of mini batch for training </p>\n": "<p>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u7528\u30df\u30cb\u30d0\u30c3\u30c1\u306e\u30b5\u30a4\u30ba</p>\n",73"<p>steps sampled on each update </p>\n": "<p>\u66f4\u65b0\u306e\u305f\u3073\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u308b\u30b9\u30c6\u30c3\u30d7</p>\n",74"<p>target model to get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u53d6\u5f97\u3059\u308b\u5bfe\u8c61\u30e2\u30c7\u30eb <span translate=no>_^_0_^_</span></p>\n",75"<p>update current observation </p>\n": "<p>\u73fe\u5728\u306e\u89b3\u6e2c\u5024\u3092\u66f4\u65b0</p>\n",76"<p>update episode information. collect episode info, which is available if an episode finished; this includes total reward and length of the episode - look at <span translate=no>_^_0_^_</span> to see how it works. </p>\n": "<p>\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u3092\u66f4\u65b0\u3057\u307e\u3059\u3002\u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u7d42\u4e86\u3057\u305f\u5834\u5408\u306b\u5229\u7528\u3067\u304d\u308b\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u3092\u53ce\u96c6\u3057\u307e\u3059\u3002\u3053\u308c\u306b\u306f\u3001\u5408\u8a08\u5831\u916c\u3068\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u9577\u3055\u304c\u542b\u307e\u308c\u307e\u3059\u3002\u4ed5\u7d44\u307f\u3092\u78ba\u8a8d\u3057\u3066\u307f\u3066\u304f\u3060\u3055\u3044\u3002<span translate=no>_^_0_^_</span></p>\n",77"<p>update target network every 250 update </p>\n": "<p>250 \u56de\u306e\u66f4\u65b0\u3054\u3068\u306b\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u66f4\u65b0</p>\n",78"DQN Experiment with Atari Breakout": "\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13",79"Implementation of DQN experiment with Atari Breakout": "\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13\u306e\u5b9f\u65bd"80}8182