Optimize TensorFlow GPU performance with the TensorFlow Profiler
Overview
This guide shows how to use the TensorFlow Profiler with TensorBoard to gain insight into your GPUs, get the maximum performance out of them, and debug cases where one or more GPUs are underutilized.
If you are new to the Profiler:
Get started with the TensorFlow Profiler: Profile model performance notebook, which uses a Keras example and TensorBoard.
Learn about the profiling tools and methods available for optimizing TensorFlow performance on the host (CPU) in the Optimize TensorFlow performance using the Profiler guide.
Keep in mind that offloading computations to the GPU is not always beneficial, particularly for small models. Overhead can come from:
Data transfer between the host (CPU) and the device (GPU).
The latency involved when the host launches GPU kernels.
Performance optimization workflow
This guide outlines how to debug performance issues, starting with a single GPU and then moving to a single host with multiple GPUs.
It is recommended to debug performance issues in the following order:
1. Optimize and debug the performance on one GPU:
   1. Check whether the input pipeline is a bottleneck.
   2. Debug the performance of one GPU.
   3. Enable mixed precision (with fp16 (float16)) and optionally enable XLA.
2. Optimize and debug the performance on the multi-GPU single host.
For example, if you are using a TensorFlow distribution strategy to train a model on a single host with multiple GPUs and notice suboptimal GPU utilization, first optimize and debug the performance for one GPU before debugging the multi-GPU system.
As a baseline for getting performant code on GPUs, this guide assumes you are already using tf.function. The Keras Model.compile and Model.fit APIs use tf.function automatically under the hood. When writing a custom training loop with tf.GradientTape, refer to Better performance with tf.function for how to enable tf.function.
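For a custom training loop, wrapping the training step in tf.function might look like the following minimal sketch. The toy model, optimizer, loss, and random data are hypothetical placeholders chosen only to make the example runnable; they are not taken from the guide.

```python
import tensorflow as tf

# Hypothetical toy model and data, used only to make the sketch runnable.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()


@tf.function  # trace the step into a graph instead of running op by op eagerly
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
loss = train_step(x, y)
print(float(loss))
```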
The next sections discuss suggested approaches for each of the scenarios above to help identify and fix performance bottlenecks.
1. Optimize the performance on one GPU
In an ideal case, your program should have high GPU utilization, minimal CPU (host) to GPU (device) communication, and no overhead from the input pipeline.
The first step in analyzing performance is to get a profile for a model running on one GPU.
TensorBoard's Profiler overview page, which shows a top-level view of how your model performed during a profile run, can give you an idea of how far your program is from the ideal scenario.

The key numbers to pay attention to on the overview page are:
How much of the step time comes from actual device execution
The percentage of ops placed on the device vs. the host
How many kernels use fp16
Achieving optimal performance means maximizing these numbers in all three cases. To get an in-depth understanding of your program, you need to be familiar with TensorBoard's Profiler trace viewer. The sections below show some common trace viewer patterns to look for when diagnosing performance bottlenecks.
Below is an image of a model trace view running on one GPU. From the TensorFlow Name Scope and TensorFlow Ops sections, you can identify different parts of the model, such as the forward pass, the loss function, the backward pass/gradient calculation, and the optimizer weight update. You can also see the ops running on the GPU next to each Stream, which refers to a CUDA stream. Each stream is used for specific tasks. In this trace, Stream#118 is used to launch compute kernels and device-to-device copies. Stream#119 is used for host-to-device copies and Stream#120 for device-to-host copies.
The trace below shows common characteristics of a performant model.

For example, the GPU compute timeline (Stream#118) looks "busy" with very few gaps. There are minimal copies from host to device (Stream#119) and from device to host (Stream#120), as well as minimal gaps between steps. When you run the Profiler for your program, you may not be able to identify these ideal characteristics in your trace view. The rest of this guide covers common scenarios and how to fix them.
1. Debug the input pipeline
The first step in GPU performance debugging is to determine whether your program is input bound. The easiest way to find out is to use the Profiler's input pipeline analyzer in TensorBoard, which provides an overview of the time spent in the input pipeline.

If your input pipeline contributes significantly to step time, you can take the following actions:
Use the tf.data-specific guide to learn how to debug your input pipeline.
Another quick way to check whether the input pipeline is the bottleneck is to use randomly generated input data that needs no preprocessing. There is an example of this technique for a ResNet model. If the input pipeline is optimal, you should see similar performance with real data and with generated random/synthetic data. The only overhead in the synthetic data case comes from copying the input data, which can itself be prefetched and optimized.
In addition, refer to the best practices for optimizing the input data pipeline.
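As a rough sketch of the synthetic-data check described above, the snippet below builds a tf.data pipeline that repeats one randomly generated batch, so no preprocessing runs per step. The function name synthetic_dataset, the shapes, and the class count are illustrative assumptions, not part of the ResNet example the guide refers to.

```python
import tensorflow as tf


def synthetic_dataset(batch_size, image_shape=(224, 224, 3), num_classes=1000):
    """Return an endless dataset that yields the same random batch every step."""
    images = tf.random.uniform((batch_size, *image_shape))
    labels = tf.random.uniform((batch_size,), maxval=num_classes, dtype=tf.int32)
    # from_tensors wraps the whole batch as one element; repeat() reuses it forever.
    dataset = tf.data.Dataset.from_tensors((images, labels)).repeat()
    return dataset.prefetch(tf.data.AUTOTUNE)


# Small hypothetical shapes so the sketch runs quickly.
ds = synthetic_dataset(4, image_shape=(32, 32, 3), num_classes=10)
images, labels = next(iter(ds))
print(images.shape, labels.shape)
```

Swapping a dataset like this in for the real one and comparing step times gives a sense of how much the real input pipeline costs.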
2. Debug the performance of one GPU
Several factors can contribute to low GPU utilization. Below are some scenarios commonly observed in the trace viewer, along with potential solutions.
1. Analyze gaps between steps
A common observation when your program is not running optimally is gaps between training steps. In the image of the trace view below, there is a large gap between steps 8 and 9, meaning that the GPU is idle during that time.

If your trace viewer shows large gaps between steps, this could indicate that your program is input bound. In that case, refer to the previous section on debugging your input pipeline if you have not already done so.
However, even with an optimized input pipeline, gaps can still appear between the end of one step and the start of the next because of CPU thread contention. tf.data uses background threads to parallelize pipeline processing. These threads may interfere with GPU host-side activity at the beginning of each step, such as copying data or scheduling GPU operations.
If you notice large gaps on the host side, which schedules these ops on the GPU, you can set the environment variable TF_GPU_THREAD_MODE=gpu_private. This ensures that GPU kernels are launched from their own dedicated threads and do not get queued behind tf.data work.
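One way to set this variable from Python is sketched below. It assumes, to be safe, that the variable should be in place before TensorFlow is imported; exporting it in the shell that launches training is an equivalent option.

```python
import os

# Set the variable before TensorFlow is imported so it is already in place
# when the GPU devices are initialized (an assumption made to be safe).
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_GPU_THREAD_MODE"])
```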
Gaps between steps can also be caused by metric calculations, Keras callbacks, or ops outside of tf.function that run on the host. These ops do not perform as well as ops inside a TensorFlow graph. Additionally, some of these ops run on the CPU and copy tensors back and forth from the GPU.
If you still notice gaps between steps in the trace viewer after optimizing your input pipeline, look at the model code that runs between steps and check whether disabling callbacks or metrics improves performance. Some details of these ops are also visible in the trace viewer (on both the device and host side). The recommendation in this scenario is to amortize the overhead of these ops by executing them after a fixed number of steps instead of every step. When using the Model.compile method in the tf.keras API, setting the steps_per_execution flag does this automatically. For custom training loops, use tf.while_loop.
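A minimal sketch of setting steps_per_execution with Model.compile follows. The toy model, the random data, and the value 8 are hypothetical, chosen only so the example runs quickly.

```python
import tensorflow as tf

# Hypothetical toy model and data.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])

# Run 8 training steps per tf.function call, so host-side work such as
# callbacks and metric updates happens once per 8 steps instead of every step.
model.compile(optimizer="sgd", loss="mse", steps_per_execution=8)

x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))
history = model.fit(x, y, batch_size=8, epochs=1, verbose=0)
print(history.history["loss"])
```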
2. Achieve higher device utilization
1. Small GPU kernels and host kernel launch delays
The host enqueues kernels to be run on the GPU, but there is a latency (around 20 to 40 μs) before kernels are actually executed on the GPU. In an ideal case, the host enqueues enough kernels that the GPU spends most of its time executing rather than waiting for the host to enqueue more kernels.
The Profiler's overview page in TensorBoard shows how much time the GPU was idle while waiting for the host to launch kernels. In the image below, the GPU is idle for about 10% of the step time waiting for kernels to be launched.

The trace viewer for this same program shows small gaps between kernels where the host is busy launching kernels on the GPU.

When many small ops (such as a scalar add) are launched on the GPU, the host might not keep up with the GPU. The TensorFlow Stats tool in TensorBoard for the same profile shows 126,224 Mul operations taking 2.77 seconds. Each kernel therefore takes about 21.9 μs, which is very small (around the same time as the launch latency) and can result in host kernel launch delays.
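The per-kernel figure quoted above follows from a simple division, reproduced here as a quick check using only the numbers stated in the text:

```python
total_seconds = 2.77      # total time spent in Mul ops, from TensorFlow Stats
num_kernels = 126_224     # number of Mul operations

avg_us = total_seconds / num_kernels * 1e6   # average time per kernel in microseconds
# The text quotes a launch latency of roughly 20 to 40 microseconds.
comparable_to_launch_latency = 20 <= avg_us <= 40

print(f"{avg_us:.1f} us per kernel")
print(comparable_to_launch_latency)
```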

If your trace viewer shows many small gaps between ops on the GPU, as in the image above, you can:
Concatenate small tensors and use vectorized ops, or use a larger batch size, so that each launched kernel does more work. This keeps the GPU busy for longer.
Make sure you are using tf.function to create TensorFlow graphs, so that you are not running ops in pure eager mode. If you are using Model.fit (as opposed to a custom training loop with tf.GradientTape), tf.keras.Model.compile does this for you automatically.
Fuse kernels using XLA with tf.function(jit_compile=True) or auto-clustering. For more details, go to the Enable mixed precision and XLA section below to learn how to enable XLA for higher performance. This feature can lead to higher device utilization.
2. TensorFlow op placement
The Profiler overview page shows the percentage of ops placed on the host vs. the device (you can also verify the placement of specific ops by looking at the trace viewer). As in the image below, you want the percentage of ops on the host to be very small compared to the device.

Ideally, most of the compute-intensive ops should be placed on the GPU.
To find out which devices the operations and tensors in your model are assigned to, set tf.debugging.set_log_device_placement(True) as the first statement of your program.
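For instance, a minimal program that logs device placement could look like this. The tensor values are arbitrary, and the device reported depends on the hardware available.

```python
import tensorflow as tf

# Must come before any ops run so that every placement is logged.
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)   # the log shows which device executed MatMul

print(b.device)       # full device string for the result tensor
```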
Note that in some cases, even if you specify that an op should be placed on a particular device, its implementation might override this condition (example: tf.unique). Even for single-GPU training, specifying a distribution strategy such as tf.distribute.OneDeviceStrategy can result in more deterministic placement of ops on your device.
One reason for placing the majority of ops on the GPU is to prevent excessive memory copies between the host and the device (memory copies for model input/output data between host and device are expected). An example of excessive copying is shown in the trace view below on GPU streams #167, #168, and #169.

These copies can hurt performance if they block GPU kernels from executing. Memory copy operations in the trace viewer carry more information about the ops that are the source of the copied tensors, but it is not always easy to associate a memCopy with an op. In these cases, it helps to look at the nearby ops and check whether the memory copy happens at the same location in every step.
3. More efficient kernels on GPUs
Once your program's GPU utilization is acceptable, the next step is to look into increasing the efficiency of the GPU kernels by utilizing Tensor Cores or fusing ops.
1. Utilize Tensor Cores
Modern NVIDIA® GPUs have specialized Tensor Cores that can significantly improve the performance of eligible kernels.
You can use TensorBoard's GPU kernel stats to see which GPU kernels are Tensor Core-eligible and which kernels are actually using Tensor Cores. Enabling fp16 (see the Enable mixed precision section below) is one way to make your program's General Matrix Multiply (GEMM) kernels (matmul ops) utilize Tensor Cores. GPU kernels use Tensor Cores efficiently when the precision is fp16 and the input/output tensor dimensions are divisible by 8 or 16 (for int8).
Note: With cuDNN v7.6.3 and later, convolution dimensions are automatically padded where necessary to leverage Tensor Cores.
For other detailed recommendations on how to make kernels efficient for GPUs, refer to the NVIDIA® deep learning performance guide.
2. Fuse ops
Use tf.function(jit_compile=True) to fuse smaller ops into bigger kernels, which can lead to significant performance gains. To learn more, refer to the XLA guide.
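A minimal sketch of compiling a function with XLA is shown below. The function dense_block and its tensor shapes are hypothetical, and the sketch assumes a TensorFlow build with XLA support.

```python
import tensorflow as tf


@tf.function(jit_compile=True)  # ask XLA to compile (and fuse) the ops in this function
def dense_block(x, w, b):
    # matmul, bias add, relu, and the residual add are candidates for fusion.
    return x + tf.nn.relu(tf.matmul(x, w) + b)


x = tf.random.normal((8, 16))
w = tf.random.normal((16, 16))
b = tf.zeros((16,))
out = dense_block(x, w, b)
print(out.shape)
```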
3. Enable mixed precision and XLA
After following the steps above, enabling mixed precision and XLA are two optional steps you can take to improve performance further. The suggested approach is to enable them one at a time and verify that the performance benefits are as expected.
1. Enable mixed precision
The TensorFlow Mixed precision guide shows how to enable fp16 precision on GPUs. Enable AMP on NVIDIA® GPUs to use Tensor Cores and realize up to 3x overall speedups compared with using only fp32 (float32) precision on Volta and newer GPU architectures.
Make sure that matrix/tensor dimensions satisfy the requirements for calling kernels that use Tensor Cores. GPU kernels use Tensor Cores efficiently when the precision is fp16 and the input/output dimensions are divisible by 8 or 16 (for int8).
Note that with cuDNN v7.6.3 and later, convolution dimensions are automatically padded where necessary to leverage Tensor Cores.
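A small helper can check the divisibility requirement and suggest a padded size. This is an illustrative sketch: the function names and the example sizes are hypothetical, and the multiple to use (8 for fp16, 16 for int8) is taken from the text above.

```python
def round_up(dim: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= dim."""
    return -(-dim // multiple) * multiple


def dims_divisible(shape, multiple: int = 8) -> bool:
    """True if every dimension in `shape` is divisible by `multiple`."""
    return all(d % multiple == 0 for d in shape)


# Hypothetical example: a vocabulary of 33708 entries is not divisible by 8.
vocab_size = 33708
padded_vocab = round_up(vocab_size)           # next multiple of 8, for fp16

print(padded_vocab)                           # 33712
print(dims_divisible((512, vocab_size)))      # False
print(dims_divisible((512, padded_vocab)))    # True
```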
Follow the best practices below to maximize the performance benefits of fp16 precision.
1. Use optimal fp16 kernels
With fp16 enabled, your program's matrix multiplication (GEMM) kernels should use the corresponding fp16 version that utilizes Tensor Cores. However, in some cases this does not happen, and you do not get the expected speedup from enabling fp16 because your program falls back to an inefficient implementation instead.

The GPU kernel stats page shows which ops are Tensor Core-eligible and which kernels are actually using the efficient Tensor Cores. The NVIDIA® guide on deep learning performance contains additional suggestions on how to leverage Tensor Cores. Additionally, the benefits of fp16 also show up in kernels that were previously memory bound, since the ops now take half the time.
2. Dynamic vs. static loss scaling
Loss scaling is necessary when using fp16 to prevent underflow due to low precision. There are two types of loss scaling, dynamic and static, both of which are explained in more detail in the Mixed precision guide. You can use the mixed_float16 policy to automatically enable loss scaling within the Keras optimizer.
Note: The Keras mixed precision API defaults to evaluating standalone softmax ops (ops that are not part of a Keras loss function) as fp16, which can lead to numerical issues and poor convergence. Cast such ops to fp32 for optimal performance.
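Putting the policy and the note above together, a minimal Keras sketch might look like this. The layer sizes are hypothetical; the softmax is kept in float32 by giving the final Activation layer an explicit dtype.

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(128, activation="relu")(inputs)  # sizes divisible by 8
logits = tf.keras.layers.Dense(16)(x)
# Keep the standalone softmax in float32 for numerical stability.
outputs = tf.keras.layers.Activation("softmax", dtype="float32")(logits)
model = tf.keras.Model(inputs, outputs)

print(model.layers[1].compute_dtype)  # compute dtype of the first Dense layer
print(outputs.dtype)                  # dtype of the model output
```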
When trying to optimize performance, it is important to remember that dynamic loss scaling can introduce additional conditional ops that run on the host and lead to gaps between steps in the trace viewer. Static loss scaling, on the other hand, does not have such overhead and can be the better option in terms of performance, with the catch that you need to specify the correct static loss scale value.
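To see why loss scaling matters, the following sketch uses only Python's struct module to round values to half precision. The gradient value and the static scale of 1024 are hypothetical numbers chosen to show the effect; this illustrates the underflow problem and is not TensorFlow's implementation.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8           # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical static loss scale

unscaled = to_fp16(grad)               # too small for fp16: underflows to zero
scaled = to_fp16(grad * loss_scale)    # large enough to be represented in fp16
recovered = scaled / loss_scale        # unscale in higher precision

print(unscaled, scaled, recovered)
```

Scaling the loss scales every gradient by the same factor, so small gradients survive the fp16 backward pass and can be divided by the scale again before the weight update.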
2. Enable XLA with tf.function(jit_compile=True) or auto-clustering
As a final step in getting the best performance with a single GPU, you can experiment with enabling XLA, which fuses ops and leads to better device utilization and a lower memory footprint. For details on how to enable XLA in your program with tf.function(jit_compile=True) or auto-clustering, refer to the XLA guide.
You can set the global JIT level to -1 (off), 1, or 2. A higher level is more aggressive and may reduce parallelism and use more memory. Set the value to 1 if you have memory restrictions. Note that XLA does not perform well for models with variable input tensor shapes, because the XLA compiler has to keep compiling kernels whenever it encounters new shapes.
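Auto-clustering can be switched on through the TF_XLA_FLAGS environment variable. The sketch below assumes that the global JIT level described above corresponds to the --tf_xla_auto_jit flag; the helper name is hypothetical.

```python
import os


def xla_auto_jit_flag(level: int) -> str:
    """Build the TF_XLA_FLAGS value for a global JIT level (-1 = off, 1, or 2)."""
    if level not in (-1, 1, 2):
        raise ValueError("level must be -1, 1, or 2")
    return f"--tf_xla_auto_jit={level}"


# Level 2 is the more aggressive setting; prefer 1 under memory pressure.
# To be safe, set the variable before TensorFlow is imported.
os.environ["TF_XLA_FLAGS"] = xla_auto_jit_flag(2)
print(os.environ["TF_XLA_FLAGS"])
```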
2. Optimize the performance on the multi-GPU single host
The tf.distribute.MirroredStrategy API can be used to scale model training from one GPU to multiple GPUs on a single host. (To learn more about distributed training with TensorFlow, refer to the Distributed training with TensorFlow, Use a GPU, and Use TPUs guides and the Distributed training with Keras tutorial.)
Although the transition from one GPU to multiple GPUs should ideally be scalable out of the box, you can sometimes encounter performance issues.
When going from training with a single GPU to multiple GPUs on the same host, you should ideally see performance scale with only the additional overhead of gradient communication and increased host thread utilization. Because of this overhead, you will not get an exact 2x speedup if you move from 1 to 2 GPUs, for example.
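A minimal sketch of moving a Keras model to MirroredStrategy follows. The toy model and the per-replica batch size of 64 are hypothetical; on a machine without GPUs the strategy runs with a single replica.

```python
import tensorflow as tf

# Uses all visible GPUs on this host; the variables are mirrored on each one.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Hypothetical toy model; variables must be created inside the scope.
    model = tf.keras.Sequential([tf.keras.Input(shape=(32,)), tf.keras.layers.Dense(8)])
    model.compile(optimizer="sgd", loss="mse")

# Scale the global batch with the number of replicas so that each GPU
# still processes the same per-replica batch.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
print(strategy.num_replicas_in_sync, global_batch_size)
```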
The trace view below shows an example of the extra communication overhead when training on multiple GPUs. There is some overhead to concatenate the gradients, communicate them across replicas, and split them before doing the weight update.

The following checklist will help you achieve better performance when optimizing in the multi-GPU scenario:
1. Try to maximize the batch size, which leads to higher device utilization and amortizes the cost of communication across multiple GPUs. The memory profiler helps you get a sense of how close your program is to peak memory utilization. Note that while a larger batch size can affect convergence, this is usually outweighed by the performance benefits.
2. When moving from a single GPU to multiple GPUs, the same host has to process much more input data. So, after (1), it is recommended to re-check the input pipeline performance and make sure it is not a bottleneck.
3. Check the GPU timeline in your program's trace view for any unnecessary AllReduce calls, because each call results in a synchronization across all devices. In the trace view shown above, the AllReduce is done via the NCCL kernel, and there is only one NCCL call on each GPU for the gradients on each step.
4. Check for unnecessary D2H, H2D, and D2D copy operations that can be minimized.
5. Check the step time to make sure each replica is doing the same amount of work. For example, one GPU (typically GPU0) can be oversubscribed because the host mistakenly ends up putting more work on it.
6. Lastly, check the training step across all GPUs in your trace view for any ops that are executing sequentially. This usually happens when your program includes control dependencies from one GPU to another. In the past, debugging performance in this situation has been handled on a case-by-case basis. If you observe this behavior in your program, file a GitHub issue with images of your trace view.
1. Optimize gradient AllReduce
When training with a synchronous strategy, each device receives a portion of the input data.
After computing the forward and backward passes through the model, the gradients calculated on each device need to be aggregated and reduced. This gradient AllReduce happens after the gradient calculation on each device and before the optimizer updates the model weights.
Each GPU first concatenates the gradients across the model layers, communicates them across GPUs using tf.distribute.CrossDeviceOps (tf.distribute.NcclAllReduce is the default), and then returns the gradients after reduction per layer.
The optimizer uses these reduced gradients to update the weights of your model. Ideally, this process should happen at the same time on all GPUs to prevent any overhead.
The time to AllReduce should be approximately (number of parameters × 4 bytes) / (communication bandwidth).
This calculation is useful as a quick check of whether the performance of a distributed training job is as expected or whether further performance debugging is needed. You can get the number of parameters in your model from Model.summary.
Note that each model parameter is 4 bytes in size because TensorFlow uses fp32 (float32) to communicate gradients. Even when fp16 is enabled, NCCL AllReduce uses fp32 parameters.
To get the benefits of scaling, the step time needs to be much larger than these overheads. One way to achieve this is to use a larger batch size, because batch size affects step time but does not affect the communication overhead.
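As a back-of-the-envelope sketch of this check, the estimate can be computed directly. The parameter count, bandwidth, and step times below are hypothetical values, not measurements.

```python
def allreduce_seconds(num_params, bandwidth_bytes_per_s, bytes_per_param=4):
    """Approximate AllReduce time: (parameters * bytes per parameter) / bandwidth."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


num_params = 25_000_000   # hypothetical model size, e.g. read from Model.summary
bandwidth = 10e9          # hypothetical interconnect bandwidth of 10 GB/s
comm = allreduce_seconds(num_params, bandwidth)

# Hypothetical step times; the longer one stands in for a larger batch size.
for step_time in (0.1, 0.2):
    share = comm / step_time
    print(f"step {step_time:.1f} s: communication is {share:.0%} of the step")
```

The communication time stays fixed while the step time grows with the batch size, so the relative overhead shrinks.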
2. GPU host thread contention
When running multiple GPUs, the CPU's job is to keep all of the devices busy by efficiently launching GPU kernels across the devices.
However, when there are many independent operations that the CPU can schedule on one GPU, the CPU can decide to use many of its host threads to keep one GPU busy and then launch kernels on another GPU in a non-deterministic order. This can cause skew or negative scaling, which can hurt performance.
The trace viewer below shows the overhead when the CPU staggers GPU kernel launches inefficiently: GPU1 is idle and only starts running ops after GPU2 has started.

The trace view for the host shows that the host is launching kernels on GPU2 before launching them on GPU1 (note that the tf_Compute* ops below are not indicative of CPU threads).

If you see this kind of staggering of GPU kernels in your program's trace view, the recommended action is to:
Set the TensorFlow environment variable TF_GPU_THREAD_MODE to gpu_private. This environment variable tells the host to keep the threads for a GPU private.
By default, TF_GPU_THREAD_MODE=gpu_private sets the number of threads to 2, which is sufficient in most cases. However, that number can be changed by setting the TensorFlow environment variable TF_GPU_THREAD_COUNT to the desired number of threads.
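One way to apply both variables when launching a training job is to build the environment for a child process. The helper name, the thread count of 4, and the train.py script are hypothetical.

```python
import os


def gpu_private_env(thread_count=None, base_env=None):
    """Return a copy of the environment with private GPU threads enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_GPU_THREAD_MODE"] = "gpu_private"
    if thread_count is not None:
        # Override the default number of private threads per GPU.
        env["TF_GPU_THREAD_COUNT"] = str(thread_count)
    return env


env = gpu_private_env(thread_count=4, base_env={})
print(env)

# Hypothetical launch of a training script with this environment:
# subprocess.run([sys.executable, "train.py"], env=env, check=True)
```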