Copyright 2020 The TensorFlow Authors.
Parameter server training with ParameterServerStrategy
Overview
Parameter server training is a common data-parallel method for scaling up model training across multiple machines.
A parameter server training cluster consists of workers and parameter servers. Variables are created on the parameter servers, and the workers read and update them in each step. By default, workers read and update these variables independently, without synchronizing with each other. This is why parameter server-style training is sometimes called asynchronous training.
In TensorFlow 2, parameter server training is powered by the tf.distribute.ParameterServerStrategy class, which distributes the training steps to a cluster that scales up to thousands of workers (accompanied by parameter servers).
Supported training methods
There are two main supported training methods:
- The Keras Model.fit API: use this if you prefer a high-level abstraction that handles training for you. This is generally recommended if you are training a tf.keras.Model.
- A custom training loop: use this if you prefer to define the details of the training loop yourself (for details, refer to the guides on Custom training, Writing a training loop from scratch, and Custom training loop with Keras and MultiWorkerMirroredStrategy).
Clusters with jobs and tasks
Regardless of the API of choice (Model.fit or a custom training loop), distributed training in TensorFlow 2 involves several 'jobs', and each job may have one or more 'tasks'.
When using parameter server training, it is recommended to have:
- One coordinator job (with the job name chief)
- Multiple worker jobs (with the job name worker)
- Multiple parameter server jobs (with the job name ps)
The coordinator creates resources, dispatches training tasks, writes checkpoints, and deals with task failures. The workers and parameter servers run tf.distribute.Server instances that listen for requests from the coordinator.
Parameter server training with the Model.fit API
Parameter server training with the Model.fit API requires the coordinator to use a tf.distribute.ParameterServerStrategy object. As with Model.fit used without a strategy, or with other strategies, the workflow involves creating and compiling the model, preparing the callbacks, and calling Model.fit.
Parameter server training with a custom training loop
With custom training loops, the tf.distribute.coordinator.ClusterCoordinator class is the key component used for the coordinator.
- The ClusterCoordinator class needs to work in conjunction with a tf.distribute.ParameterServerStrategy object.
- This tf.distribute.Strategy object is needed to provide the information about the cluster, and it is used to define a training step, as demonstrated in Custom training with tf.distribute.Strategy.
- The ClusterCoordinator object then dispatches the execution of these training steps to remote workers.
The most important API provided by the ClusterCoordinator object is schedule:
- The schedule API enqueues a tf.function and returns a future-like RemoteValue immediately.
- The queued functions are dispatched to remote workers in background threads, and their RemoteValues are filled asynchronously.
- Since schedule does not require worker assignment, the tf.function passed in can be executed on any available worker.
- If the worker executing a function becomes unavailable before the function completes, the function is retried on another available worker.
- Because of this, and because function execution is not atomic, a single function call may be executed more than once.
In addition to dispatching remote functions, the ClusterCoordinator also helps to create datasets on all the workers and to rebuild these datasets when a worker recovers from failure.
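The retry behavior described above can be illustrated without TensorFlow. The following is a toy simulation in plain Python, not the ClusterCoordinator implementation: FlakyWorker, schedule_with_retry, and the failure pattern are hypothetical names invented for this sketch. It shows why a function with side effects may run more than once when a worker fails after doing the work but before reporting completion.

```python
class WorkerUnavailable(Exception):
    """Raised by the toy worker to mimic a worker going away mid-call."""


class FlakyWorker:
    """Hypothetical worker that can fail after the function has already run."""

    def __init__(self, name, fail_after_side_effect=False):
        self.name = name
        self.fail_after_side_effect = fail_after_side_effect

    def run(self, fn):
        result = fn()  # the side effect happens here
        if self.fail_after_side_effect:
            # The work was done, but the worker dies before reporting back.
            raise WorkerUnavailable(self.name)
        return result


def schedule_with_retry(fn, workers):
    """Tries each available worker in turn until one completes the call."""
    for worker in workers:
        try:
            return worker.run(fn)
        except WorkerUnavailable:
            continue  # retry the same function on the next worker
    raise RuntimeError("no worker completed the function")


applied_updates = []


def training_step():
    # Stands in for a variable update on a parameter server.
    applied_updates.append(1)
    return len(applied_updates)


workers = [
    FlakyWorker("worker-0", fail_after_side_effect=True),
    FlakyWorker("worker-1"),
]
result = schedule_with_retry(training_step, workers)
print("scheduled once, executed", len(applied_updates), "times")
```

In this sketch a single scheduled call applies its update twice, which is why training steps dispatched this way should tolerate repeated execution.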
Tutorial setup
The sections of this tutorial branch into a Model.fit path and a custom training loop path. Sections other than "Training with X" apply to both paths.
Cluster setup
As mentioned above, a parameter server training cluster requires a coordinator task that runs your training program, and one or several workers and parameter server tasks that run TensorFlow servers (tf.distribute.Server). In some cases it also needs an additional evaluation task that runs sidecar evaluation (refer to the sidecar evaluation section below). The requirements to set them up are:
- The coordinator task needs to know the addresses and ports of all other TensorFlow servers, except the evaluator.
- The workers and parameter servers need to know which port they need to listen on. Usually, you pass in the complete cluster information when creating the TensorFlow servers on these tasks.
- The evaluator task does not have to know the setup of the training cluster. If it does, it should not attempt to connect to the training cluster.
- Workers and parameter servers should have the task types "worker" and "ps", respectively. The coordinator should use "chief" as the task type for legacy reasons.
In this tutorial, you will create an in-process cluster so that the whole parameter server training can be run in Colab. How to set up real clusters is covered in a later section.
In-process cluster
You will start by creating several TensorFlow servers in advance and connect to them later. Note that this is only for the purpose of this tutorial's demonstration; in real training the servers are started on the "worker" and "ps" machines.
The in-process cluster setup is also frequently used in unit tests.
Another option for local testing is to launch processes on the local machine. Refer to Multi-worker training with Keras for an example of this approach.
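A minimal sketch of such an in-process cluster is shown here. The helper names pick_unused_port and create_in_process_cluster, and the choice of 3 workers and 2 parameter servers, are assumptions made for this illustration; ports are picked with the standard library socket module as a stand-in for a dedicated port-picking package.

```python
import multiprocessing
import os
import socket

import tensorflow as tf


def pick_unused_port():
    """Asks the OS for a currently free port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]


def create_in_process_cluster(num_workers, num_ps):
    """Starts local "worker" and "ps" servers and returns a cluster resolver."""
    cluster_dict = {
        "worker": [f"localhost:{pick_unused_port()}" for _ in range(num_workers)]
    }
    if num_ps > 0:
        cluster_dict["ps"] = [
            f"localhost:{pick_unused_port()}" for _ in range(num_ps)]
    cluster_spec = tf.train.ClusterSpec(cluster_dict)

    # Workers need some inter-op threads to make progress on small machines.
    worker_config = tf.compat.v1.ConfigProto()
    if multiprocessing.cpu_count() < num_workers + 1:
        worker_config.inter_op_parallelism_threads = num_workers + 1

    servers = []  # keep references so the servers stay alive
    for i in range(num_workers):
        servers.append(tf.distribute.Server(
            cluster_spec, job_name="worker", task_index=i,
            config=worker_config, protocol="grpc"))
    for i in range(num_ps):
        servers.append(tf.distribute.Server(
            cluster_spec, job_name="ps", task_index=i, protocol="grpc"))

    resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
        cluster_spec, rpc_layer="grpc")
    return resolver, servers


# Lets worker and ps failures be reported to the coordinator.
os.environ["GRPC_FAIL_FAST"] = "use_caller"

NUM_WORKERS = 3
NUM_PS = 2
cluster_resolver, servers = create_in_process_cluster(NUM_WORKERS, NUM_PS)
```

The returned cluster_resolver is what a ParameterServerStrategy would later be constructed from.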
Instantiate a ParameterServerStrategy
Before you dive into the training code, instantiate a tf.distribute.ParameterServerStrategy object. Note that this is needed regardless of whether you proceed with Model.fit or a custom training loop. The variable_partitioner argument is explained in the Variable sharding section.
To use GPUs for training, allocate GPUs that are visible to each worker. ParameterServerStrategy uses all the available GPUs on each worker, with the restriction that all workers must have the same number of GPUs available.
倿°ã®ã·ã£ãŒãã£ã³ã°
倿°ã®ã·ã£ãŒãã£ã³ã°ãšã¯ã倿°ãã·ã£ãŒããšåŒã°ããè€æ°ã®å°ããªå€æ°ã«åå²ããããšã§ãã倿°ã®ã·ã£ãŒãã£ã³ã°ã¯ããããã®ã·ã£ãŒãã«ã¢ã¯ã»ã¹ããéã®ãããã¯ãŒã¯è² è·ã忣ããã®ã«åœ¹ç«ã€å ŽåããããŸãããŸãã1 å°ã®ãã·ã³ã®ã¡ã¢ãªã«åãŸããªãéåžžã«å€§ããªåã蟌ã¿ã䜿çšããå Žåãªã©ãéåžžã®å€æ°ã®èšç®ãšæ ŒçŽãè€æ°ã®ãã©ã¡ãŒã¿ãµãŒããŒã«åæ£ããããšãã§ããŸãã
倿°ã·ã£ãŒãã£ã³ã°ãæå¹ã«ããã«ã¯ãParameterServerStrategy ãªããžã§ã¯ããæ§ç¯ããéã« variable partitioner ãæž¡ããŸããvariable_partitioner ã¯ã倿°ãäœæããããã³ã«åŒã³åºããã倿°ã®å次å
ã«æ²¿ã£ãŠã·ã£ãŒãã®æ°ãè¿ãããšãæåŸ
ãããŸããtf.distribute.experimental.partitioners.MinSizePartitioner ãªã©ãããã«äœ¿ãã variable_partitioner ãããã€ãæäŸãããŠããŸããtf.distribute.experimental.partitioners.MinSizePartitioner ã®ãããªãµã€ãºããŒã¹ã®ããŒãã£ã·ã§ããŒã䜿çšããŠãã¢ãã«ã®ãã¬ãŒãã³ã°éåºŠã«æªåœ±é¿ãåãŒãå¯èœæ§ã®ããå°ããªå€æ°ã®ããŒãã£ã·ã§ãã³ã°ãé¿ããããšããå§ãããŸãã
variable_partitioner ãæž¡ãããStrategy.scope ã®ããäžã«å€æ°ãäœæãããšããã®å€æ°ã¯ variables ããããã£ãæã€ã³ã³ããã¿ã€ãã«ãªããã·ã£ãŒãã®ãªã¹ããžã®ã¢ã¯ã»ã¹ãæäŸããŸããã»ãšãã©ã®å Žåããã®ã³ã³ããã¯ããã¹ãŠã®ã·ã£ãŒããé£çµããããšã«ãã£ãŠèªåçã«ãã³ãœã«ã«å€æãããã®ã§ãéåžžã®å€æ°ãšããŠäœ¿çšã§ããŸããäžæ¹ãtf.nn.embedding_lookup ãªã©ã®äžéšã® TensorFlow ã¡ãœããã¯ããã®ã³ã³ããã¿ã€ãã®å¹ççãªå®è£
ãæäŸãããããã®ã¡ãœããã§ã¯èªåé£çµãåé¿ãããŸãã
詳现ã«ã€ããŠã¯ãtf.distribute.ParameterServerStrategy ã® API ããã¥ã¡ã³ããåç
§ããŠãã ããã
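To make the "number of shards along each dimension" contract concrete, the sketch below calls a MinSizePartitioner by hand. Normally the strategy invokes the partitioner for you; calling it directly, and the specific shapes and the max_shards value of 2, are choices made only for this illustration.

```python
import tensorflow as tf

partitioner = tf.distribute.experimental.partitioners.MinSizePartitioner(
    min_shard_bytes=256 << 10,  # each shard should hold at least 256 KiB
    max_shards=2)               # for example, one shard per parameter server

# A 1024 x 1024 float32 table takes 4 MiB, large enough to be split.
large = partitioner(tf.TensorShape([1024, 1024]), tf.float32)

# A 4-element float32 vector takes 16 bytes, far below min_shard_bytes.
small = partitioner(tf.TensorShape([4]), tf.float32)

print("shards per dimension (large):", large)
print("shards per dimension (small):", small)
```

The large table is split along its first dimension, while the small vector stays in one piece, which is the behavior that makes size-based partitioners a reasonable default.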
Training with Model.fit
Keras provides an easy-to-use training API via Model.fit. It handles the training loop under the hood, with the flexibility of an overridable train_step and of callbacks that provide functionality such as checkpoint saving or summary saving for TensorBoard. With Model.fit, the same training code can be used with other strategies by simply swapping the strategy object.
Input data
Keras Model.fit with tf.distribute.ParameterServerStrategy can take input data in the form of a tf.data.Dataset, a tf.distribute.DistributedDataset, or a tf.keras.utils.experimental.DatasetCreator. Dataset is the recommended option for ease of use. If you encounter memory issues using Dataset, however, you may need to use DatasetCreator with a callable dataset_fn argument (refer to the tf.keras.utils.experimental.DatasetCreator API documentation for details).
If you transform your data into a tf.data.Dataset, you should use Dataset.shuffle and Dataset.repeat, for the following reasons:
- Keras Model.fit with parameter server training assumes that each worker receives the same dataset, except that it may be shuffled differently. Calling Dataset.shuffle therefore ensures more even iteration over the data.
- Because workers do not synchronize, they may finish processing their datasets at different times. The easiest way to define epochs with parameter server training is therefore to use Dataset.repeat, which repeats a dataset indefinitely when called without an argument, and to specify the steps_per_epoch argument in the Model.fit call.
Refer to the "Training workflows" section of the tf.data guide for more details on shuffle and repeat.
If you instead create your dataset with tf.keras.utils.experimental.DatasetCreator, the code in dataset_fn is invoked on the input device, which is usually the CPU, on each of the worker machines.
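The shuffle-and-repeat recommendation can be sketched with tf.data alone. The toy tensor shapes and the batch size of 64 are arbitrary values chosen for this example.

```python
import tensorflow as tf

global_batch_size = 64

# Ten toy (features, label) examples.
x = tf.random.uniform((10, 10))
y = tf.random.uniform((10,))

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(10)   # workers see the same data, in different orders
    .repeat()      # never runs out; epochs are defined by steps_per_epoch
    .batch(global_batch_size)
    .prefetch(2))

# Because the dataset repeats, a full batch is available even though
# there are only ten distinct examples.
features, labels = next(iter(dataset))
print(features.shape, labels.shape)
```

Since the dataset never ends, the length of an epoch has to come from the steps_per_epoch argument rather than from the data running out.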
Model construction and compiling
Now, create a tf.keras.Model (a trivial tf.keras.models.Sequential model for demonstration purposes), followed by a Model.compile call to incorporate components such as an optimizer and metrics, and other parameters such as steps_per_execution.
Callbacks and training
Before you call Keras Model.fit for the actual training, prepare any needed callbacks for common tasks, such as:
- tf.keras.callbacks.ModelCheckpoint: saves the model at a certain frequency, such as after every epoch.
- tf.keras.callbacks.BackupAndRestore: provides fault tolerance by backing up the model and the current epoch number in case the cluster becomes unavailable (for example, because of an abort or a preemption). You can then restore the training state upon a restart from a job failure, and continue training from the beginning of the interrupted epoch.
- tf.keras.callbacks.TensorBoard: periodically writes model logs to summary files that can be visualized in the TensorBoard tool.
Note: Due to performance considerations, custom callbacks cannot override batch-level callbacks when used with ParameterServerStrategy. Modify your custom callbacks to make epoch-level calls instead, and adjust steps_per_epoch to a suitable value. In addition, steps_per_epoch is a required argument for Model.fit when used with ParameterServerStrategy.
Direct usage with ClusterCoordinator (optional)
Even if you choose the Model.fit training path, you can optionally instantiate a tf.distribute.coordinator.ClusterCoordinator object to schedule other functions you would like to be executed on the workers. Refer to the Training with a custom training loop section for more details and examples.
Training with a custom training loop
Using custom training loops with tf.distribute.Strategy provides great flexibility for defining training loops. With the ParameterServerStrategy defined above (as strategy), you use a tf.distribute.coordinator.ClusterCoordinator to dispatch the execution of training steps to remote workers.
Then, you create a model, define a dataset, and define a step function, as you would in a training loop with other tf.distribute.Strategy classes. You can find more details in the Custom training with tf.distribute.Strategy tutorial.
To ensure efficient dataset prefetching, use the recommended distributed dataset creation APIs described in the Dispatch training steps to remote workers section below. Also, make sure to call Strategy.run inside worker_fn to take full advantage of the GPUs allocated to workers. The rest of the steps are the same for training with or without GPUs.
You create these components in the following steps.
Set up the data
First, write a function that creates a dataset.
If you would like to preprocess the data with Keras preprocessing layers or TensorFlow Transform layers, create these layers outside the dataset_fn and under Strategy.scope, as you would for any other Keras layers. This is because the dataset_fn is wrapped into a tf.function and then executed on each worker to generate the data pipeline.
If you do not follow this procedure, creating the layers might create TensorFlow state that is lifted out of the tf.function to the coordinator. Accessing it on the workers would then incur repeated RPC calls between the coordinator and the workers, and cause a significant slowdown.
Placing the layers under Strategy.scope instead creates them on all workers. You then apply the transformation inside the dataset_fn via tf.data.Dataset.map. Refer to Data preprocessing in the Distributed input tutorial for more information on data preprocessing with distributed input.
Generate toy examples in a dataset.
Then, create the training dataset wrapped in a dataset_fn.
Build the model
Next, create the model and other objects. Make sure to create all variables under Strategy.scope.
Confirm that the use of FixedShardsPartitioner split all variables into two shards and that each shard was assigned to a different parameter server.
Define the training step
Third, create the training step wrapped into a tf.function.
In the training step function, calling Strategy.run and Strategy.reduce in step_fn supports multiple GPUs per worker. If the workers have GPUs allocated, Strategy.run distributes the datasets across multiple replicas (GPUs). Their parallel calls to tf.nn.compute_average_loss() compute the average of the loss across the replicas (GPUs) of one worker, independent of the total number of workers.
Dispatch training steps to remote workers
After all the computations are defined by ParameterServerStrategy, you use the tf.distribute.coordinator.ClusterCoordinator class to create resources and distribute the training steps to remote workers.
First, create a ClusterCoordinator object and pass in the strategy object.
Then, create a per-worker dataset and an iterator using the ClusterCoordinator.create_per_worker_dataset API, which replicates the dataset to all workers. In per_worker_dataset_fn, wrapping the dataset_fn into strategy.distribute_datasets_from_function is recommended, so that prefetching to GPUs works efficiently and seamlessly.
The final step is to distribute the computation to remote workers using ClusterCoordinator.schedule:
- The schedule method enqueues a tf.function and returns a future-like RemoteValue immediately. The queued functions are dispatched to remote workers in background threads, and the RemoteValue is filled asynchronously.
- The join method (ClusterCoordinator.join) can be used to wait until all scheduled functions have been executed.
You can fetch the result of a RemoteValue with its fetch method.
Alternatively, you can launch all steps and do something else while waiting for them to complete.
For the complete training and serving workflow for this particular example, refer to the accompanying test.
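The scheduling steps described in this section can be sketched end to end as follows. This is a deliberately tiny setup under stated assumptions: a one-worker, one-parameter-server in-process cluster, a helper named free_port, and a counter variable with a step_fn that stands in for a real training step. It is not the tutorial's original model code.

```python
import os
import socket

import tensorflow as tf

# Lets worker and ps failures be reported to the coordinator.
os.environ["GRPC_FAIL_FAST"] = "use_caller"


def free_port():
    """Asks the OS for a currently free port."""
    with socket.socket() as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]


# A toy in-process cluster: one worker and one parameter server.
cluster_spec = tf.train.ClusterSpec({
    "worker": [f"localhost:{free_port()}"],
    "ps": [f"localhost:{free_port()}"],
})
worker_config = tf.compat.v1.ConfigProto()
worker_config.inter_op_parallelism_threads = 2  # enough threads on small hosts
servers = [
    tf.distribute.Server(cluster_spec, job_name="worker", task_index=0,
                         config=worker_config, protocol="grpc"),
    tf.distribute.Server(cluster_spec, job_name="ps", task_index=0,
                         protocol="grpc"),
]
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, rpc_layer="grpc")

strategy = tf.distribute.ParameterServerStrategy(resolver)
coordinator = tf.distribute.coordinator.ClusterCoordinator(strategy)

with strategy.scope():
    counter = tf.Variable(0.0)  # placed on the parameter server


@tf.function
def step_fn():
    # Stand-in for a real training step: update a variable, return a value.
    counter.assign_add(1.0)
    return counter.read_value()


# schedule returns immediately with a future-like RemoteValue.
remote_values = [coordinator.schedule(step_fn) for _ in range(5)]

# join blocks until every scheduled function has been executed.
coordinator.join()

last_result = remote_values[-1].fetch()  # copy the result to the coordinator
final_count = float(counter.numpy())
print("counter after 5 scheduled steps:", final_count)
```

Between the schedule calls and join, the coordinator is free to do other work, which is the second usage pattern mentioned above.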
More about dataset creation
The dataset in this tutorial is created using the ClusterCoordinator.create_per_worker_dataset API. It creates one dataset per worker and returns a container object. You can call the iter method on it to create a per-worker iterator. The per-worker iterator contains one iterator per worker, and the slice corresponding to a worker is substituted into the input argument of the function passed to the ClusterCoordinator.schedule method before the function is executed on that worker.
The ClusterCoordinator.schedule method assumes that workers are equivalent, and thus that the datasets on different workers are the same (except that they may be shuffled differently). Because of this, it is also recommended to repeat datasets and schedule a finite number of steps, instead of relying on receiving an OutOfRangeError from a dataset.
Another important note is that tf.data datasets do not support implicit serialization and deserialization across task boundaries. It is therefore important to create the whole dataset inside the function passed to ClusterCoordinator.create_per_worker_dataset. The create_per_worker_dataset API can also directly take a tf.data.Dataset or a tf.distribute.DistributedDataset as input.
Evaluation
The two main approaches to performing evaluation with tf.distribute.ParameterServerStrategy training are inline evaluation and sidecar evaluation. Each has its own pros and cons, as described below. The inline evaluation method is recommended if you do not have a preference. For users of Model.fit, Model.evaluate uses inline (distributed) evaluation under the hood.
Inline evaluation
In inline evaluation, the coordinator alternates between training and evaluation.
Inline evaluation has several benefits. For example:
- It can support large evaluation models and evaluation datasets that a single task cannot hold.
- The evaluation results can be used to make decisions about training the next epoch, for example whether to stop training early.
There are two ways to implement inline evaluation: direct evaluation and distributed evaluation.
- Direct evaluation: for small models and evaluation datasets, the coordinator can run evaluation directly on the distributed model, with the evaluation dataset on the coordinator.
- Distributed evaluation: for large models or datasets that are infeasible to run directly on the coordinator, the coordinator task can distribute evaluation tasks to the workers via the ClusterCoordinator.schedule and ClusterCoordinator.join methods.
Enabling exactly-once evaluation
The schedule and join methods of tf.distribute.coordinator.ClusterCoordinator do not support visitation guarantees or exactly-once semantics by default. In other words, with distributed evaluation there is no guarantee that every evaluation example in a dataset is evaluated exactly once; some may not be visited at all and some may be evaluated multiple times.
Exactly-once evaluation may be preferred to reduce the variance of evaluation across epochs, and to improve model selection done via early stopping, hyperparameter tuning, or other methods. There are different ways to enable exactly-once evaluation:
- With a Model.fit/.evaluate workflow, it can be enabled by adding an argument to Model.compile. Refer to the docs for the pss_evaluation_shards argument.
- The tf.data service API can provide exactly-once visitation for evaluation when using ParameterServerStrategy (refer to the Dynamic Sharding section of the tf.data.experimental.service API documentation).
- Sidecar evaluation provides exactly-once evaluation by default, since the evaluation happens on a single machine. However, this can be much slower than evaluation distributed across many workers.
The first option, using Model.compile, is the suggested solution for most users.
Exactly-once evaluation has some limitations:
- Writing a custom distributed evaluation loop with an exactly-once visitation guarantee is not supported. File a GitHub issue if you need support for this.
- It cannot automatically handle the computation of metrics that use the Layer.add_metric API. These should be excluded from evaluation, or reworked into Metric objects.
Sidecar evaluation
Sidecar evaluation is another method for defining and running an evaluation loop in tf.distribute.ParameterServerStrategy training. You create a dedicated evaluator task that repeatedly reads checkpoints and runs evaluation on the latest checkpoint (refer to the checkpoint guide for more details on checkpointing). The coordinator and worker tasks do not spend any time on evaluation, so for a fixed number of iterations the overall training time should be shorter than with other evaluation methods. However, it requires an additional evaluator task and periodic checkpointing to trigger evaluation.
To write an evaluation loop for sidecar evaluation, you have two options:
1. Use the tf.keras.utils.SidecarEvaluator API.
2. Create a custom evaluation loop.
Refer to the tf.keras.utils.SidecarEvaluator API documentation for more details on Option 1.
Sidecar evaluation is supported only with a single task. This means:
- It is guaranteed that each example is evaluated once. If the evaluator is preempted or restarted, it simply restarts the evaluation loop from the latest checkpoint, and the partial evaluation progress made before the restart is discarded.
- However, running evaluation on a single task implies that a full evaluation can take a long time.
- If the model is too large to fit into the evaluator's memory, single sidecar evaluation is not applicable.
Another caveat is that the tf.keras.utils.SidecarEvaluator implementation, and a custom evaluation loop written in the same style, may skip some checkpoints. This is because they always pick up the latest available checkpoint, and the training cluster can produce multiple checkpoints during one evaluation epoch. You can write a custom evaluation loop that evaluates every checkpoint, but that is not covered in this tutorial. On the other hand, the evaluator may sit idle if checkpoints are produced less frequently than the time it takes to run an evaluation.
A custom evaluation loop provides more control over the details, such as choosing which checkpoint to evaluate, or providing additional logic to run along with evaluation.
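One possible shape of a custom sidecar evaluation loop is sketched below. To keep it self-contained, a single tf.Variable stands in for the model, the "training cluster" is simulated by saving three checkpoints up front, and names such as train_epochs, eval_weight, and evaluated are assumptions made for this example.

```python
import tempfile

import tensorflow as tf

train_epochs = 3
checkpoint_dir = tempfile.mkdtemp()

# Stand-in for the training cluster: save one checkpoint per epoch.
train_weight = tf.Variable(0.0)
manager = tf.train.CheckpointManager(
    tf.train.Checkpoint(weight=train_weight), checkpoint_dir, max_to_keep=None)
for _ in range(train_epochs):
    train_weight.assign_add(1.0)
    manager.save()

# Evaluator task: restore the newest checkpoint and evaluate it.
eval_weight = tf.Variable(0.0)
eval_checkpoint = tf.train.Checkpoint(weight=eval_weight)
evaluated = []

for latest_checkpoint in tf.train.checkpoints_iterator(
        checkpoint_dir, timeout=5):
    try:
        eval_checkpoint.restore(latest_checkpoint).expect_partial()
    except tf.errors.OpError:
        # The checkpoint may be deleted by training while it is being read.
        continue

    # A real evaluator would run its evaluation pass here.
    evaluated.append((latest_checkpoint, float(eval_weight.numpy())))

    # Stop once the checkpoint of the final epoch has been evaluated.
    if latest_checkpoint.endswith("-{}".format(train_epochs)):
        break

print(evaluated)
```

Because all three checkpoints already exist when the evaluator starts, only the newest one is evaluated, which is the checkpoint-skipping behavior described above.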
Clusters in the real world
Note: this section is not necessary for running the tutorial code on this page.
In a real production environment, you run all tasks in different processes on different machines. The simplest way to configure cluster information on each task is to set the "TF_CONFIG" environment variable and use a tf.distribute.cluster_resolver.TFConfigClusterResolver to parse "TF_CONFIG".
For a general description of the "TF_CONFIG" environment variable, refer to "Setting up the TF_CONFIG environment variable" in the Distributed training guide.
If you start your training tasks using Kubernetes or other configuration templates, these templates have likely already set "TF_CONFIG" for you.
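Set the "TF_CONFIG" environment variable
Suppose you have 3 workers and 2 parameter servers. The "TF_CONFIG" of worker 1 then describes the whole cluster and identifies the task as the worker with index 1.
The "TF_CONFIG" of the evaluator identifies the task as the evaluator.
The "cluster" part of the "TF_CONFIG" string for the evaluator is optional.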
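As a sketch of what these two settings could look like, the snippet below builds both values with the standard library. The "hostN:port" strings are placeholders to be replaced with real addresses, and the variable names are chosen for this example.

```python
import json
import os

# Placeholder addresses: 3 workers, 2 parameter servers, 1 coordinator.
cluster = {
    "worker": ["host1:port", "host2:port", "host3:port"],
    "ps": ["host4:port", "host5:port"],
    "chief": ["host6:port"],
}

# TF_CONFIG for worker 1: the full cluster plus this task's identity.
worker_1_config = {
    "cluster": cluster,
    "task": {"type": "worker", "index": 1},
}

# TF_CONFIG for the evaluator: the "cluster" entry is optional here.
evaluator_config = {
    "cluster": {"evaluator": ["host7:port"]},
    "task": {"type": "evaluator", "index": 0},
}

# Each task sets its own value before TensorFlow reads it.
os.environ["TF_CONFIG"] = json.dumps(worker_1_config)
print(os.environ["TF_CONFIG"])
```

On the evaluator machine, the same assignment would use evaluator_config instead.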
If you use the same binary for all tasks
If you prefer to run all these tasks using a single binary, your program needs to branch into the different roles at the very beginning.
For the "worker" and "ps" roles, the program starts a TensorFlow server and waits.
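The branching described above might be organized as in the following sketch. The function names pick_role, run_server, and main are hypothetical, and TensorFlow is imported inside the functions only so that the role-selection logic can be read and tried on its own. The main function is defined but not called here, because run_server blocks forever by design.

```python
import os


def pick_role(task_type):
    """Maps a TF_CONFIG task type to the role this process should play."""
    if task_type in ("worker", "ps"):
        return "server"
    if task_type == "evaluator":
        return "evaluator"
    return "coordinator"  # "chief" runs the training program


def run_server(cluster_resolver):
    """Starts a TensorFlow server for a "worker" or "ps" task and waits."""
    import tensorflow as tf  # local import keeps pick_role dependency-free

    # Lets worker and ps failures be reported to the coordinator.
    os.environ["GRPC_FAIL_FAST"] = "use_caller"

    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol=cluster_resolver.rpc_layer or "grpc",
        start=True)
    server.join()  # blocks until the process is terminated


def main():
    import tensorflow as tf

    # Reads the cluster layout and this task's identity from TF_CONFIG.
    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    role = pick_role(cluster_resolver.task_type)
    if role == "server":
        run_server(cluster_resolver)
    elif role == "evaluator":
        pass  # run sidecar evaluation here
    else:
        pass  # run the coordinator (the training program) here
```

Every task would launch the same program and call main, with only its "TF_CONFIG" value differing.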
Handling task failure
Worker failure
Both the tf.distribute.coordinator.ClusterCoordinator custom training loop and the Model.fit approach provide built-in fault tolerance for worker failure. When a worker recovers, the ClusterCoordinator invokes dataset re-creation on that worker.
Parameter server or coordinator failure
When the coordinator sees a parameter server error, however, it raises an UnavailableError or AbortedError immediately. You can restart the coordinator in this case. The coordinator itself can also become unavailable. Some tooling is therefore recommended so that training progress is not lost:
- For Model.fit, use a BackupAndRestore callback, which handles saving and restoring the progress automatically. See the Callbacks and training section above.
- For a custom training loop, checkpoint the model variables periodically, and load the model variables from a checkpoint, if there is one, before training starts. If the optimizer is checkpointed, the training progress can be inferred approximately from optimizer.iterations.
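The checkpoint-and-resume pattern for a custom training loop can be sketched as below. This is a simplified stand-in, not the tutorial's original code: a plain global_step variable plays the role of optimizer.iterations, the inner loop replaces the scheduled training steps, and the stop_after_epochs parameter exists only to simulate an interruption.

```python
import tempfile

import tensorflow as tf

steps_per_epoch = 5
num_epochs = 4


def train(checkpoint_dir, stop_after_epochs=None):
    """Trains from the latest checkpoint, if any, saving after each epoch."""
    global_step = tf.Variable(0, dtype=tf.int64)  # stand-in for optimizer.iterations
    weights = tf.Variable(tf.zeros([2]))          # stand-in for model variables

    manager = tf.train.CheckpointManager(
        tf.train.Checkpoint(step=global_step, weights=weights),
        checkpoint_dir, max_to_keep=3)
    if manager.latest_checkpoint:
        manager.checkpoint.restore(
            manager.latest_checkpoint).assert_existing_objects_matched()

    # Infer how far training had progressed from the restored step count.
    starting_epoch = int(global_step.numpy()) // steps_per_epoch

    for epoch in range(starting_epoch, num_epochs):
        if stop_after_epochs is not None and epoch >= stop_after_epochs:
            break  # simulate the coordinator going away
        for _ in range(steps_per_epoch):
            # A real loop would schedule training steps on the workers here.
            weights.assign_add([1.0, 1.0])
            global_step.assign_add(1)
        manager.save()
    return starting_epoch, int(global_step.numpy())


checkpoint_dir = tempfile.mkdtemp()
first_run = train(checkpoint_dir, stop_after_epochs=2)   # interrupted run
second_run = train(checkpoint_dir)                       # restarted run
print(first_run, second_run)
```

The restarted run picks up at the epoch following the last saved checkpoint instead of starting over.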
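Fetching a RemoteValue
Fetching a RemoteValue is guaranteed to succeed if the function was executed successfully. This is because, currently, the return value is copied to the coordinator immediately after a function is executed. If a worker fails during the copy, the function is retried on another available worker. Therefore, if you want to optimize for performance, you can schedule functions without a return value.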
Error reporting
Once the coordinator sees an error, such as an UnavailableError from a parameter server, or another application error such as an InvalidArgument from tf.debugging.check_numerics, it cancels all pending and queued functions before raising the error. Fetching their corresponding RemoteValues raises a CancelledError.
After an error has been raised, the coordinator does not raise the same error again, nor any error from the cancelled functions.
Performance improvement
You may experience performance issues when you train with tf.distribute.ParameterServerStrategy and tf.distribute.coordinator.ClusterCoordinator.
One common reason is that the parameter servers have unbalanced load and some heavily loaded parameter servers have reached capacity. There can also be multiple root causes. Some simple methods to mitigate this issue are:
1. Shard your large model variables by specifying a variable_partitioner when constructing a ParameterServerStrategy.
2. Avoid creating a hotspot variable that is required by all parameter servers in a single step, by doing both of the following:
   - Use a constant learning rate or a subclass of tf.keras.optimizers.schedules.LearningRateSchedule in optimizers. This is because, by default, the learning rate becomes a variable placed on one particular parameter server and is requested by all other parameter servers in each step.
   - Use a tf.keras.optimizers.legacy.Optimizer (the standard tf.keras.optimizers.Optimizer classes could still lead to hotspot variables).
3. Shuffle your large vocabularies before passing them to Keras preprocessing layers.
Another possible reason for performance issues is the coordinator. The implementation of schedule/join is Python-based and thus may have threading overhead. The latency between the coordinator and the workers can also be large. If this is the case:
- For Model.fit, set the steps_per_execution argument provided at Model.compile to a value larger than 1.
- For a custom training loop, pack multiple steps into a single tf.function.
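Packing several steps into one tf.function can be sketched without a cluster. In the example below, steps_per_invocation, the total variable, and the toy dataset are assumptions made for illustration; in real training the loop body would call Strategy.run with a replica function, and the packed function would be passed to ClusterCoordinator.schedule.

```python
import tensorflow as tf

steps_per_invocation = 10
total = tf.Variable(0.0)


@tf.function
def multi_step_fn(iterator):
    # One dispatch from the coordinator now covers ten steps, so the
    # per-call scheduling overhead is paid once instead of ten times.
    for _ in tf.range(steps_per_invocation):
        batch = next(iterator)
        total.assign_add(tf.reduce_sum(batch))  # stand-in for a training step


# A repeating toy dataset: every batch holds two ones.
dataset = tf.data.Dataset.from_tensor_slices(tf.ones([8])).repeat().batch(2)
multi_step_fn(iter(dataset))
print(float(total.numpy()))
```

The trade-off is coarser granularity: a retried call repeats all packed steps, and results only become visible once the whole function has finished.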
As the library is optimized further, most users will hopefully no longer need to pack steps manually.
In addition, as explained in the Handling task failure section above, you can schedule functions without a return value to improve performance.
Known limitations
Most of the known limitations are already covered in the sections above. This section provides a summary.
ParameterServerStrategy general
- os.environ["GRPC_FAIL_FAST"] = "use_caller" is needed on every task, including the coordinator, to make fault tolerance work properly.
- Synchronous parameter server training is not supported.
- It is usually necessary to pack multiple steps into a single function to achieve optimal performance.
- Loading a saved_model that contains sharded variables via tf.saved_model.load is not supported. Note: loading such a saved_model using TensorFlow Serving is expected to work (refer to the serving tutorial for details).
- Recovering from a parameter server failure without restarting the coordinator task is not supported.
- The commonly used tf.lookup.StaticHashTable, which is employed by some Keras preprocessing layers such as tf.keras.layers.IntegerLookup, tf.keras.layers.StringLookup, and tf.keras.layers.TextVectorization, should be placed under Strategy.scope. Otherwise, the resources are placed on the coordinator, and lookup RPCs from the workers to the coordinator affect performance.
Model.fit specifics
- The steps_per_epoch argument is required in Model.fit. Select a value that provides appropriate intervals within an epoch.
- For performance reasons, ParameterServerStrategy does not support custom callbacks that have batch-level calls. Convert those calls into epoch-level calls with a suitably chosen steps_per_epoch, so that they are called every steps_per_epoch steps. Built-in callbacks are not affected, because their batch-level calls have been modified to be performant. Support for batch-level calls with ParameterServerStrategy is planned.
- For the same reason, unlike with other strategies, progress bars and metrics are logged only at epoch boundaries.
- run_eagerly is not supported.
Custom training loop specifics
- ClusterCoordinator.schedule does not support visitation guarantees for a dataset in general, although a visitation guarantee for evaluation is possible through Model.fit/.evaluate. See Enabling exactly-once evaluation.
- When ClusterCoordinator.create_per_worker_dataset is used with a callable as input, the whole dataset must be created inside the function passed to it.
- tf.data.Options is ignored in a dataset created by ClusterCoordinator.create_per_worker_dataset.