Launching Multi-Node Training from a Jupyter Environment
Using the notebook_launcher to run Accelerate from inside a Jupyter Notebook
General Overview
This notebook covers how to run the cv_example.py script as a Jupyter Notebook and train it on a distributed system. It also covers the few specific requirements needed to ensure your environment is configured properly and your data is prepared properly, and finally how to launch training.
Configuring the Environment
Before any training can be performed, an accelerate config file must exist in the system. Usually this can be done by running the following in a terminal:
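```
accelerate config
```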
However, if general defaults are fine and you are not running on a TPU, Accelerate has a utility to quickly write your GPU configuration into a config file via write_basic_config.
The following cell will restart Jupyter after writing the configuration, since CUDA code was called to do so. CUDA can't be initialized more than once: once for the single-GPU use that notebooks default to, and then it would be initialized again when notebook_launcher is called. It's fine to debug in the notebook and make calls to CUDA, but remember that in order to finally train, a full cleanup and restart will need to be performed, such as what is shown below:
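```python
import os
from accelerate.utils import write_basic_config

# Write a default config file for the current GPU setup
write_basic_config()
# Restart the kernel so CUDA is left uninitialized for notebook_launcher
os._exit(0)
```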
Preparing the Dataset and Model
Next you should prepare your dataset. As mentioned earlier, great care should be taken when preparing the DataLoaders and model to make sure that nothing is put on any GPU.
If you do need code that touches the GPU, it is recommended to put that specific code into a function and call it from within the notebook launcher interface, which will be shown later.
Make sure the dataset is downloaded based on the directions here.
First we'll create a function to extract the class name based on a file:
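A sketch of such a function, assuming filenames like beagle_32.jpg (the helper name extract_label is our own choice here):

```python
import os
import re

def extract_label(fname):
    # e.g. "images/beagle_32.jpg" -> "beagle"
    stem = fname.split(os.path.sep)[-1]
    return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]

print(extract_label("beagle_32.jpg"))
```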
In this case, the label is beagle.
Next we'll create a Dataset class:
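A sketch of one such class, returning a dict of image and label (the class name and field names are our own choices):

```python
from PIL import Image
from torch.utils.data import Dataset

class PetsDataset(Dataset):
    def __init__(self, file_names, image_transform=None, label_to_id=None):
        self.file_names = file_names
        self.image_transform = image_transform
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fname = self.file_names[idx]
        image = Image.open(fname).convert("RGB")
        if self.image_transform is not None:
            image = self.image_transform(image)
        label = extract_label(fname)
        if self.label_to_id is not None:
            label = self.label_to_id[label]
        return {"image": image, "label": label}
```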
And build our dataset.
Note: This will be stored inside a function, as we'll be setting our seed during training.
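A sketch of that function, assuming the helpers above and an images folder on disk (the name get_dataloaders, the path, and the 80/20 split are our own choices):

```python
import os
import numpy as np
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor

def get_dataloaders(batch_size: int = 64, data_dir: str = "../../images"):
    # Gather the filenames and build the label vocabulary
    fnames = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith(".jpg")]
    labels = sorted({extract_label(f) for f in fnames})
    label_to_id = {label: i for i, label in enumerate(labels)}

    # The seed is set inside the training function, so building the split
    # there keeps it identical on every process
    random_perm = np.random.permutation(len(fnames))
    cut = int(0.8 * len(fnames))
    train_split, eval_split = random_perm[:cut], random_perm[cut:]

    train_tfm = Compose([RandomResizedCrop((224, 224), scale=(0.5, 1.0)), ToTensor()])
    eval_tfm = Compose([Resize((224, 224)), ToTensor()])

    train_dataset = PetsDataset([fnames[i] for i in train_split], train_tfm, label_to_id)
    eval_dataset = PetsDataset([fnames[i] for i in eval_split], eval_tfm, label_to_id)

    # Note: nothing here touches the GPU
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
    eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size * 2)
    return train_dataloader, eval_dataloader, label_to_id
```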
Writing the Training Function
Now we can build our training loop. notebook_launcher works by passing in a function to call that will be run across the distributed system.
Here is a basic training loop for our animal classification problem:
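A condensed sketch of such a loop (the timm resnet50d backbone, epoch count, and learning-rate schedule are illustrative choices):

```python
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)

    # Build everything CUDA-related *inside* the launched function
    train_dataloader, eval_dataloader, label_to_id = get_dataloaders(batch_size)
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

    # Fine-tune only the classifier head
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True

    optimizer = torch.optim.Adam(model.parameters(), lr=3e-2 / 25)
    lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader)
    )

    # Let Accelerate place everything on the right devices
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    for epoch in range(5):
        model.train()
        for batch in train_dataloader:
            outputs = model(batch["image"])
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        accurate, num_elems = 0, 0
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(batch["image"])
            predictions = outputs.argmax(dim=-1)
            # Gather predictions across all processes before measuring accuracy
            predictions, labels = accelerator.gather_for_metrics((predictions, batch["label"]))
            accurate += (predictions == labels).sum().item()
            num_elems += labels.numel()
        accelerator.print(f"epoch {epoch}: accuracy {100 * accurate / num_elems:.2f}%")
```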
All that's left is to use the notebook_launcher.
We pass in the function, the arguments (as a tuple), and the number of processes to train on (see the documentation for more information).
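For example, on a machine with two GPUs:

```python
from accelerate import notebook_launcher

# The argument values here are illustrative
args = ("fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=2)
```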
And that's it!
Conclusion
This notebook showed how to perform distributed training from inside a Jupyter Notebook. Some key notes to remember:
- Make sure to save any code that uses CUDA (or CUDA imports) for the function passed to notebook_launcher.
- Set num_processes to the number of devices used for training (such as the number of GPUs, CPUs, or TPUs).