Introduction to colab
Kevin Murphy, August 2021.
Colab is Google's version of Jupyter notebooks, but has the following advantages:
it runs in the cloud, not locally, so you can use it from a cheap laptop, such as a Chromebook.
The notebook is saved in your Google drive, so you can share your notebook with someone else and work on it collaboratively.
it has nearly all of the packages you need for doing ML pre-installed
it gives you free access to a GPU or TPU
it has a file editor, so you can separate your code from the output of your code, as with other IDEs, such as Jupyter lab.
it has various other useful features, such as collapsible sections (cf. code folding), and ways to specify parameters to your functions via various GUI widgets for use by non-programmers. (You can automatically execute parameterized notebooks with different parameters using papermill.)
More details can be found in the official introduction. Below we describe a few more tips and tricks, focusing on methods that I have found useful when developing the book. (More advanced tricks can be found in this blog post and this blog post.)
How to import and use standard libraries
Colab comes with most of the packages we need pre-installed. You can see them all using this command.
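One way to get a similar list from Python itself is sketched below, using only the standard-library importlib.metadata module. The helper name installed_packages is an assumption made for illustration, and the output depends entirely on the environment it runs in.

```python
from importlib import metadata


def installed_packages():
    """Return a name-sorted list of (name, version) pairs for installed distributions."""
    pairs = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.metadata.get("Version", "unknown")
        except Exception:
            continue  # skip distributions whose metadata cannot be read
        if name:
            pairs.add((name, version))
    return sorted(pairs, key=lambda pair: pair[0].lower())


# Show the first few entries; the full list can be long.
for name, version in installed_packages()[:10]:
    print(f"{name}=={version}")
```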
To install a new package called 'foo', use the following (see this page for details):
Numpy
Pandas
Sklearn
JAX
Tensorflow
PyTorch
Plotting
Colab has excellent support for plotting. We give some examples below.
Static plots
Colab lets you make static plots using matplotlib, as shown below. Note that plots are displayed inline by default, so %matplotlib inline is not needed.
Seaborn
Seaborn is a library that makes matplotlib results look prettier. We can also update font size for plots, to make them more suitable for inclusion in papers.
Interactive plots
Colab also lets you create interactive plots using various javascript libraries - see here for details.
Below we illustrate how to use the bokeh library to create an interactive plot of a pandas time series, where if you mouse over the plot, it shows the corresponding (x,y) coordinates. (Another option is plotly.)
We can also make plots that you can pan and zoom into.
Viewing an image file
You can either use PIL or OpenCV to display (and manipulate) images. According to this notebook, OpenCV is faster, but for a small number of images, it doesn't really matter.
Visualizing arrays
If you use imshow, be careful of aliasing which can occur for certain figure sizes.
You can solve this by specifying interpolation=nearest:
Alternatively, you can call matshow, which is an alias for imshow with interpolation=nearest:
Graphviz
You can use graphviz to lay out the nodes of a graph and draw its structure.
Progress bar
Filing system issues
Details here:
Many other sources.
Accessing local files
Clicking on the file folder icon on the left hand side of colab lets you browse local files. Right clicking on a filename lets you download it to your local machine. Double clicking on a file will open it in the file viewer/ editor, which appears on the right hand side.
The result should look something like this:
You can also use standard unix commands to manipulate files, as we show below.
However, !cd does not work. You need to use the magic %cd.
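The reason is that each ! command runs in its own short-lived child shell, so a directory change made there is lost when the shell exits, whereas %cd changes the working directory of the notebook's own Python process. The sketch below imitates both cases with the standard library; the scratch directory is a temporary one created just for the demonstration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # scratch directory, created only for this demo

# Like `!cd`: the directory changes only inside a child shell that exits at once.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
cwd_after_subshell = os.getcwd()

# Like `%cd`: the directory of the current Python process changes.
os.chdir(target_dir)
cwd_after_chdir = os.getcwd()

unchanged = cwd_after_subshell == start_dir
moved = os.path.realpath(cwd_after_chdir) == os.path.realpath(target_dir)
print(unchanged, moved)  # True True

# Clean up: go back to where we started and remove the scratch directory.
os.chdir(start_dir)
os.rmdir(target_dir)
```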
To make a new (local) file in colab's editor, first create the file with the operating system, and then view it using colab.
If you make changes to a file containing code, the new version of the file will not be noticed unless you use the magic below.
Syncing with Google drive
Files that you generate in, or upload to, colab are ephemeral, since colab is a temporary environment with an idle timeout of 90 minutes and an absolute timeout of 12 hours (24 hours for Colab Pro). To save any files permanently, you need to mount your google drive folder as we show below. (Executing this command will open a new window in your browser - you need to cut and paste the password that is shown into the prompt box.)
To ensure that local changes are detected by colab, use this piece of magic.
Uploading data to colab from your local machine
Downloading data from colab to your local machine
Loading data from the web into colab
You can use wget
--2021-07-19 17:42:46-- https://raw.githubusercontent.com/probml/probml-data/main/data/timemachine.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 178887 (175K) [text/plain]
Saving to: ‘timemachine.txt’
timemachine.txt 100%[===================>] 174.69K --.-KB/s in 0.03s
2021-07-19 17:42:46 (6.65 MB/s) - ‘timemachine.txt’ saved [178887/178887]
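The same download can also be done from Python. The sketch below uses only the standard library; the helper name fetch is an assumption made for illustration, and the commented usage line reuses the URL from the log above.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, dest):
    """Download `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk in chunks
    return dest


# Example usage (needs network access, so it is left commented out):
# fetch("https://raw.githubusercontent.com/probml/probml-data/main/data/timemachine.txt",
#       "timemachine.txt")
```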
Loading code from the web into colab
We can also download python code and run it locally.
Viewing all your notebooks
You can see the list of colab notebooks that you have saved as shown below.
Working with github
You can open any jupyter notebook stored in github in a colab by replacing https://github.com/probml/.../intro.ipynb with https://colab.research.google.com/github/probml/.../intro.ipynb (see this blog post).
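The URL rewrite is a simple prefix substitution, sketched below. The function name github_to_colab_url and the example URL are hypothetical, chosen only to illustrate the pattern.

```python
GITHUB_PREFIX = "https://github.com/"
COLAB_PREFIX = "https://colab.research.google.com/github/"


def github_to_colab_url(url):
    """Turn a github.com notebook URL into the matching Colab URL."""
    if not url.startswith(GITHUB_PREFIX):
        raise ValueError(f"not a github.com URL: {url}")
    # Keep everything after the github.com prefix and prepend the Colab prefix.
    return COLAB_PREFIX + url[len(GITHUB_PREFIX):]


# Hypothetical notebook location, for illustration only.
print(github_to_colab_url("https://github.com/someuser/somerepo/blob/main/demo.ipynb"))
# https://colab.research.google.com/github/someuser/somerepo/blob/main/demo.ipynb
```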
It is possible to download code (or data) from github into a local directory on this virtual machine. It is also possible to upload local files back to github, although that is more complex. See details below.
Cloning a repo from github
You can clone a public github repo into your local colab VM, as we show below, using the repo for this book as an example. (To clone a private repo, you need to specify your password, as explained here. Alternatively you can use the ssh method we describe below.)
We can run any script as shown below. (Note we first have to define the environment variable for where the figures will be stored.)
We can also import code, as we show below.
Pushing local files back to github
You can easily save your entire colab notebook to github by choosing 'Save a copy in github' under the File menu in the top left. But if you want to save individual files (eg code that you edited in the colab file editor, or a bunch of images or data files you created), the process is more complex. There are several possible methods, described here and here. Below we describe two other solutions.
Use the local terminal
Suppose you clone a repo using something like
This will create a folder called content/pyprobml. You can access this in colab as usual. Now suppose you want to save your edits, eg to a file called foo.txt. Follow the steps below.
Open a terminal window inside colab. (Or run the commands below in the colab notebook, but prefixed with !. For some reason this fails with the final push command...). Then do the following
git config --global user.email "[email protected]"
git config --global user.name "Kevin Murphy"
cd /content/pyprobml
echo 'this is a test' > foo.txt  # if you haven't yet modified the file
git add foo.txt
git commit -m "message"
git push
Github will ask for your username and password. Instead of a password, enter your personal access token.
Deprecated method
To avoid having to type your PAT every time, you can use the method below. However, as of May 2023, this no longer seems to work.
You first need to do some setup to create SSH keys on your current colab VM (virtual machine), manually add the keys to your github account, and then copy the keys to your mounted google drive so you can reuse the same keys in the future. This only has to be done once.
After setup, you can use the git_ssh function we define below to securely execute git commands. This works by copying your SSH keys from your google drive to the current colab VM, executing the git command, and then deleting the keys from the VM for safety.
To get started, run these commands in your colab. (The commands need to be uncommented.) The cat command will display your public key in the colab window. Cut and paste this and manually add to your github account following these instructions.
Test it worked.
Host key verification failed.
Finally, save the generated keys to your Google drive
Let us check that we can see our SSH keys in our mounted google drive.
The following function lets you securely run a git command via SSH. It copies the keys from your google drive to the local VM, executes the command, then removes the keys.
--2023-05-26 21:45:47-- https://raw.githubusercontent.com/probml/pyprobml/master/deprecated/scripts/colab_utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2267 (2.2K) [text/plain]
Saving to: ‘colab_utils.py’
colab_utils.py 100%[===================>] 2.21K --.-KB/s in 0s
2023-05-26 21:45:47 (27.1 MB/s) - ‘colab_utils.py’ saved [2267/2267]
Below we clone the pyprobml repo to this colab VM using our github credentials, so we can later check stuff back in. This is just an example - you should edit the reponame, username and email variables.
Let's check that we can see this repo in our local drive.
Now we create a dummy file inside our local copy of this repo, and push it back to the github (public) version of the repo.
We can check that it worked by visiting this page on github (note the time stamp on the top right):
Finally we clean up our mess.
Software engineering tools
Pros and Cons of notebooks
Joel Grus has argued that notebooks are bad for developing complex software, because they encourage creating monolithic notebooks instead of factoring out code into separate, well-tested files.
Jeremy Howard has responded to Joel's critiques here. In particular, the FastAI organization has created nbdev which has various tools that make notebooks more useful.
Recommended workflow
My recommended workflow is to use the notebook as a development environment, and then convert the code to a set of files that can run from the command line independently of the notebook (which is useful for parallel experiments, etc).
In other words, develop your code in the colab in the usual way, and when it is working, factor out the core code into separate files. You can edit these files locally in the colab editor or some other file editor (see below), and then call them from a colab cell. This lets you separate your code from the output of your code, as with other IDEs, such as Jupyter lab.
To run a function defined in a local file inside colab, just import it. For example, suppose we have created the file /content/pyprobml/scripts/fit_flax.py; we can use this idiom to run its test suite:
If you make local edits, you want to be sure that you always import the latest version of the file (not a cached version). So you need to use this piece of colab magic first:
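The underlying issue is that Python caches imported modules, so a second import statement does not re-read the file. The sketch below shows this with the standard-library importlib.reload; the module name scratch_mod is hypothetical and is written to a temporary directory. IPython's autoreload extension (%load_ext autoreload followed by %autoreload 2) automates the reload step; that this is the exact magic intended here is an assumption.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module on disk (hypothetical name, for illustration only).
workdir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(workdir))
module_file = workdir / "scratch_mod.py"
module_file.write_text("VALUE = 1\n")

import scratch_mod
first = scratch_mod.VALUE

# Edit the file. A plain re-import just returns the cached module object.
module_file.write_text("VALUE = 22\n")
import scratch_mod
stale = scratch_mod.VALUE

# An explicit reload re-executes the edited source.
importlib.invalidate_caches()
scratch_mod = importlib.reload(scratch_mod)
fresh = scratch_mod.VALUE

print(first, stale, fresh)  # 1 1 22
```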
When the code is running, save it to github (see details above).
File editors
Colab editor
Colab has a simple file editor, illustrated below for an example file. This lets you separate your code from the output of your code, as with other IDEs, such as Jupyter lab.
You can click on a class name when holding Ctrl and the source code will open in the file viewer. (h/t Amit Choudhary's blog.)
VScode
The default colab file editor is very primitive. Fortunately you can run VScode in your browser and connect it to the colab machine via ssh, as explained in this article and this article.
The above method can be quite 'laggy'. An alternative is to access the VM running colab directly via ssh, using these instructions. You can then run VScode locally (on your laptop) and connect to the remote machine using these instructions.
Avoiding problems with global state
One of the main drawbacks of colab is that all variables are globally visible, so you may accidentally write a function that depends on the current state of the notebook, but which is not passed in as an argument. Such a function may fail if used in a different context.
One solution to this is to put most of your code in files, and then have the notebook simply import the code and run it, like you would from the command line. Then you can always run the notebook from scratch, to ensure consistency.
Another solution is to use the localscope package, which can catch some of these errors.
Collecting localscope
Downloading https://files.pythonhosted.org/packages/71/29/c3010c332c7175fe48060b1113e32f2831bab2202428d2cc29686685302f/localscope-0.1.3.tar.gz
Building wheels for collected packages: localscope
Building wheel for localscope (setup.py) ... done
Created wheel for localscope: filename=localscope-0.1.3-cp36-none-any.whl size=4068 sha256=7a5d6718e16dbff82fe94e1229d233a19ef52280ff0d4fc48ef62a2ba41d5855
Stored in directory: /root/.cache/pip/wheels/89/57/33/ce153d31de05d74323324df0f45a08ea99e92300e549da5154
Successfully built localscope
Installing collected packages: localscope
Successfully installed localscope-0.1.3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-549f0808922e> in <module>()
1 a = 'hello world'
----> 2 @localscope
3 def myfun():
4 print(a)
5
/usr/local/lib/python3.6/dist-packages/localscope/__init__.py in localscope(func, predicate, allowed, allow_closure, _globals)
115 value = _globals[name]
116 if not predicate(value):
--> 117 raise ValueError(f'`{name}` is not a permitted global')
118 elif instruction.opname == 'STORE_DEREF':
119 allowed.append(name)
ValueError: `a` is not a permitted global
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-5cfa07ed9c85> in <module>()
2 return 42
3
----> 4 @localscope
5 def myfun3():
6 return myfun2()
/usr/local/lib/python3.6/dist-packages/localscope/__init__.py in localscope(func, predicate, allowed, allow_closure, _globals)
115 value = _globals[name]
116 if not predicate(value):
--> 117 raise ValueError(f'`{name}` is not a permitted global')
118 elif instruction.opname == 'STORE_DEREF':
119 allowed.append(name)
ValueError: `myfun2` is not a permitted global
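To give a feel for the kind of check involved, here is a rough standard-library sketch that lists the non-builtin global names a function reads, by scanning its bytecode with the dis module. It is not localscope's implementation, and global_names_used is a hypothetical helper; the first example function mirrors the one in the first traceback above.

```python
import builtins
import dis


def global_names_used(func):
    """Return the sorted non-builtin names that `func` loads from global scope."""
    names = set()
    for instruction in dis.get_instructions(func):
        is_global_load = instruction.opname == "LOAD_GLOBAL"
        if is_global_load and not hasattr(builtins, instruction.argval):
            names.add(instruction.argval)
    return sorted(names)


a = "hello world"


def myfun():
    print(a)  # silently depends on the notebook-level variable `a`


def myfun_clean(message):
    print(message)  # depends only on its argument


print(global_names_used(myfun))        # ['a'] when run at the top level of a notebook
print(global_names_used(myfun_clean))  # []
```

A decorator such as localscope goes one step further and raises an error at definition time, as the tracebacks show.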
Argparse
Often code is designed to be run from the command line, and can be configured by passing in arguments and flags. To make this work in colab, you have to use parse_known_args, as in the example below.
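A minimal sketch of the pattern follows. Inside a notebook, sys.argv holds the kernel's own launch arguments (typically a -f flag followed by a connection file), which a plain parse_args call would reject; parse_known_args returns them separately instead. The --lr and --epochs flags and the file path are hypothetical, chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse our own flags and return (args, unknown) without failing on extras."""
    parser = argparse.ArgumentParser(description="toy training script")
    parser.add_argument("--lr", type=float, default=0.01)  # hypothetical flag
    parser.add_argument("--epochs", type=int, default=5)   # hypothetical flag
    # With argv=None, argparse reads sys.argv[1:], which in a notebook
    # contains the kernel's arguments rather than ours.
    return parser.parse_known_args(argv)


# Simulate a notebook-style argument list mixed with one of our own flags.
args, unknown = parse_args(["-f", "/tmp/kernel.json", "--lr", "0.1"])
print(args.lr, args.epochs, unknown)  # 0.1 5 ['-f', '/tmp/kernel.json']
```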
YAML files
We show how to create a config file locally, and then pass it to your code.
Hardware accelerators
By default, Colab runs on a CPU, but you can select GPU or TPU for extra speed, as we show below. To get access to more powerful machines (with faster processors, more memory, and longer idle timeouts), you can subscribe to Colab Pro. At the time of writing (Jan 2021), the cost is $10/month (USD). This is a good deal if you use GPUs a lot.
CPUs
To see what devices you have, use this command.
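As a framework-free fallback, the sketch below reports basic CPU information using only the standard library. It does not list accelerators, and the helper name describe_cpu is an assumption made for illustration.

```python
import os
import platform


def describe_cpu():
    """Return a small summary of the host CPU using only the standard library."""
    return {
        "logical_cpus": os.cpu_count(),   # may be None if it cannot be determined
        "machine": platform.machine(),    # architecture string reported by the OS
        "system": platform.system(),      # operating system name
    }


print(describe_cpu())
```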
Memory
GPUs
If you select the 'Runtime' menu at top left, and then select 'Change runtime type' and then select 'GPU', you can get free access to a GPU.
To see what kind of GPU you are using, see below.