
Open In Colab

Running parallel jobs on Google Cloud using Caliban

Caliban is a package that makes it easy to run embarrassingly parallel jobs on Google Cloud Platform (GCP) from your laptop. (Caliban bundles your code into a Docker image and then runs it on Cloud AI Platform, which provisions VMs on GCP for you.)

import json
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

Installation

The details of how to install and run Caliban can be found in the Caliban documentation. Below we give a very brief summary. Do these steps on your laptop, outside of this colab.
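At the time of writing, installation roughly amounts to setting up Docker and then running pip install caliban; treat this as a hint only and follow the official Caliban instructions for the authoritative steps.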

Launch jobs on GCP

Do these steps on your laptop, outside of this colab.

  • create a requirements.txt file listing the packages you need installed in the GCP Docker image. Example:

numpy
scipy
#sympy
matplotlib
#torch # 776MB slow
#torchvision
tensorflow_datasets
jupyter
ipywidgets
seaborn
pandas
keras
sklearn
#ipympl
jax
flax
# below is jaxlib with GPU support
# CUDA 10.0
#tensorflow-gpu==2.0
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp36-none-linux_x86_64.whl
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
# CUDA 10.1
#tensorflow-gpu==2.1
#https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
tensorflow==2.1 # 421MB slow
https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.60+cuda101-cp37-none-manylinux2010_x86_64.whl
# jaxlib with CPU support
#tensorflow
#jaxlib
  • create the script that you want to run in parallel, e.g. caliban_test.py; a minimal sketch of such a script is shown below.
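The actual script is pyprobml/scripts/caliban_test.py; the sketch below is only a hypothetical stand-in to show the general shape: it accepts the flags supplied by the config file and prints its results to stdout so they end up in the job logs.

# hypothetical minimal stand-in for caliban_test.py (the real script does more,
# e.g. it also prints the jax/tensorflow/flax versions and the available devices)
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--ndims', type=int, default=10)
    parser.add_argument('--prefix', type=str, default='***')
    args = parser.parse_args()
    print(f'{args.prefix} ndims = {args.ndims}')
    # ... do the real work here (build a model, run an experiment, etc.) ...

if __name__ == '__main__':
    main()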

  • create a config.json file with the list of flag combinations you want to pass to the script. For example, the following file says to run 2 versions of the script, with flags --ndims 10 --prefix "***" and --ndims 100 --prefix "***". (The prefix flag is just for pretty printing.) Caliban expands each list-valued key into the cross product of flag settings, as sketched after the example.

{"ndims": [10, 100], "prefix": "***" }
  • launch the jobs on GCP, giving them a common name using the --xgroup flag. Example:

cp ~/github/pyprobml/scripts/caliban_test.py .
caliban cloud --experiment_config config.json --xgroup mygroup --gpu_spec 2xV100 caliban_test.py

You can specify the kind of machine you want to use as explained in the Caliban documentation. If you omit --gpu_spec, it defaults to an n1-standard-8 machine with a single P100 GPU.

  • open the URL that Caliban prints to monitor progress. Example:

Visit https://console.cloud.google.com/ai-platform/jobs/?projectId=probml to see the status of all jobs.

You should see your jobs listed in the Cloud AI Platform jobs console.

  • Monitor an individual job by clicking on 'View logs'; the log viewer shows the output of your script as it runs.

  • When jobs are done, download the log files using caliban_save_logs.py. Example:

python ~/github/pyprobml/scripts/caliban_save_logs.py --xgroup mygroup
  • Upload the log files to Google Drive and parse them inside Colab using the Python code below.

Parse the log files

!rm -rf pyprobml # Remove any old local directory to ensure fresh install
!git clone https://github.com/probml/pyprobml
Cloning into 'pyprobml'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 6409 (delta 8), reused 13 (delta 3), pack-reused 6385
Receiving objects: 100% (6409/6409), 249.32 MiB | 29.15 MiB/s, done.
Resolving deltas: 100% (3571/3571), done.
Checking out files: 100% (738/738), done.
import pyprobml.scripts.probml_tools as pml
pml.test()
welcome to python probabilistic ML library
import pyprobml.scripts.caliban_logs_parse as parse
import glob
logdir = 'https://github.com/probml/pyprobml/tree/master/data/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames) # empty
[]
from google.colab import drive
drive.mount('/content/gdrive')
logdir = '/content/gdrive/MyDrive/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
['/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config', '/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config']
configs_df = parse.parse_configs(logdir)
display(configs_df)
for n in [1, 2]:
    print(parse.get_args(configs_df, n))
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config
['--ndims', '10']
['--ndims', '100']
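If you want each job's flags as a dictionary rather than a flat list, a small helper like the one sketched below works for simple --flag value pairs. It is not part of caliban_logs_parse; it is purely illustrative.

# hypothetical helper (not part of caliban_logs_parse): turn a flat flag list
# such as ['--ndims', '10'] into a dict; assumes simple '--flag value' pairs
def flags_to_dict(args):
    return {args[i].lstrip('-'): args[i + 1] for i in range(0, len(args), 2)}

print(flags_to_dict(parse.get_args(configs_df, 1)))  # e.g. {'ndims': '10'}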
logdir = '/content/gdrive/MyDrive/Logs'
#df1 = log_file_to_pandas('/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210208_194505_1.log')
logs_df = parse.parse_logs(logdir)
display(logs_df.sample(n=5))
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.log
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.log
print(parse.get_log_messages(logs_df, 1))
[['Validating job requirements...'] ['Job creation request has been successfully validated.'] ['Job caliban_kpmurphy_20210209_172547_1 is queued.'] ['INFO:root:python caliban_test.py --ndims 10\n'] ["2021-02-10 01:28:58.986485: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:28:58.986516: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n'] ['2021-02-10 01:29:02.426093: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n'] ['2021-02-10 01:29:02.426243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1\n'] ['2021-02-10 01:29:02.426486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:29:02.427176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \n'] ['pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ['2021-02-10 01:29:02.427362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:29:02.428102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: \n'] ['pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ["2021-02-10 01:29:02.428386: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.428602: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.428771: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:29:02.430711: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n'] ['2021-02-10 01:29:02.431307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n'] ['2021-02-10 01:29:02.431359: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened 
dynamic library libcusolver.so.10\n'] ["2021-02-10 01:29:02.431555: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.431737: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:29:02.431757: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n'] ['Skipping registering GPU devices...\n'] ['*** jax version 0.2.9\n'] ['*** jax backend gpu\n'] ['*** [GpuDevice(id=0), GpuDevice(id=1)]\n'] ['*** ndims = 10\n'] ['-1.0 5.6525435\n'] ['tf version 2.4.1\n'] ['TF backend\n'] ["[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]\n"] ['*** flax version 0.3.0\n'] ['*** MLP(\n'] [' # attributes\n'] [' features = [10, 10]\n'] [')\n']]
print(parse.get_log_messages(logs_df, 2))
[['Validating job requirements...'] ['Job creation request has been successfully validated.'] ['Job caliban_kpmurphy_20210209_172548_2 is queued.'] ['This job is number 1 in the queue and requires 8.0 N1/E2 CPUs, 2 V100 accelerators, 100Gb standard disks and 0Gb ssd disks. The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 V100, 30 K80, 30 P100, 4 P4, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed across all regions.The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed in the region us-central1.'] ['INFO:root:python caliban_test.py --ndims 100\n'] ["2021-02-10 01:35:02.630394: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:02.630430: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n'] ['2021-02-10 01:35:06.340820: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n'] ['2021-02-10 01:35:06.340947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1\n'] ['2021-02-10 01:35:06.341216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:35:06.341889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \n'] ['pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ['2021-02-10 01:35:06.342074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:35:06.342853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: \n'] ['pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ["2021-02-10 01:35:06.343169: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.343422: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: 
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.343587: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:06.346030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n'] ['2021-02-10 01:35:06.346487: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n'] ['2021-02-10 01:35:06.346528: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n'] ["2021-02-10 01:35:06.346676: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.346837: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:06.346859: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n'] ['Skipping registering GPU devices...\n'] ['*** jax version 0.2.9\n'] ['*** jax backend gpu\n'] ['*** [GpuDevice(id=0), GpuDevice(id=1)]\n'] ['*** ndims = 100\n'] ['1.0 180.60815\n'] ['tf version 2.4.1\n'] ['TF backend\n'] ["[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]\n"] ['*** flax version 0.3.0\n'] ['*** MLP(\n'] [' # attributes\n'] [' features = [100, 100]\n'] [')\n']]
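Since the raw logs are dominated by TensorFlow startup messages, it can be handy to pull out just the lines the script itself printed, which (thanks to the --prefix flag) all start with '***'. The snippet below is a small sketch of that, assuming get_log_messages returns a sequence of single-element message lists, as the printed output above suggests.

# keep only the lines the script itself printed; thanks to the --prefix flag
# they all start with '***' (assumes each element of the returned sequence is a
# one-element list of strings, as in the printed output above)
for job in [1, 2]:
    msgs = parse.get_log_messages(logs_df, job)
    script_lines = [m[0].strip() for m in msgs if m[0].startswith('***')]
    print(f'job {job}:')
    for line in script_lines:
        print('  ', line)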