
Open In Colab

Running parallel jobs on Google Cloud using Caliban

Caliban is a package that makes it easy to run embarrassingly parallel jobs on Google Cloud Platform (GCP) from your laptop. (Caliban bundles your code into a Docker image and then runs it on Cloud AI Platform, which provisions VMs on GCP for you.)

import json
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

Installation

The details of how to install and run Caliban can be found in the Caliban documentation. Below we give a very brief summary. Do these steps on your laptop, outside of this colab.
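At the time of writing, installation roughly amounts to setting up Docker and then running pip install caliban; treat this as a hint only and follow the official Caliban instructions for the authoritative steps.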

Launch jobs on GCP

Do these steps on your laptop, outside of this colab.

  • create a requirements.txt file listing the packages you need installed in the GCP Docker image. Example:

numpy
scipy
#sympy
matplotlib
#torch # 776MB slow
#torchvision
tensorflow_datasets
jupyter
ipywidgets
seaborn
pandas
keras
sklearn
#ipympl
jax
flax
# below is jaxlib with GPU support
# CUDA 10.0
#tensorflow-gpu==2.0
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp36-none-linux_x86_64.whl
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
# CUDA 10.1
#tensorflow-gpu==2.1
#https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
tensorflow==2.1 # 421MB slow
https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.60+cuda101-cp37-none-manylinux2010_x86_64.whl
# jaxlib with CPU support
#tensorflow
#jaxlib
  • create the script that you want to run in parallel, e.g. caliban_test.py; a minimal sketch of such a script is shown below.
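The actual script is pyprobml/scripts/caliban_test.py; the sketch below is only a hypothetical stand-in to show the general shape: it accepts the flags supplied by the config file and prints its results to stdout so they end up in the job logs.

# hypothetical minimal stand-in for caliban_test.py (the real script does more,
# e.g. it also prints the jax/tensorflow/flax versions and the available devices)
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--ndims', type=int, default=10)
    parser.add_argument('--prefix', type=str, default='***')
    args = parser.parse_args()
    print(f'{args.prefix} ndims = {args.ndims}')
    # ... do the real work here (build a model, run an experiment, etc.) ...

if __name__ == '__main__':
    main()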

  • create a config.json file with the list of flag combinations you want to pass to the script. For example, the following file says to run 2 versions of the script, with flags --ndims 10 --prefix "***" and --ndims 100 --prefix "***". (The prefix flag is just for pretty printing.) Caliban expands each list-valued key into the cross product of flag settings, as sketched after the example.

{"ndims": [10, 100], "prefix": "***" }
  • launch the jobs on GCP, giving them a common name using the --xgroup flag. Example:

cp ~/github/pyprobml/scripts/caliban_test.py .
caliban cloud --experiment_config config.json --xgroup mygroup --gpu_spec 2xV100 caliban_test.py

You can specify the kind of machine you want to use as explained in the Caliban documentation. If you omit --gpu_spec, it defaults to an n1-standard-8 machine with a single P100 GPU.

  • open the URL that Caliban prints to monitor progress. Example:

Visit https://console.cloud.google.com/ai-platform/jobs/?projectId=probml to see the status of all jobs.

You should see your jobs listed in the Cloud AI Platform jobs console.

  • Monitor an individual job by clicking on 'View logs'; the log viewer shows the output of your script as it runs.

  • When jobs are done, download the log files using caliban_save_logs.py. Example:

python ~/github/pyprobml/scripts/caliban_save_logs.py --xgroup mygroup
  • Upload the log files to Google Drive and parse them inside Colab using the Python code below.

Parse the log files

!rm -rf pyprobml # Remove any old local directory to ensure fresh install
!git clone https://github.com/probml/pyprobml
Cloning into 'pyprobml'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 6409 (delta 8), reused 13 (delta 3), pack-reused 6385
Receiving objects: 100% (6409/6409), 249.32 MiB | 29.15 MiB/s, done.
Resolving deltas: 100% (3571/3571), done.
Checking out files: 100% (738/738), done.
import pyprobml.scripts.probml_tools as pml
pml.test()
welcome to python probabilistic ML library
import pyprobml.scripts.caliban_logs_parse as parse
import glob
logdir = 'https://github.com/probml/pyprobml/tree/master/data/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames) # empty
[]
from google.colab import drive
drive.mount('/content/gdrive')
logdir = '/content/gdrive/MyDrive/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
['/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config', '/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config']
configs_df = parse.parse_configs(logdir)
display(configs_df)
for n in [1, 2]:
    print(parse.get_args(configs_df, n))
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config
['--ndims', '10']
['--ndims', '100']
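If you want each job's flags as a dictionary rather than a flat list, a small helper like the one sketched below works for simple --flag value pairs. It is not part of caliban_logs_parse; it is purely illustrative.

# hypothetical helper (not part of caliban_logs_parse): turn a flat flag list
# such as ['--ndims', '10'] into a dict; assumes simple '--flag value' pairs
def flags_to_dict(args):
    return {args[i].lstrip('-'): args[i + 1] for i in range(0, len(args), 2)}

print(flags_to_dict(parse.get_args(configs_df, 1)))  # e.g. {'ndims': '10'}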
logdir = '/content/gdrive/MyDrive/Logs'
#df1 = log_file_to_pandas('/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210208_194505_1.log')
logs_df = parse.parse_logs(logdir)
display(logs_df.sample(n=5))
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.log
reading /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.log
print(parse.get_log_messages(logs_df, 1))
[['Validating job requirements...'] ['Job creation request has been successfully validated.'] ['Job caliban_kpmurphy_20210209_172547_1 is queued.'] ['INFO:root:python caliban_test.py --ndims 10\n'] ["2021-02-10 01:28:58.986485: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:28:58.986516: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n'] ['2021-02-10 01:29:02.426093: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n'] ['2021-02-10 01:29:02.426243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1\n'] ['2021-02-10 01:29:02.426486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:29:02.427176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \n'] ['pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ['2021-02-10 01:29:02.427362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:29:02.428102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: \n'] ['pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ["2021-02-10 01:29:02.428386: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.428602: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.428771: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:29:02.430711: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n'] ['2021-02-10 01:29:02.431307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n'] ['2021-02-10 01:29:02.431359: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened 
dynamic library libcusolver.so.10\n'] ["2021-02-10 01:29:02.431555: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:29:02.431737: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:29:02.431757: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n'] ['Skipping registering GPU devices...\n'] ['*** jax version 0.2.9\n'] ['*** jax backend gpu\n'] ['*** [GpuDevice(id=0), GpuDevice(id=1)]\n'] ['*** ndims = 10\n'] ['-1.0 5.6525435\n'] ['tf version 2.4.1\n'] ['TF backend\n'] ["[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]\n"] ['*** flax version 0.3.0\n'] ['*** MLP(\n'] [' # attributes\n'] [' features = [10, 10]\n'] [')\n']]
print(parse.get_log_messages(logs_df, 2))
[['Validating job requirements...'] ['Job creation request has been successfully validated.'] ['Job caliban_kpmurphy_20210209_172548_2 is queued.'] ['This job is number 1 in the queue and requires 8.0 N1/E2 CPUs, 2 V100 accelerators, 100Gb standard disks and 0Gb ssd disks. The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 V100, 30 K80, 30 P100, 4 P4, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed across all regions.The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed in the region us-central1.'] ['INFO:root:python caliban_test.py --ndims 100\n'] ["2021-02-10 01:35:02.630394: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:02.630430: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n'] ['2021-02-10 01:35:06.340820: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n'] ['2021-02-10 01:35:06.340947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1\n'] ['2021-02-10 01:35:06.341216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:35:06.341889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \n'] ['pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ['2021-02-10 01:35:06.342074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n'] ['2021-02-10 01:35:06.342853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: \n'] ['pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n'] ['coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n'] ["2021-02-10 01:35:06.343169: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.343422: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: 
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.343587: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:06.346030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n'] ['2021-02-10 01:35:06.346487: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n'] ['2021-02-10 01:35:06.346528: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n'] ["2021-02-10 01:35:06.346676: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ["2021-02-10 01:35:06.346837: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"] ['2021-02-10 01:35:06.346859: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n'] ['Skipping registering GPU devices...\n'] ['*** jax version 0.2.9\n'] ['*** jax backend gpu\n'] ['*** [GpuDevice(id=0), GpuDevice(id=1)]\n'] ['*** ndims = 100\n'] ['1.0 180.60815\n'] ['tf version 2.4.1\n'] ['TF backend\n'] ["[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]\n"] ['*** flax version 0.3.0\n'] ['*** MLP(\n'] [' # attributes\n'] [' features = [100, 100]\n'] [')\n']]
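Since the raw logs are dominated by TensorFlow startup messages, it can be handy to pull out just the lines the script itself printed, which (thanks to the --prefix flag) all start with '***'. The snippet below is a small sketch of that, assuming get_log_messages returns a sequence of single-element message lists, as the printed output above suggests.

# keep only the lines the script itself printed; thanks to the --prefix flag
# they all start with '***' (assumes each element of the returned sequence is a
# one-element list of strings, as in the printed output above)
for job in [1, 2]:
    msgs = parse.get_log_messages(logs_df, job)
    script_lines = [m[0].strip() for m in msgs if m[0].startswith('***')]
    print(f'job {job}:')
    for line in script_lines:
        print('  ', line)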