GitHub Repository: huggingface/notebooks
Path: blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb
Kernel: Python 3

Introduction

This notebook is designed to run inference on the Diffuser planning model for model-based RL. The notebook is modified from the authors' original. For those new to reinforcement learning, consider checking out the HuggingFace Reinforcement Learning Course for a primer.

Colab made by Nathan Lambert and Ben Glickenhaus.

Installing Packages

`apt-get install` requirements

These requirements are primarily needed to install MuJoCo and run it in Colab. The setup was inspired by this (fairly recent) demo.

In [ ]:

# installations primarily needed for MuJoCo
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libglew-dev is already the newest version (2.0.0-5).
libgl1-mesa-dev is already the newest version (20.0.8-0ubuntu1~18.04.1).
libgl1-mesa-glx is already the newest version (20.0.8-0ubuntu1~18.04.1).
libosmesa6-dev is already the newest version (20.0.8-0ubuntu1~18.04.1).
software-properties-common is already the newest version (0.96.24.32.18).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 27 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
patchelf is already the newest version (0.9-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 27 not upgraded.

Install Diffusers

In [ ]:

%cd /content

# install latest HF diffusers
!rm -rf /content/diffusers/
!git clone -b rl https://github.com/huggingface/diffusers.git
!pip install -q /content/diffusers 
!pip install -q datasets transformers

/content
Found existing installation: diffusers 0.5.0.dev0
Uninstalling diffusers-0.5.0.dev0:
  Successfully uninstalled diffusers-0.5.0.dev0
Cloning into 'diffusers'...
remote: Enumerating objects: 10356, done.
remote: Counting objects: 100% (502/502), done.
remote: Compressing objects: 100% (251/251), done.
remote: Total 10356 (delta 277), reused 384 (delta 201), pack-reused 9854
Receiving objects: 100% (10356/10356), 7.81 MiB | 17.77 MiB/s, done.
Resolving deltas: 100% (6885/6885), done.
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for diffusers (PEP 517) ... done

`pip install` requirements

In [ ]:

# primarily RL-specific requirements
%pip install -f https://download.pytorch.org/whl/torch_stable.html \
                free-mujoco-py \
                einops \
                gym==0.24.1 \
                protobuf==3.20.1 \
                git+https://github.com/rail-berkeley/d4rl.git \
                mediapy \
                Pillow==9.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting git+https://github.com/rail-berkeley/d4rl.git
  Cloning https://github.com/rail-berkeley/d4rl.git to /tmp/pip-req-build-7j2y8u6t
  Running command git clone -q https://github.com/rail-berkeley/d4rl.git /tmp/pip-req-build-7j2y8u6t
Requirement already satisfied: free-mujoco-py in /usr/local/lib/python3.7/dist-packages (2.1.6)
Requirement already satisfied: einops in /usr/local/lib/python3.7/dist-packages (0.5.0)
Requirement already satisfied: gym in /usr/local/lib/python3.7/dist-packages (0.24.1)
Requirement already satisfied: protobuf==3.20.1 in /usr/local/lib/python3.7/dist-packages (3.20.1)
Requirement already satisfied: mediapy in /usr/local/lib/python3.7/dist-packages (1.1.2)
Requirement already satisfied: Pillow==9.0.0 in /usr/local/lib/python3.7/dist-packages (9.0.0)
Collecting mjrl@ git+https://github.com/aravindr93/mjrl@master#egg=mjrl
  Cloning https://github.com/aravindr93/mjrl (to revision master) to /tmp/pip-install-g98wzheg/mjrl_0abe064c9aa541e98742a70535434798
  Running command git clone -q https://github.com/aravindr93/mjrl /tmp/pip-install-g98wzheg/mjrl_0abe064c9aa541e98742a70535434798
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (1.21.6)
Requirement already satisfied: mujoco_py in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (2.1.2.14)
Requirement already satisfied: pybullet in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (3.2.5)
Requirement already satisfied: h5py in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (3.1.0)
Requirement already satisfied: termcolor in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (2.0.1)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (7.1.2)
Requirement already satisfied: dm_control>=1.0.3 in /usr/local/lib/python3.7/dist-packages (from D4RL==1.1) (1.0.8)
Requirement already satisfied: gym-notices>=0.0.4 in /usr/local/lib/python3.7/dist-packages (from gym) (0.0.8)
Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from gym) (1.5.0)
Requirement already satisfied: importlib-metadata>=4.8.0 in /usr/local/lib/python3.7/dist-packages (from gym) (4.13.0)
Requirement already satisfied: absl-py>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (1.3.0)
Requirement already satisfied: lxml in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (4.9.1)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (2.23.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (4.64.1)
Requirement already satisfied: pyparsing<3.0.0 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (2.4.7)
Requirement already satisfied: mujoco>=2.3.0 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (2.3.0)
Requirement already satisfied: pyopengl>=3.1.4 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (3.1.6)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (1.7.3)
Requirement already satisfied: dm-tree!=0.1.2 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (0.1.7)
Requirement already satisfied: glfw in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (1.12.0)
Requirement already satisfied: dm-env in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (1.5)
Requirement already satisfied: labmaze in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (1.0.5)
Requirement already satisfied: setuptools!=50.0.0 in /usr/local/lib/python3.7/dist-packages (from dm_control>=1.0.3->D4RL==1.1) (57.4.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=4.8.0->gym) (3.9.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=4.8.0->gym) (4.1.1)
Requirement already satisfied: Cython<0.30.0,>=0.29.24 in /usr/local/lib/python3.7/dist-packages (from free-mujoco-py) (0.29.32)
Requirement already satisfied: cffi<2.0.0,>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from free-mujoco-py) (1.15.1)
Requirement already satisfied: fasteners==0.15 in /usr/local/lib/python3.7/dist-packages (from free-mujoco-py) (0.15)
Requirement already satisfied: imageio<3.0.0,>=2.9.0 in /usr/local/lib/python3.7/dist-packages (from free-mujoco-py) (2.9.0)
Requirement already satisfied: monotonic>=0.1 in /usr/local/lib/python3.7/dist-packages (from fasteners==0.15->free-mujoco-py) (1.6)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from fasteners==0.15->free-mujoco-py) (1.15.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.7/dist-packages (from cffi<2.0.0,>=1.15.0->free-mujoco-py) (2.21)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from mediapy) (3.2.2)
Requirement already satisfied: ipython in /usr/local/lib/python3.7/dist-packages (from mediapy) (7.9.0)
Requirement already satisfied: cached-property in /usr/local/lib/python3.7/dist-packages (from h5py->D4RL==1.1) (1.5.2)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (5.1.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (0.7.5)
Requirement already satisfied: backcall in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (0.2.0)
Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (0.18.1)
Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (4.8.0)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (2.6.1)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (2.0.10)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from ipython->mediapy) (4.4.2)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.7/dist-packages (from jedi>=0.10->ipython->mediapy) (0.8.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->mediapy) (0.2.5)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->mediapy) (0.11.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->mediapy) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->mediapy) (1.4.4)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.7/dist-packages (from pexpect->ipython->mediapy) (0.7.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->dm_control>=1.0.3->D4RL==1.1) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->dm_control>=1.0.3->D4RL==1.1) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->dm_control>=1.0.3->D4RL==1.1) (1.25.11)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->dm_control>=1.0.3->D4RL==1.1) (2022.9.24)

Import D4RL to initialize Mujoco

Mujoco is a physics simulator used extensively in reinforcement learning research. Here, we import D4RL (a library of datasets and environments for Offline RL), which results in the building of Mujoco.

In [ ]:

## cythonize mujoco-py at first import
import d4rl

Warning: Gym version v0.24.1 has a number of critical issues with `gym.make` such that environment observation and action spaces are incorrectly evaluated, raising incorrect errors and warning . It is recommend to downgrading to v0.23.1 or upgrading to v0.25.1
Warning: Flow failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message.
No module named 'flow'
Warning: CARLA failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message.
No module named 'carla'
/usr/local/lib/python3.7/dist-packages/gym/envs/registration.py:416: UserWarning: WARN: The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead.
  "The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead."
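The Flow and CARLA warnings above are harmless for this notebook, and the message itself names an environment variable that silences them. A minimal sketch, assuming the variable is read when D4RL is imported (so it has to be set before `import d4rl`):

```python
import os

# The variable name is taken verbatim from the warning text above.
# Assumption: D4RL checks it at import time, so set it before importing.
os.environ["D4RL_SUPPRESS_IMPORT_ERROR"] = "1"

# import d4rl  # uncomment in the notebook; left commented so the sketch runs anywhere
```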

Environment & Model Setup

In this section, we will create the environment, handle the data, and run the diffusion model.

Imports

In [ ]:

import torch
import tqdm
import numpy as np
import gym

Create environment

This colab is designed to run with pretrained models from the hopper environment. As more models are trained, this can be extended.

In [ ]:

env_name = "hopper-medium-v2"
env = gym.make(env_name)
data = env.get_dataset() # dataset is only used for normalization in this colab

/usr/local/lib/python3.7/dist-packages/gym/envs/mujoco/mujoco_env.py:47: UserWarning: WARN: This version of the mujoco environments depends on the mujoco-py bindings, which are no longer maintained and may stop working. Please upgrade to the v4 versions of the environments (which depend on the mujoco python bindings instead), unless you are trying to precisely replicate previous works).
  "This version of the mujoco environments depends "
/usr/local/lib/python3.7/dist-packages/gym/spaces/box.py:112: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:70: UserWarning: WARN: Agent's minimum action space value is -infinity. This is probably too low.
  "Agent's minimum action space value is -infinity. This is probably too low."
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:74: UserWarning: WARN: Agent's maximum action space value is infinity. This is probably too high
  "Agent's maximum action space value is infinity. This is probably too high"
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:98: UserWarning: WARN: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
  "We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) "
load datafile:  19%|█▉        | 4/21 [00:00<00:03,  5.38it/s]/usr/local/lib/python3.7/dist-packages/h5py/_hl/dataset.py:767: DeprecationWarning: Passing None into shape arguments as an alias for () is deprecated.
  arr = numpy.ndarray(selection.mshape, dtype=new_dtype)
load datafile: 100%|██████████| 21/21 [00:01<00:00, 15.70it/s]

Define constants

In [ ]:

# Cuda settings for colab
torch.cuda.get_device_name(0)
DEVICE = 'cuda:0'
DTYPE = torch.float

# diffusion model settings
n_samples = 4   # number of trajectories planned via diffusion
horizon = 128   # length of sampled trajectories
state_dim = env.observation_space.shape[0] 
action_dim = env.action_space.shape[0]
num_inference_steps = 20 # number of diffusion steps

Helper functions

normalize scales state values using the statistics of the D4RL training dataset,
de_normalize undoes the scaling so the data renders correctly,
to_torch handles casting to torch for both numpy arrays and dicts (used for conditioning the model, see reset_x0).

In [ ]:

def normalize(x_in, data, key):
    means = data[key].mean(axis=0)
    stds = data[key].std(axis=0)
    return (x_in - means) / stds


def de_normalize(x_in, data, key):
    means = data[key].mean(axis=0)
    stds = data[key].std(axis=0)
    return x_in * stds + means


def to_torch(x_in, dtype=None, device=None):
    dtype = dtype or DTYPE
    device = device or DEVICE
    if isinstance(x_in, dict):
        return {k: to_torch(v, dtype, device) for k, v in x_in.items()}
    elif torch.is_tensor(x_in):
        return x_in.to(device).type(dtype)
    return torch.tensor(x_in, dtype=dtype, device=device)
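To make the normalization helpers concrete, here is a minimal round-trip check. It assumes a synthetic stand-in for the D4RL dataset (`fake_data` is hypothetical, not the real `env.get_dataset()` output) and restates the two helpers compactly so the sketch runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for env.get_dataset(): 1000 observations of dimension 11.
fake_data = {"observations": rng.normal(loc=3.0, scale=2.0, size=(1000, 11))}

def normalize(x_in, data, key):
    # per-dimension standardization using dataset statistics
    return (x_in - data[key].mean(axis=0)) / data[key].std(axis=0)

def de_normalize(x_in, data, key):
    # inverse of normalize
    return x_in * data[key].std(axis=0) + data[key].mean(axis=0)

all_obs = fake_data["observations"]
z = normalize(all_obs, fake_data, "observations")
restored = de_normalize(z, fake_data, "observations")

# After normalization each dimension has roughly zero mean and unit std,
# and de_normalize recovers the original values.
print(np.allclose(z.mean(axis=0), 0.0), np.allclose(z.std(axis=0), 1.0))
print(np.allclose(restored, all_obs))
```

Because the statistics are computed per dimension over the whole dataset, the same helpers apply to a single observation or to a batch of trajectories through broadcasting.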

Sample env. initial state

In [ ]:

## Can set environment seed for debugging
# torch.manual_seed(0)
# np.random.seed(0)
# env.seed(1996)

obs = env.reset()
obs_raw = obs

# normalize observations for forward passes
obs = normalize(obs, data, 'observations')

Run the Diffusion Process -- from Scratch

Initialize model

In this section, we create a scheduler and load a pretrained model from the Hub. An important detail in the RL application space is to save conditions, which allow the model to optimize trajectories only from the current state (which is crucial to making decisions!).

In [ ]:

from diffusers import DDPMScheduler, UNet1DModel

# Two generators for different parts of the diffusion loop to work in colab
generator = torch.Generator(device='cuda')
generator_cpu = torch.Generator(device='cpu')

scheduler = DDPMScheduler(num_train_timesteps=100,beta_schedule="squaredcos_cap_v2")

# The horizon represents the length of trajectories used in training.
network = UNet1DModel.from_pretrained("bglick13/hopper-medium-v2-value-function-hor32", subfolder="unet").to(device=DEVICE)
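For intuition about the `beta_schedule="squaredcos_cap_v2"` argument: to my understanding it selects a cosine noise schedule whose per-step betas are capped at 0.999. The NumPy sketch below is an independent re-derivation under that assumption, not the library's code; `cosine_betas`, `max_beta`, and the offset `s` are illustrative names and values.

```python
import numpy as np

def cosine_betas(num_steps, max_beta=0.999, s=0.008):
    """Betas derived from a squared-cosine cumulative signal level alpha_bar(t)."""
    def alpha_bar(t):
        return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2

    t = np.arange(num_steps)
    t1 = t / num_steps          # start of each step, in [0, 1)
    t2 = (t + 1) / num_steps    # end of each step, in (0, 1]
    # beta_t = 1 - alpha_bar(t2) / alpha_bar(t1), capped to avoid a singular last step
    return np.minimum(1 - alpha_bar(t2) / alpha_bar(t1), max_beta)

betas = cosine_betas(100)       # 100 matches num_train_timesteps above
alphas_cumprod = np.cumprod(1 - betas)
print(betas[0], betas[-1], alphas_cumprod[-1])
```

The schedule adds very little noise in the first steps and reaches the cap at the final step, so the cumulative signal level decays smoothly to nearly zero.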

Planning helper function

reset_x0 is used to constrain the diffusion process to trajectories starting at the current state of the agent. Without this, the diffusion process would generate arbitrary high-reward trajectories, rather than trajectories beginning at the current state.

In [ ]:

def reset_x0(x_in, cond, act_dim):
    for key, val in cond.items():
        x_in[:, key, act_dim:] = val.clone()
    return x_in
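The indexing in `reset_x0` relies on each trajectory row being laid out as actions first, then states. A small NumPy analogue illustrates which entries are overwritten; `reset_x0_np` is a hypothetical helper, and the sizes (3 actions, 11 state dimensions, a shortened horizon of 8) are assumptions for illustration.

```python
import numpy as np

action_dim, state_dim = 3, 11   # assumed Hopper sizes
n_samples, horizon = 4, 8       # shortened horizon for readability

def reset_x0_np(x_in, cond, act_dim):
    # cond maps a timestep index to the state that must appear at that timestep
    for t, state in cond.items():
        x_in[:, t, act_dim:] = state   # overwrite only the state slice
    return x_in

x = np.random.default_rng(0).normal(size=(n_samples, horizon, action_dim + state_dim))
before = x.copy()
current_state = np.ones((n_samples, state_dim))

x = reset_x0_np(x, {0: current_state}, action_dim)

print(x[0, 0, action_dim:])                     # the state slice at t=0 is now all ones
print(np.array_equal(x[:, 1:], before[:, 1:]))  # later timesteps are untouched
```

Only the state part of timestep 0 is pinned. The action at timestep 0 and everything at later timesteps remain free for the diffusion process to fill in.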

Setup for denoising

conditions is the variable used to pin the first state of the planned trajectories to the current state (it is passed into reset_x0).

In [ ]:

## add a batch dimension and repeat for multiple samples
## [ observation_dim ] --> [ n_samples x observation_dim ]
obs = obs[None].repeat(n_samples, axis=0)
conditions = {
    0: to_torch(obs, device=DEVICE)
  }

# constants for inference
batch_size = len(conditions[0])
shape = (batch_size, horizon, state_dim+action_dim)
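As a quick shape check for the batching step, the sketch below repeats a single observation across samples and builds the trajectory shape. The dimensions 11 and 3 are assumed Hopper sizes, and the observation values are placeholders.

```python
import numpy as np

n_samples, horizon = 4, 128
state_dim, action_dim = 11, 3   # assumed Hopper sizes

obs = np.arange(state_dim, dtype=np.float32)       # placeholder observation
batched = obs[None].repeat(n_samples, axis=0)      # [state_dim] -> [n_samples, state_dim]
shape = (len(batched), horizon, state_dim + action_dim)

print(batched.shape, shape)
```

Every row of `batched` is the same observation, so all planned trajectories start from the same state and differ only through the sampled noise.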

Sample initial noise

In [ ]:

# sample random initial noise vector
x1 = torch.randn(shape, device=DEVICE, generator=generator)

# this model is conditioned from an initial state, so you will see this function
#  multiple times to change the initial state of generated data to the state 
#  generated via env.reset() above or env.step() below
x = reset_x0(x1, conditions, action_dim)

# convert a np observation to torch for model forward pass
x = to_torch(x)

Generate trajectories

The diffusion process for trajectories has 4 central components:

1. sample a predicted original sample from the model (note that this model directly predicts the sample, rather than the error term epsilon used in many diffusion models),
2. use the scheduler to predict the sample at the previous timestep,
3. [optional] add posterior noise to the sample,
4. condition the trajectory to constrain the initial state.

In [ ]:

eta = 1.0 # noise factor for sampling reconstructed state

# run the diffusion process
# for i in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
for i in tqdm.tqdm(scheduler.timesteps):

    # create batch of timesteps to pass into model
    timesteps = torch.full((batch_size,), i, device=DEVICE, dtype=torch.long)
    
    # 1. generate prediction from model
    with torch.no_grad():
        residual = network(x.permute(0, 2, 1), timesteps).sample
        residual = residual.permute(0, 2, 1)  # needed to match model params to original

    # 2. use the model prediction to reconstruct an observation (de-noise)
    obs_reconstruct = scheduler.step(residual, i, x, predict_epsilon=False)["prev_sample"]

    # 3. [optional] add posterior noise to the sample
    if eta > 0:
        noise = torch.randn(obs_reconstruct.shape, generator=generator_cpu).to(obs_reconstruct.device)
        posterior_variance = scheduler._get_variance(i)  # * noise
        # no noise when t == 0
        # NOTE: original implementation missing sqrt on posterior_variance
        obs_reconstruct = obs_reconstruct + int(i > 0) * (0.5 * posterior_variance) * eta * noise  # MJ had as log var, exponentiated

    # 4. apply conditions to the trajectory
    obs_reconstruct_postcond = reset_x0(obs_reconstruct, conditions, action_dim)
    x = to_torch(obs_reconstruct_postcond)

100%|██████████| 100/100 [00:01<00:00, 78.56it/s]
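The four components of the loop can be illustrated without a trained network. The following NumPy sketch makes several assumptions: a simple linear beta schedule, an "oracle" model that always returns a known clean trajectory, and the textbook DDPM posterior with noise scaled by the posterior standard deviation (which differs from the noise scaling in the cell above). It is a toy illustration, not the scheduler's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_steps, act_dim = 100, 3

# Assumption: a linear beta schedule, used only to keep the sketch short.
betas = np.linspace(1e-4, 2e-2, num_steps)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def alpha_bar_prev(t):
    return alphas_cumprod[t - 1] if t > 0 else 1.0

def posterior_mean(x0_pred, x_t, t):
    # mean of q(x_{t-1} | x_t, x_0) when the model predicts x_0 directly
    denom = 1.0 - alphas_cumprod[t]
    coef_x0 = np.sqrt(alpha_bar_prev(t)) * betas[t] / denom
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev(t)) / denom
    return coef_x0 * x0_pred + coef_xt * x_t

def posterior_std(t):
    return np.sqrt(betas[t] * (1.0 - alpha_bar_prev(t)) / (1.0 - alphas_cumprod[t]))

target = rng.normal(size=(4, 8, 14))   # pretend clean trajectories
cond = {0: np.zeros((4, 11))}          # pinned initial state (placeholder values)

def oracle_model(x_t, t):
    return target                      # stand-in for the trained network

x = rng.normal(size=target.shape)
for t in reversed(range(num_steps)):
    x0_pred = oracle_model(x, t)                              # 1. predict the clean sample
    x = posterior_mean(x0_pred, x, t)                         # 2. step to the previous timestep
    if t > 0:                                                 # 3. add posterior noise (none at t == 0)
        x = x + posterior_std(t) * rng.normal(size=x.shape)
    for step, state in cond.items():                          # 4. re-apply the conditions
        x[:, step, act_dim:] = state

print(np.allclose(x[:, 1:], target[:, 1:]))
```

With a perfect predictor the final step collapses onto the predicted sample, so the result matches `target` everywhere except the pinned state at timestep 0. A real network only approximates this, which is why the conditioning is re-applied after every step.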

In [ ]:

x.shape

torch.Size([4, 128, 14])

Render the samples

Rendering Tools

Rendering from MuJoCo has historically been difficult. The code below is a modified version of the renderer released with the original paper. A remaining TODO is to investigate this web-based viewer.

Video helpers

In [ ]:

import os
import mediapy as media

def to_np(x_in):
    if torch.is_tensor(x_in):
        x_in = x_in.detach().cpu().numpy()
    return x_in

# from MJ's Diffuser code 
# https://github.com/jannerm/diffuser/blob/76ae49ae85ba1c833bf78438faffdc63b8b4d55d/diffuser/utils/colab.py#L79
def mkdir(savepath):
    """
        returns `True` iff `savepath` is created
    """
    if not os.path.exists(savepath):
        os.makedirs(savepath)
        return True
    else:
        return False


def show_sample(renderer, observations, filename='sample.mp4', savebase='/content/videos'):
    '''
    observations : [ batch_size x horizon x observation_dim ]
    '''

    mkdir(savebase)
    savepath = os.path.join(savebase, filename)

    images = []
    for rollout in observations:
        ## [ horizon x height x width x channels ]
        img = renderer._renders(rollout, partial=True)
        images.append(img)

    ## [ horizon x height x (batch_size * width) x channels ]
    images = np.concatenate(images, axis=2)

    media.show_video(images, codec='h264', fps=60)
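The `np.concatenate(images, axis=2)` call in `show_sample` places the rollouts side by side in a single video. A minimal sketch with placeholder frames (solid gray levels instead of rendered images; all sizes are arbitrary) shows the resulting shape:

```python
import numpy as np

batch_size, horizon, height, width = 4, 5, 16, 16

# One placeholder video per rollout: [horizon x height x width x channels]
videos = [
    np.full((horizon, height, width, 3), fill_value=i * 60, dtype=np.uint8)
    for i in range(batch_size)
]

# Concatenating along the width axis yields [horizon x height x (batch_size * width) x channels]
tiled = np.concatenate(videos, axis=2)
print(tiled.shape)
```

Each frame of the tiled video therefore shows all rollouts at the same timestep, which makes it easy to compare the sampled plans visually.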

Renderer helpers

These functions involve setting the state of the environment and reading it out in a pixel form.

In [ ]:

# Code adapted from Michael Janner
# source: https://github.com/jannerm/diffuser/blob/main/diffuser/utils/rendering.py
import warnings

import mujoco_py as mjc

def env_map(env_name):
    '''
        map D4RL dataset names to custom fully-observed
        variants for rendering
    '''
    if 'halfcheetah' in env_name:
        return 'HalfCheetahFullObs-v2'
    elif 'hopper' in env_name:
        return 'HopperFullObs-v2'
    elif 'walker2d' in env_name:
        return 'Walker2dFullObs-v2'
    else:
        return env_name

def get_image_mask(img):
    background = (img == 255).all(axis=-1, keepdims=True)
    mask = ~background.repeat(3, axis=-1)
    return mask

def atmost_2d(x):
    while x.ndim > 2:
        x = x.squeeze(0)
    return x

def set_state(env, state):
    qpos_dim = env.sim.data.qpos.size
    qvel_dim = env.sim.data.qvel.size
    if not state.size == qpos_dim + qvel_dim:
        warnings.warn(
            f'[ utils/rendering ] Expected state of size {qpos_dim + qvel_dim}, '
            f'but got state of size {state.size}')
        state = state[:qpos_dim + qvel_dim]

    env.set_state(state[:qpos_dim], state[qpos_dim:])
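The slicing logic in `set_state` can be tried without a simulator. In the sketch below, `split_state` is a hypothetical helper that mirrors only the size check and the split into positions and velocities; the sizes (6 positions, 6 velocities) are assumed for Hopper.

```python
import warnings

import numpy as np

def split_state(state, qpos_dim, qvel_dim):
    """Split a flat state into (qpos, qvel), truncating extra entries with a warning."""
    expected = qpos_dim + qvel_dim
    if state.size != expected:
        warnings.warn(f"Expected state of size {expected}, but got state of size {state.size}")
        state = state[:expected]
    return state[:qpos_dim], state[qpos_dim:]

# A state that is two entries too long triggers the warning and is truncated.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    qpos, qvel = split_state(np.arange(14.0), qpos_dim=6, qvel_dim=6)

print(qpos, qvel, len(caught))
```

Truncating rather than raising keeps rendering robust when a trajectory carries extra trailing dimensions, at the cost of silently dropping them apart from the warning.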

Rendering class

Use the previously defined helpers to programmatically render pixel sequences from a trajectory of states. This class takes the re-scaled outputs of the diffusion process and visualizes them.

In [ ]:

class MuJoCoRenderer:
    '''
        default mujoco renderer
    '''

    def __init__(self, env):
        if type(env) is str:
            env = env_map(env)
            self.env = gym.make(env)
        else:
            self.env = env
        ## - 1 because the envs in renderer are fully-observed
        ## @TODO : clean up
        self.observation_dim = np.prod(self.env.observation_space.shape) - 1
        self.action_dim = np.prod(self.env.action_space.shape)
        try:
            self.viewer = mjc.MjRenderContextOffscreen(self.env.sim)
        except Exception:
            print('[ utils/rendering ] Warning: could not initialize offscreen renderer')
            self.viewer = None

    def pad_observation(self, observation):
        state = np.concatenate([
            np.zeros(1),
            observation,
        ])
        return state

    def pad_observations(self, observations):
        qpos_dim = self.env.sim.data.qpos.size
        ## xpos is hidden
        xvel_dim = qpos_dim - 1
        xvel = observations[:, xvel_dim]
        xpos = np.cumsum(xvel) * self.env.dt
        states = np.concatenate([
            xpos[:,None],
            observations,
        ], axis=-1)
        return states

    def render(self, observation, dim=256, partial=False, qvel=True, render_kwargs=None, conditions=None):

        if type(dim) == int:
            dim = (dim, dim)

        if self.viewer is None:
            return np.zeros((*dim, 3), np.uint8)

        if render_kwargs is None:
            xpos = observation[0] if not partial else 0
            render_kwargs = {
                'trackbodyid': 2,
                'distance': 3,
                'lookat': [xpos, -0.5, 1],
                'elevation': -20
            }

        for key, val in render_kwargs.items():
            if key == 'lookat':
                self.viewer.cam.lookat[:] = val[:]
            else:
                setattr(self.viewer.cam, key, val)

        if partial:
            state = self.pad_observation(observation)
        else:
            state = observation

        qpos_dim = self.env.sim.data.qpos.size
        if not qvel or state.shape[-1] == qpos_dim:
            qvel_dim = self.env.sim.data.qvel.size
            state = np.concatenate([state, np.zeros(qvel_dim)])

        set_state(self.env, state)

        self.viewer.render(*dim)
        data = self.viewer.read_pixels(*dim, depth=False)
        data = data[::-1, :, :]
        return data

    def _renders(self, observations, **kwargs):
        images = []
        for observation in observations:
            img = self.render(observation, **kwargs)
            images.append(img)
        return np.stack(images, axis=0)

    def renders(self, samples, partial=False, **kwargs):
        if partial:
            samples = self.pad_observations(samples)
            partial = False

        sample_images = self._renders(samples, partial=partial, **kwargs)

        composite = np.ones_like(sample_images[0]) * 255

        for img in sample_images:
            mask = get_image_mask(img)
            composite[mask] = img[mask]

        return composite

    def __call__(self, *args, **kwargs):
        return self.renders(*args, **kwargs)
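`pad_observations` reconstructs the hidden x position by integrating the x velocity over time. A minimal sketch of that step, assuming a timestep `dt = 0.008`, 6 position coordinates, and 11-dimensional observations (all assumed Hopper values), with a constant velocity chosen so the result is easy to check:

```python
import numpy as np

dt = 0.008         # assumed simulator timestep
qpos_dim = 6       # assumed number of position coordinates
horizon = 5

observations = np.zeros((horizon, 11))
xvel_index = qpos_dim - 1              # x velocity follows the (qpos_dim - 1) visible positions
observations[:, xvel_index] = 1.0      # constant velocity of 1 unit per second

# Integrate velocity to recover the x position that the observation omits.
xpos = np.cumsum(observations[:, xvel_index]) * dt
states = np.concatenate([xpos[:, None], observations], axis=-1)

print(xpos, states.shape)
```

The recovered position is relative to the start of the trajectory, which is sufficient for rendering because the camera tracks the body.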

Show Plans

This section renders the 4 trajectories planned from the same initial state in the environment.

Initialize renderer class for the environment

In [ ]:

render = MuJoCoRenderer(env)

Show the video

Show the states generated by the diffusion model in the real environment. Note that the actions are dropped from the data.

In [ ]:

de_normalized = de_normalize(to_np(x[:,:,action_dim:]), data, 'observations')
show_sample(render, de_normalized)

Run Value Guided Diffusion -- with Pipeline

In this section, we repeat the above code, but we use a pre-trained pipeline in Diffusers!

In [ ]:

from diffusers import ValueGuidedRLPipeline

In [ ]:

env_name = "hopper-medium-v2"
env = gym.make(env_name)
data = env.get_dataset()  # dataset is only used for normalization in this colab
render = MuJoCoRenderer(env)

/usr/local/lib/python3.7/dist-packages/gym/envs/mujoco/mujoco_env.py:47: UserWarning: WARN: This version of the mujoco environments depends on the mujoco-py bindings, which are no longer maintained and may stop working. Please upgrade to the v4 versions of the environments (which depend on the mujoco python bindings instead), unless you are trying to precisely replicate previous works).
  "This version of the mujoco environments depends "
/usr/local/lib/python3.7/dist-packages/gym/spaces/box.py:112: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:70: UserWarning: WARN: Agent's minimum action space value is -infinity. This is probably too low.
  "Agent's minimum action space value is -infinity. This is probably too low."
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:74: UserWarning: WARN: Agent's maximum action space value is infinity. This is probably too high
  "Agent's maximum action space value is infinity. This is probably too high"
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:98: UserWarning: WARN: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
  "We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) "
load datafile:  19%|█▉        | 4/21 [00:00<00:03,  5.16it/s]/usr/local/lib/python3.7/dist-packages/h5py/_hl/dataset.py:767: DeprecationWarning: Passing None into shape arguments as an alias for () is deprecated.
  arr = numpy.ndarray(selection.mshape, dtype=new_dtype)
load datafile: 100%|██████████| 21/21 [00:01<00:00, 15.05it/s]

In [ ]:

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
DEVICE = "cuda"

Load the pipeline!

In [ ]:

pipeline = ValueGuidedRLPipeline.from_pretrained(
        "bglick13/hopper-medium-v2-value-function-hor32",
        env=env,
    )

The config attributes {'args': ['diffusers', 'DDPMScheduler'], 'kwargs': ['diffusers', 'PNDMScheduler']} were passed to ValueGuidedDiffuserPipeline, but are not expected and will be ignored. Please verify your model_index.json configuration file.
load datafile: 100%|██████████| 21/21 [00:01<00:00, 13.65it/s]

In [ ]:

env.seed(0)
obs = env.reset()
total_reward = 0
total_score = 0
T = 100
rollout = [obs.copy()]
trajectories = []
y_maxes = [0]
for t in tqdm.tqdm(range(T)):
    # plan with the pipeline: it takes the raw observation and returns a de-normalized action
    denorm_actions = pipeline(obs, planning_horizon=32)

    # execute action in environment
    next_observation, reward, terminal, _ = env.step(denorm_actions)
    score = env.get_normalized_score(total_reward)
    
    # update return
    total_reward += reward
    total_score += score
    print(
        f"Step: {t}, Reward: {reward}, Total Reward: {total_reward}, Score: {score}, Total Score:"
        f" {total_score}"
    )
    # save observations for rendering
    rollout.append(next_observation.copy())

    obs = next_observation
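The normalized score rescales a return between a random-policy reference and an expert reference. The sketch below assumes the reference values that I believe D4RL uses for Hopper (treat both constants as assumptions); they reproduce the score printed for step 0 of this cell. Because the loop computes `score` before adding the new reward to `total_reward`, the score printed at each step corresponds to the return accumulated up to the previous step.

```python
# Assumed Hopper reference returns (random policy and expert policy).
REF_MIN_SCORE = -20.272305
REF_MAX_SCORE = 3234.3

def normalized_score(total_return):
    # 0 corresponds to the random reference, 1 to the expert reference
    return (total_return - REF_MIN_SCORE) / (REF_MAX_SCORE - REF_MIN_SCORE)

score_before_first_reward = normalized_score(0.0)
score_after_first_reward = normalized_score(0.9613453417633487)

print(score_before_first_reward, score_after_first_reward)
```

Under these assumptions the first value agrees with the score logged at step 0 and the second with the score logged at step 1, which is consistent with the one-step lag described above.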

/usr/local/lib/python3.7/dist-packages/gym/core.py:201: DeprecationWarning: WARN: Function `env.seed(seed)` is marked as deprecated and will be removed in the future. Please use `env.reset(seed=seed)` instead.
  "Function `env.seed(seed)` is marked as deprecated and will be removed in the future. "
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:217: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed a `seed` instead of using `Env.seed` for resetting the environment random number generator. 
  "Future gym versions will require that `Env.reset` can be passed a `seed` instead of using `Env.seed` for resetting the environment random number generator. "
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:229: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed `return_info` to return information from the environment resetting.
  "Future gym versions will require that `Env.reset` can be passed `return_info` to return information from the environment resetting."
/usr/local/lib/python3.7/dist-packages/gym/utils/passive_env_checker.py:234: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed `options` to allow the environment initialisation to be passed additional information.
  "Future gym versions will require that `Env.reset` can be passed `options` to allow the environment initialisation to be passed additional information."
  0%|          | 0/100 [00:00<?, ?it/s]
  0%|          | 0/20 [00:00<?, ?it/s]
  5%|▌         | 1/20 [00:00<00:13,  1.39it/s]
 15%|█▌        | 3/20 [00:00<00:04,  4.17it/s]
 25%|██▌       | 5/20 [00:01<00:02,  6.59it/s]
 35%|███▌      | 7/20 [00:01<00:01,  8.64it/s]
 45%|████▌     | 9/20 [00:01<00:01, 10.20it/s]
 55%|█████▌    | 11/20 [00:01<00:00, 11.50it/s]
 65%|██████▌   | 13/20 [00:01<00:00, 12.48it/s]
 75%|███████▌  | 15/20 [00:01<00:00, 13.27it/s]
 85%|████████▌ | 17/20 [00:01<00:00, 13.62it/s]
100%|██████████| 20/20 [00:02<00:00,  9.93it/s]
  1%|          | 1/100 [00:02<03:21,  2.03s/it]

Step: 0, Reward: 0.9613453417633487, Total Reward: 0.9613453417633487, Score: 0.006228869141685884, Total Score: 0.006228869141685884

  2%|▏         | 2/100 [00:03<02:38,  1.62s/it]

Step: 1, Reward: 0.9905865761281842, Total Reward: 1.951931917891533, Score: 0.006524252144941468, Total Score: 0.012753121286627353

  3%|▎         | 3/100 [00:04<02:25,  1.50s/it]

Step: 2, Reward: 1.09243849732667, Total Reward: 3.044370415218203, Score: 0.00682861981088834, Total Score: 0.019581741097515693

  4%|▍         | 4/100 [00:06<02:21,  1.47s/it]

Step: 3, Reward: 1.1971688596663932, Total Reward: 4.241539274884596, Score: 0.007164282501696702, Total Score: 0.026746023599212396

  5%|▌         | 5/100 [00:07<02:21,  1.49s/it]

Step: 4, Reward: 1.216392029159964, Total Reward: 5.4579313040445605, Score: 0.007532124647292049, Total Score: 0.03427814824650444

  6%|▌         | 6/100 [00:09<02:24,  1.53s/it]

Step: 5, Reward: 1.1837624373676066, Total Reward: 6.641693741412167, Score: 0.007905873304616767, Total Score: 0.042184021551121206

  7%|▋         | 7/100 [00:10<02:24,  1.55s/it]

Step: 6, Reward: 1.316867289864599, Total Reward: 7.9585610312767665, Score: 0.008269596192428782, Total Score: 0.05045361774354999

  8%|▊         | 8/100 [00:12<02:17,  1.50s/it]

Step: 7, Reward: 1.458532898214762, Total Reward: 9.417093929491529, Score: 0.00867421688186361, Total Score: 0.0591278346254136

  9%|▉         | 9/100 [00:13<02:11,  1.44s/it]

Step: 8, Reward: 1.5655737561538712, Total Reward: 10.9826676856454, Score: 0.009122365751063418, Total Score: 0.06825020037647701

 10%|█         | 10/100 [00:14<02:07,  1.42s/it]

Step: 9, Reward: 1.6407680269786085, Total Reward: 12.623435712624008, Score: 0.009603403998008702, Total Score: 0.07785360437448571

 11%|█         | 11/100 [00:16<02:05,  1.41s/it]

Step: 10, Reward: 1.7056611798297208, Total Reward: 14.329096892453729, Score: 0.010107546439231438, Total Score: 0.08796115081371715

 12%|█▏        | 12/100 [00:17<02:02,  1.39s/it]

Step: 11, Reward: 1.7747508738212663, Total Reward: 16.103847766274995, Score: 0.010631627952863603, Total Score: 0.09859277876658075

 13%|█▎        | 13/100 [00:19<01:59,  1.37s/it]

Step: 12, Reward: 1.818083363081429, Total Reward: 17.921931129356423, Score: 0.01117693796828244, Total Score: 0.10976971673486319

 14%|█▍        | 14/100 [00:20<02:07,  1.49s/it]

Step: 13, Reward: 1.816563587926592, Total Reward: 19.738494717283015, Score: 0.011735562325863405, Total Score: 0.1215052790607266

 15%|█▌        | 15/100 [00:22<02:07,  1.50s/it]

Step: 14, Reward: 1.8434045339664338, Total Reward: 21.58189925124945, Score: 0.012293719717277263, Total Score: 0.13379899877800386

 16%|█▌        | 16/100 [00:23<02:02,  1.46s/it]

Step: 15, Reward: 1.8763610133755022, Total Reward: 23.45826026462495, Score: 0.012860124258707918, Total Score: 0.14665912303671177

 17%|█▋        | 17/100 [00:25<02:02,  1.47s/it]

Step: 16, Reward: 1.8580889190962222, Total Reward: 25.316349183721172, Score: 0.013436655009151793, Total Score: 0.16009577804586356

 18%|█▊        | 18/100 [00:26<02:03,  1.50s/it]

Step: 17, Reward: 1.833873406539959, Total Reward: 27.15022259026113, Score: 0.014007571475269827, Total Score: 0.17410334952113338

 19%|█▉        | 19/100 [00:28<02:04,  1.54s/it]

Step: 18, Reward: 1.7923612227351886, Total Reward: 28.942583812996318, Score: 0.01457104748215484, Total Score: 0.18867439700328822

 20%|██        | 20/100 [00:29<02:00,  1.51s/it]

Step: 19, Reward: 1.7772864941994806, Total Reward: 30.719870307195798, Score: 0.015121768453995469, Total Score: 0.20379616545728368

 21%|██        | 21/100 [00:31<01:55,  1.46s/it]

Step: 20, Reward: 1.7596802514374612, Total Reward: 32.47955055863326, Score: 0.01566785756422019, Total Score: 0.21946402302150386

 22%|██▏       | 22/100 [00:32<01:50,  1.42s/it]

Step: 21, Reward: 1.7213467511037084, Total Reward: 34.200897309736966, Score: 0.0162085369796795, Total Score: 0.23567256000118336

 23%|██▎       | 23/100 [00:33<01:46,  1.38s/it]

Step: 22, Reward: 1.681650865266795, Total Reward: 35.88254817500376, Score: 0.016737438042488645, Total Score: 0.25240999804367203

 24%|██▍       | 24/100 [00:35<01:44,  1.38s/it]

Step: 23, Reward: 1.649656828943054, Total Reward: 37.532205003946814, Score: 0.017254142146030386, Total Score: 0.2696641401897024

 25%|██▌       | 25/100 [00:36<01:41,  1.36s/it]

Step: 24, Reward: 1.6520971180163586, Total Reward: 39.18430212196317, Score: 0.017761015760854884, Total Score: 0.2874251559505573

 26%|██▌       | 26/100 [00:37<01:39,  1.35s/it]

Step: 25, Reward: 1.6532160502693138, Total Reward: 40.837518172232485, Score: 0.0182686391789852, Total Score: 0.3056937951295425

 27%|██▋       | 27/100 [00:39<01:38,  1.35s/it]

Step: 26, Reward: 1.6534424303321889, Total Reward: 42.49096060256468, Score: 0.0187766064002786, Total Score: 0.3244704015298211

 28%|██▊       | 28/100 [00:40<01:36,  1.34s/it]

Step: 27, Reward: 1.6543163640247096, Total Reward: 44.145276966589385, Score: 0.019284643179118023, Total Score: 0.34375504470893914

 29%|██▉       | 29/100 [00:41<01:35,  1.35s/it]

Step: 28, Reward: 1.6274647523947772, Total Reward: 45.772741718984165, Score: 0.019792948482854303, Total Score: 0.36354799319179343

 30%|███       | 30/100 [00:43<01:37,  1.39s/it]

Step: 29, Reward: 1.6125653440731562, Total Reward: 47.38530706305732, Score: 0.020293003359464205, Total Score: 0.38384099655125764

 31%|███       | 31/100 [00:44<01:39,  1.45s/it]

Step: 30, Reward: 1.6173703384364473, Total Reward: 49.00267740149376, Score: 0.0207884802433533, Total Score: 0.4046294767946109

 32%|███▏      | 32/100 [00:46<01:40,  1.48s/it]

Step: 31, Reward: 1.5497623762580708, Total Reward: 50.552439777751836, Score: 0.02128543350997813, Total Score: 0.42591491030458906

 33%|███▎      | 33/100 [00:47<01:37,  1.45s/it]

Step: 32, Reward: 1.490298587276498, Total Reward: 52.042738365028335, Score: 0.0217616135517849, Total Score: 0.44767652385637396

 34%|███▍      | 34/100 [00:49<01:33,  1.41s/it]

Step: 33, Reward: 1.4584480178869372, Total Reward: 53.50118638291527, Score: 0.022219522747714257, Total Score: 0.46989604660408824

 35%|███▌      | 35/100 [00:50<01:30,  1.39s/it]

Step: 34, Reward: 1.4048398211769302, Total Reward: 54.9060262040922, Score: 0.022667645536581578, Total Score: 0.4925636921406698

 36%|███▌      | 36/100 [00:51<01:28,  1.38s/it]

Step: 35, Reward: 1.378891084714629, Total Reward: 56.28491728880683, Score: 0.023099296669057166, Total Score: 0.5156629888097269

 37%|███▋      | 37/100 [00:53<01:26,  1.37s/it]

Step: 36, Reward: 1.3393234239492198, Total Reward: 57.62424071275605, Score: 0.023522974791861884, Total Score: 0.5391859636015888

 38%|███▊      | 38/100 [00:54<01:23,  1.35s/it]

Step: 37, Reward: 1.2729857307511654, Total Reward: 58.897226443507215, Score: 0.023934495353839142, Total Score: 0.563120458955428

 39%|███▉      | 39/100 [00:55<01:21,  1.34s/it]

Step: 38, Reward: 1.2637156051988185, Total Reward: 60.16094204870603, Score: 0.024325632993889564, Total Score: 0.5874460919493175

 40%|████      | 40/100 [00:57<01:20,  1.34s/it]

Step: 39, Reward: 1.2776175812728203, Total Reward: 61.438559629978855, Score: 0.0247139222948393, Total Score: 0.6121600142441568

 41%|████      | 41/100 [00:58<01:19,  1.35s/it]

Step: 40, Reward: 1.203964270030611, Total Reward: 62.642523900009465, Score: 0.02510648311744263, Total Score: 0.6372664973615995

 42%|████▏     | 42/100 [00:59<01:17,  1.34s/it]

Step: 41, Reward: 1.092016290948236, Total Reward: 63.7345401909577, Score: 0.025476413221063608, Total Score: 0.6627429105826631

 43%|████▎     | 43/100 [01:01<01:18,  1.38s/it]

Step: 42, Reward: 1.1296472681807004, Total Reward: 64.8641874591384, Score: 0.0258119461847254, Total Score: 0.6885548567673885

 44%|████▍     | 44/100 [01:02<01:20,  1.44s/it]

Step: 43, Reward: 1.1803122177054457, Total Reward: 66.04449967684386, Score: 0.026159041643764744, Total Score: 0.7147138984111532

 45%|████▌     | 45/100 [01:04<01:21,  1.48s/it]

Step: 44, Reward: 1.200053185050725, Total Reward: 67.24455286189458, Score: 0.02652170441696297, Total Score: 0.7412356028281162

 46%|████▌     | 46/100 [01:05<01:19,  1.47s/it]

Step: 45, Reward: 1.196837866003687, Total Reward: 68.44139072789827, Score: 0.026890432800476552, Total Score: 0.7681260356285927

100%|██████████| 20/20 [00:01<00:00, 15.01it/s]
 47%|████▋     | 47/100 [01:07<01:16,  1.44s/it]

Step: 46, Reward: 1.122069119979826, Total Reward: 69.5634598478781, Score: 0.027258173244947545, Total Score: 0.7953842088735402

100%|██████████| 20/20 [00:01<00:00, 15.26it/s]
 48%|████▊     | 48/100 [01:08<01:13,  1.41s/it]

Step: 47, Reward: 1.0683895229010438, Total Reward: 70.63184937077914, Score: 0.027602940241906255, Total Score: 0.8229871491154465

100%|██████████| 20/20 [00:01<00:00, 15.59it/s]
 49%|████▉     | 49/100 [01:09<01:10,  1.37s/it]

Step: 48, Reward: 1.0614508004688126, Total Reward: 71.69330017124796, Score: 0.027931213643993428, Total Score: 0.8509183627594399

100%|██████████| 20/20 [00:01<00:00, 15.56it/s]
 50%|█████     | 50/100 [01:11<01:07,  1.35s/it]

Step: 49, Reward: 0.9609488901545228, Total Reward: 72.65424906140248, Score: 0.028257355053983954, Total Score: 0.8791757178134239

100%|██████████| 20/20 [00:01<00:00, 15.39it/s]
 51%|█████     | 51/100 [01:12<01:05,  1.34s/it]

Step: 50, Reward: 0.9235699583514453, Total Reward: 73.57781901975393, Score: 0.028552616243504376, Total Score: 0.9077283340569282

100%|██████████| 20/20 [00:01<00:00, 15.71it/s]
 52%|█████▏    | 52/100 [01:13<01:03,  1.33s/it]

Step: 51, Reward: 0.9895374500855947, Total Reward: 74.56735646983952, Score: 0.028836392381134678, Total Score: 0.9365647264380629

100%|██████████| 20/20 [00:01<00:00, 15.42it/s]
 53%|█████▎    | 53/100 [01:15<01:02,  1.33s/it]

Step: 52, Reward: 0.960900590552957, Total Reward: 75.52825706039248, Score: 0.029140437692577098, Total Score: 0.96570516413064

100%|██████████| 20/20 [00:01<00:00, 15.96it/s]
 54%|█████▍    | 54/100 [01:16<01:00,  1.31s/it]

Step: 53, Reward: 0.934898411966141, Total Reward: 76.46315547235862, Score: 0.029435684041560255, Total Score: 0.9951408481722003

100%|██████████| 20/20 [00:01<00:00, 15.13it/s]
 55%|█████▌    | 55/100 [01:17<00:59,  1.32s/it]

Step: 54, Reward: 0.9399328645284916, Total Reward: 77.40308833688711, Score: 0.029722940960243506, Total Score: 1.0248637891324437

100%|██████████| 20/20 [00:01<00:00, 14.32it/s]
 56%|█████▌    | 56/100 [01:19<00:59,  1.35s/it]

Step: 55, Reward: 0.9360239412600334, Total Reward: 78.33911227814714, Score: 0.030011744764996736, Total Score: 1.0548755338974405

100%|██████████| 20/20 [00:01<00:00, 13.38it/s]
 57%|█████▋    | 57/100 [01:20<00:59,  1.39s/it]

Step: 56, Reward: 0.965454454056784, Total Reward: 79.30456673220392, Score: 0.0302993475138501, Total Score: 1.0851748814112905

100%|██████████| 20/20 [00:01<00:00, 12.96it/s]
 58%|█████▊    | 58/100 [01:22<01:00,  1.45s/it]

Step: 57, Reward: 0.9718332611847573, Total Reward: 80.27639999338868, Score: 0.030595993083092347, Total Score: 1.1157708744943828

100%|██████████| 20/20 [00:01<00:00, 13.56it/s]
 59%|█████▉    | 59/100 [01:23<00:59,  1.46s/it]

Step: 58, Reward: 0.9845372594852125, Total Reward: 81.26093725287389, Score: 0.030894598604835323, Total Score: 1.146665473099218

100%|██████████| 20/20 [00:01<00:00, 15.61it/s]
 60%|██████    | 60/100 [01:25<00:56,  1.41s/it]

Step: 59, Reward: 0.9294392657576154, Total Reward: 82.1903765186315, Score: 0.031197107557539388, Total Score: 1.1778625806567575

100%|██████████| 20/20 [00:01<00:00, 15.73it/s]
 61%|██████    | 61/100 [01:26<00:53,  1.37s/it]

Step: 60, Reward: 0.8527246579784103, Total Reward: 83.04310117660991, Score: 0.031482687098768114, Total Score: 1.2093452677555256

100%|██████████| 20/20 [00:01<00:00, 15.21it/s]
 62%|██████▏   | 62/100 [01:27<00:51,  1.36s/it]

Step: 61, Reward: 0.8159099155858793, Total Reward: 83.85901109219579, Score: 0.031744695306933704, Total Score: 1.2410899630624594

100%|██████████| 20/20 [00:01<00:00, 15.16it/s]
 63%|██████▎   | 63/100 [01:28<00:50,  1.36s/it]

Step: 62, Reward: 0.741569097121501, Total Reward: 84.60058018931728, Score: 0.03199539181606714, Total Score: 1.2730853548785266

100%|██████████| 20/20 [00:01<00:00, 15.69it/s]
 64%|██████▍   | 64/100 [01:30<00:48,  1.34s/it]

Step: 63, Reward: 0.7086527176231232, Total Reward: 85.30923290694041, Score: 0.03222324636272516, Total Score: 1.3053086012412518

100%|██████████| 20/20 [00:01<00:00, 15.50it/s]
 65%|██████▌   | 65/100 [01:31<00:46,  1.33s/it]

Step: 64, Reward: 0.6771859704978528, Total Reward: 85.98641887743827, Score: 0.03244098702146991, Total Score: 1.3377495882627217

100%|██████████| 20/20 [00:01<00:00, 15.28it/s]
 66%|██████▌   | 66/100 [01:32<00:45,  1.33s/it]

Step: 65, Reward: 0.6582119187599296, Total Reward: 86.6446307961982, Score: 0.032649059206394944, Total Score: 1.3703986474691168

100%|██████████| 20/20 [00:01<00:00, 15.52it/s]
 67%|██████▋   | 67/100 [01:34<00:43,  1.32s/it]

Step: 66, Reward: 0.7254782825315979, Total Reward: 87.3701090787298, Score: 0.0328513014235209, Total Score: 1.4032499488926378

100%|██████████| 20/20 [00:01<00:00, 15.03it/s]
 68%|██████▊   | 68/100 [01:35<00:42,  1.33s/it]

Step: 67, Reward: 0.823284930294071, Total Reward: 88.19339400902386, Score: 0.03307421190592654, Total Score: 1.4363241607985644

100%|██████████| 20/20 [00:01<00:00, 13.69it/s]
 69%|██████▉   | 69/100 [01:37<00:42,  1.37s/it]

Step: 68, Reward: 0.8182504990698899, Total Reward: 89.01164450809375, Score: 0.033327174462336566, Total Score: 1.4696513352609009

100%|██████████| 20/20 [00:01<00:00, 12.72it/s]
 70%|███████   | 70/100 [01:38<00:43,  1.44s/it]

Step: 69, Reward: 0.8299839711613238, Total Reward: 89.84162847925508, Score: 0.03357859013923298, Total Score: 1.5032299254001338

100%|██████████| 20/20 [00:01<00:00, 12.71it/s]
 71%|███████   | 71/100 [01:40<00:43,  1.48s/it]

Step: 70, Reward: 0.9397779200680799, Total Reward: 90.78140639932316, Score: 0.03383361104317364, Total Score: 1.5370635364433074

100%|██████████| 20/20 [00:01<00:00, 14.02it/s]
 72%|███████▏  | 72/100 [01:41<00:41,  1.47s/it]

Step: 71, Reward: 0.9615393552198243, Total Reward: 91.74294575454299, Score: 0.034122367239686556, Total Score: 1.571185903682994

100%|██████████| 20/20 [00:01<00:00, 14.94it/s]
 73%|███████▎  | 73/100 [01:43<00:38,  1.44s/it]

Step: 72, Reward: 1.0080340018794205, Total Reward: 92.75097975642241, Score: 0.03441780985552355, Total Score: 1.6056037135385175

100%|██████████| 20/20 [00:01<00:00, 15.18it/s]
 74%|███████▍  | 74/100 [01:44<00:36,  1.41s/it]

Step: 73, Reward: 1.1409848317293718, Total Reward: 93.89196458815178, Score: 0.03472753841811556, Total Score: 1.640331251956633

100%|██████████| 20/20 [00:01<00:00, 15.36it/s]
 75%|███████▌  | 75/100 [01:45<00:34,  1.38s/it]

Step: 74, Reward: 1.2590534676832261, Total Reward: 95.15101805583501, Score: 0.03507811745732648, Total Score: 1.6754093694139596

100%|██████████| 20/20 [00:01<00:00, 15.14it/s]
 76%|███████▌  | 76/100 [01:47<00:32,  1.37s/it]

Step: 75, Reward: 1.3807002445735241, Total Reward: 96.53171830040853, Score: 0.03546497426974049, Total Score: 1.7108743436837002

100%|██████████| 20/20 [00:01<00:00, 15.06it/s]
 77%|███████▋  | 77/100 [01:48<00:31,  1.36s/it]

Step: 76, Reward: 1.5025187688076878, Total Reward: 98.03423706921622, Score: 0.035889208275066586, Total Score: 1.7467635519587668

100%|██████████| 20/20 [00:01<00:00, 15.09it/s]
 78%|███████▊  | 78/100 [01:49<00:29,  1.36s/it]

Step: 77, Reward: 1.6138815520878553, Total Reward: 99.64811862130408, Score: 0.036350872244399625, Total Score: 1.7831144242031665

100%|██████████| 20/20 [00:01<00:00, 15.43it/s]
 79%|███████▉  | 79/100 [01:51<00:28,  1.34s/it]

Step: 78, Reward: 1.7165485029744698, Total Reward: 101.36466712427855, Score: 0.03684675354640925, Total Score: 1.8199611777495757

100%|██████████| 20/20 [00:01<00:00, 15.33it/s]
 80%|████████  | 80/100 [01:52<00:26,  1.34s/it]

Step: 79, Reward: 1.8184694262333714, Total Reward: 103.18313655051193, Score: 0.037374180299330775, Total Score: 1.8573353580489065

100%|██████████| 20/20 [00:01<00:00, 14.48it/s]
 81%|████████  | 81/100 [01:53<00:25,  1.35s/it]

Step: 80, Reward: 1.8574644990299636, Total Reward: 105.04060104954189, Score: 0.03793292327868928, Total Score: 1.8952682813275958

100%|██████████| 20/20 [00:01<00:00, 13.36it/s]
 82%|████████▏ | 82/100 [01:55<00:25,  1.40s/it]

Step: 81, Reward: 1.8645829089491746, Total Reward: 106.90518395849107, Score: 0.038503647885475965, Total Score: 1.9337719292130717

100%|██████████| 20/20 [00:01<00:00, 13.01it/s]
 83%|████████▎ | 83/100 [01:56<00:24,  1.45s/it]

Step: 82, Reward: 1.8941410550971716, Total Reward: 108.79932501358823, Score: 0.03907655969514282, Total Score: 1.9728484889082145

100%|██████████| 20/20 [00:01<00:00, 12.72it/s]
 84%|████████▍ | 84/100 [01:58<00:23,  1.49s/it]

Step: 83, Reward: 1.8502530264855113, Total Reward: 110.64957804007375, Score: 0.039658553541826506, Total Score: 2.012507042450041

100%|██████████| 20/20 [00:01<00:00, 14.74it/s]
 85%|████████▌ | 85/100 [01:59<00:21,  1.46s/it]

Step: 84, Reward: 1.4546757588027448, Total Reward: 112.1042537988765, Score: 0.040227062351307546, Total Score: 2.0527341048013485

100%|██████████| 20/20 [00:01<00:00, 15.58it/s]
 86%|████████▌ | 86/100 [02:01<00:19,  1.41s/it]

Step: 85, Reward: 1.2088353186704868, Total Reward: 113.31308911754698, Score: 0.04067402607571703, Total Score: 2.093408130877066

100%|██████████| 20/20 [00:01<00:00, 14.98it/s]
 87%|████████▋ | 87/100 [02:02<00:18,  1.39s/it]

Step: 86, Reward: 1.0966750044720495, Total Reward: 114.40976412201903, Score: 0.04104545285791306, Total Score: 2.1344535837349787

100%|██████████| 20/20 [00:01<00:00, 15.03it/s]
 88%|████████▊ | 88/100 [02:03<00:16,  1.38s/it]

Step: 87, Reward: 1.043542650360527, Total Reward: 115.45330677237955, Score: 0.04138241725805474, Total Score: 2.1758360009930335

100%|██████████| 20/20 [00:01<00:00, 15.53it/s]
 89%|████████▉ | 89/100 [02:05<00:14,  1.36s/it]

Step: 88, Reward: 1.0156134737161555, Total Reward: 116.4689202460957, Score: 0.04170305620921811, Total Score: 2.2175390572022518

100%|██████████| 20/20 [00:01<00:00, 15.12it/s]
 90%|█████████ | 90/100 [02:06<00:13,  1.35s/it]

Step: 89, Reward: 1.0140850969394835, Total Reward: 117.48300534303519, Score: 0.04201511364059115, Total Score: 2.259554170842843

100%|██████████| 20/20 [00:01<00:00, 15.03it/s]
 91%|█████████ | 91/100 [02:07<00:12,  1.35s/it]

Step: 90, Reward: 1.0251360576856314, Total Reward: 118.50814140072082, Score: 0.042326701462862465, Total Score: 2.3018808723057056

100%|██████████| 20/20 [00:01<00:00, 14.98it/s]
 92%|█████████▏| 92/100 [02:09<00:10,  1.35s/it]

Step: 91, Reward: 1.0343617856501035, Total Reward: 119.54250318637092, Score: 0.04264168480371826, Total Score: 2.344522557109424

  0%|          | 0/20 [00:00<?, ?it/s]
 10%|█         | 2/20 [00:00<00:01, 13.79it/s]
 20%|██        | 4/20 [00:00<00:01, 14.53it/s]
 30%|███       | 6/20 [00:00<00:00, 14.53it/s]
 40%|████      | 8/20 [00:00<00:00, 14.66it/s]
 50%|█████     | 10/20 [00:00<00:00, 15.11it/s]
 60%|██████    | 12/20 [00:00<00:00, 15.08it/s]
 70%|███████   | 14/20 [00:00<00:00, 15.26it/s]
 80%|████████  | 16/20 [00:01<00:00, 15.33it/s]
 90%|█████████ | 18/20 [00:01<00:00, 15.31it/s]
100%|██████████| 20/20 [00:01<00:00, 14.74it/s]
 93%|█████████▎| 93/100 [02:10<00:09,  1.36s/it]

Step: 92, Reward: 1.0457811860341049, Total Reward: 120.58828437240503, Score: 0.04295950284207034, Total Score: 2.3874820599514943

  0%|          | 0/20 [00:00<?, ?it/s]
 10%|█         | 2/20 [00:00<00:01, 15.83it/s]
 20%|██        | 4/20 [00:00<00:01, 14.78it/s]
 30%|███       | 6/20 [00:00<00:00, 14.76it/s]
 40%|████      | 8/20 [00:00<00:00, 14.86it/s]
 50%|█████     | 10/20 [00:00<00:00, 14.98it/s]
 60%|██████    | 12/20 [00:00<00:00, 14.93it/s]
 70%|███████   | 14/20 [00:00<00:00, 14.63it/s]
 80%|████████  | 16/20 [00:01<00:00, 14.11it/s]
 90%|█████████ | 18/20 [00:01<00:00, 14.05it/s]
100%|██████████| 20/20 [00:01<00:00, 14.33it/s]
 94%|█████████▍| 94/100 [02:11<00:08,  1.37s/it]

Step: 93, Reward: 1.0595894780069643, Total Reward: 121.647873850412, Score: 0.043280829605782875, Total Score: 2.4307628895572773

  0%|          | 0/20 [00:00<?, ?it/s]
 10%|█         | 2/20 [00:00<00:01, 12.72it/s]
 20%|██        | 4/20 [00:00<00:01, 13.70it/s]
 30%|███       | 6/20 [00:00<00:01, 13.56it/s]
 40%|████      | 8/20 [00:00<00:00, 12.96it/s]
 50%|█████     | 10/20 [00:00<00:00, 13.09it/s]
 60%|██████    | 12/20 [00:00<00:00, 13.01it/s]
 70%|███████   | 14/20 [00:01<00:00, 12.00it/s]
 80%|████████  | 16/20 [00:01<00:00, 11.09it/s]
 90%|█████████ | 18/20 [00:01<00:00, 10.80it/s]
100%|██████████| 20/20 [00:01<00:00, 11.62it/s]
 95%|█████████▌| 95/100 [02:13<00:07,  1.48s/it]

Step: 94, Reward: 1.069293613696973, Total Reward: 122.71716746410897, Score: 0.04360639910576882, Total Score: 2.474369288663046

  0%|          | 0/20 [00:00<?, ?it/s]
  5%|▌         | 1/20 [00:00<00:01,  9.76it/s]
 10%|█         | 2/20 [00:00<00:01,  9.55it/s]
 15%|█▌        | 3/20 [00:00<00:01,  9.64it/s]
 25%|██▌       | 5/20 [00:00<00:01,  9.93it/s]
 30%|███       | 6/20 [00:00<00:01,  9.81it/s]
 35%|███▌      | 7/20 [00:00<00:01,  9.29it/s]
 40%|████      | 8/20 [00:00<00:01,  9.40it/s]
 45%|████▌     | 9/20 [00:00<00:01,  9.26it/s]
 50%|█████     | 10/20 [00:01<00:01,  9.20it/s]
 55%|█████▌    | 11/20 [00:01<00:00,  9.37it/s]
 65%|██████▌   | 13/20 [00:01<00:00,  9.93it/s]
 70%|███████   | 14/20 [00:01<00:00,  9.72it/s]
 75%|███████▌  | 15/20 [00:01<00:00,  9.75it/s]
 80%|████████  | 16/20 [00:01<00:00,  9.65it/s]
 85%|████████▌ | 17/20 [00:01<00:00,  9.54it/s]
 90%|█████████ | 18/20 [00:01<00:00,  9.59it/s]
 95%|█████████▌| 19/20 [00:01<00:00,  9.50it/s]
100%|██████████| 20/20 [00:02<00:00,  9.50it/s]
 96%|█████████▌| 96/100 [02:15<00:06,  1.68s/it]

Step: 95, Reward: 1.0759698500937362, Total Reward: 123.79313731420271, Score: 0.0439349502988255, Total Score: 2.5183042389618717

  0%|          | 0/20 [00:00<?, ?it/s]
  5%|▌         | 1/20 [00:00<00:02,  9.05it/s]
 10%|█         | 2/20 [00:00<00:02,  8.69it/s]
 15%|█▌        | 3/20 [00:00<00:01,  8.90it/s]
 20%|██        | 4/20 [00:00<00:01,  9.18it/s]
 25%|██▌       | 5/20 [00:00<00:01,  8.80it/s]
 30%|███       | 6/20 [00:00<00:01,  9.11it/s]
 35%|███▌      | 7/20 [00:00<00:01,  9.34it/s]
 45%|████▌     | 9/20 [00:00<00:01, 10.04it/s]
 55%|█████▌    | 11/20 [00:01<00:00, 10.33it/s]
 65%|██████▌   | 13/20 [00:01<00:00, 10.58it/s]
 75%|███████▌  | 15/20 [00:01<00:00, 10.54it/s]
 85%|████████▌ | 17/20 [00:01<00:00, 10.08it/s]
100%|██████████| 20/20 [00:02<00:00,  9.93it/s]
 97%|█████████▋| 97/100 [02:17<00:05,  1.78s/it]

Step: 96, Reward: 1.3074041389557671, Total Reward: 125.10054145315847, Score: 0.044265552832510414, Total Score: 2.562569791794382

  0%|          | 0/20 [00:00<?, ?it/s]
  5%|▌         | 1/20 [00:00<00:01,  9.85it/s]
 10%|█         | 2/20 [00:00<00:01,  9.60it/s]
 15%|█▌        | 3/20 [00:00<00:01,  9.70it/s]
 25%|██▌       | 5/20 [00:00<00:01, 10.27it/s]
 35%|███▌      | 7/20 [00:00<00:01, 10.26it/s]
 45%|████▌     | 9/20 [00:00<00:00, 11.77it/s]
 55%|█████▌    | 11/20 [00:00<00:00, 12.70it/s]
 65%|██████▌   | 13/20 [00:01<00:00, 13.11it/s]
 75%|███████▌  | 15/20 [00:01<00:00, 13.53it/s]
 85%|████████▌ | 17/20 [00:01<00:00, 14.12it/s]
100%|██████████| 20/20 [00:01<00:00, 12.73it/s]
 98%|█████████▊| 98/100 [02:19<00:03,  1.73s/it]

Step: 97, Reward: 1.6721521590810136, Total Reward: 126.77269361223948, Score: 0.04466726587386679, Total Score: 2.607237057668249

  0%|          | 0/20 [00:00<?, ?it/s]
 10%|█         | 2/20 [00:00<00:01, 14.44it/s]
 20%|██        | 4/20 [00:00<00:01, 14.78it/s]
 30%|███       | 6/20 [00:00<00:00, 14.94it/s]
 40%|████      | 8/20 [00:00<00:00, 14.87it/s]
 50%|█████     | 10/20 [00:00<00:00, 14.95it/s]
 60%|██████    | 12/20 [00:00<00:00, 14.97it/s]
 70%|███████   | 14/20 [00:00<00:00, 15.18it/s]
 80%|████████  | 16/20 [00:01<00:00, 14.82it/s]
 90%|█████████ | 18/20 [00:01<00:00, 15.00it/s]
100%|██████████| 20/20 [00:01<00:00, 14.79it/s]
 99%|█████████▉| 99/100 [02:20<00:01,  1.62s/it]

Step: 98, Reward: 1.75613883059031, Total Reward: 128.52883244282978, Score: 0.04518105140461443, Total Score: 2.652418109072863

  0%|          | 0/20 [00:00<?, ?it/s]
 10%|█         | 2/20 [00:00<00:01, 15.29it/s]
 20%|██        | 4/20 [00:00<00:01, 15.03it/s]
 30%|███       | 6/20 [00:00<00:00, 15.02it/s]
 40%|████      | 8/20 [00:00<00:00, 15.03it/s]
 50%|█████     | 10/20 [00:00<00:00, 14.75it/s]
 60%|██████    | 12/20 [00:00<00:00, 14.66it/s]
 70%|███████   | 14/20 [00:00<00:00, 14.63it/s]
 80%|████████  | 16/20 [00:01<00:00, 14.75it/s]
 90%|█████████ | 18/20 [00:01<00:00, 14.54it/s]
100%|██████████| 20/20 [00:01<00:00, 14.71it/s]
100%|██████████| 100/100 [02:22<00:00,  1.42s/it]

Step: 99, Reward: 1.7563279535050194, Total Reward: 130.2851603963348, Score: 0.045720642682980664, Total Score: 2.698138751755844
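
The logged quantities appear to be related by simple bookkeeping, which the sketch below re-derives from two consecutive log lines. It is a hedged reconstruction, not the notebook's own code: the reference returns `REF_MIN_RETURN` and `REF_MAX_RETURN` are assumed values for D4RL's Hopper task, `normalized_score` is a hypothetical helper, and the update order (score computed before the new reward is added) is inferred from the numbers rather than stated in the output.

```python
# Assumed D4RL reference returns for the Hopper task (not stated in the log above).
REF_MIN_RETURN = -20.272305
REF_MAX_RETURN = 3234.3


def normalized_score(total_reward):
    """Hypothetical helper: rescale a cumulative return so that
    REF_MIN_RETURN maps to 0 and REF_MAX_RETURN maps to 1."""
    return (total_reward - REF_MIN_RETURN) / (REF_MAX_RETURN - REF_MIN_RETURN)


# Values copied from the "Step: 88" and "Step: 89" log lines.
total_reward_88 = 116.4689202460957
total_score_88 = 2.2175390572022518
reward_89 = 1.0140850969394835

# Inferred update order: the score uses the return *before* this step's reward.
score_89 = normalized_score(total_reward_88)
total_reward_89 = total_reward_88 + reward_89
total_score_89 = total_score_88 + score_89

# These should land close to the values logged for Step 89.
print(score_89, total_reward_89, total_score_89)
```

Read this way, `Total Score` is a running sum of the per-step normalized scores, so the most recent `Score` is likely the more meaningful indicator of how the rollout compares to the reference returns.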

In [ ]:

show_sample(render, np.expand_dims(np.stack(rollout), axis=0))
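
The array manipulation in the call above can be illustrated in isolation. This is a minimal sketch with a stand-in `rollout`: the sizes `T` and `obs_dim` are hypothetical, chosen only to make the shapes visible, and the real observations come from the environment loop.

```python
import numpy as np

# Hypothetical stand-in for `rollout`: a list of T observation vectors.
T, obs_dim = 5, 11  # assumed sizes, for illustration only
rollout = [np.zeros(obs_dim) for _ in range(T)]

# np.stack joins the list along a new leading axis: (T, obs_dim).
stacked = np.stack(rollout)

# np.expand_dims adds a batch axis of size 1 in front: (1, T, obs_dim).
batched = np.expand_dims(stacked, axis=0)

print(stacked.shape, batched.shape)
```

The leading axis of size 1 presents the single rollout as a batch containing one trajectory, which appears to be the layout `show_sample` expects.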
