Imitation#
Imitation provides clean implementations of imitation and reward learning algorithms, under a unified and user-friendly API. Currently, we have implementations of Behavioral Cloning, DAgger (with synthetic examples), density-based reward modeling, Maximum Causal Entropy Inverse Reinforcement Learning, Adversarial Inverse Reinforcement Learning, Generative Adversarial Imitation Learning, and Deep RL from Human Preferences.
You can find us on GitHub at http://github.com/HumanCompatibleAI/imitation.
Main Features#
Built on and compatible with Stable Baselines 3 (SB3).
Modular PyTorch implementations of Behavioral Cloning, DAgger, GAIL, and AIRL that can train arbitrary SB3 policies.
GAIL and AIRL have customizable reward and discriminator networks.
Scripts to train policies using SB3 and save rollouts from these policies as synthetic “expert” demonstrations.
Data structures and scripts for loading and storing expert demonstrations.
Citing imitation#
If you use imitation
in your research project, please cite our paper to help us track our impact and enable readers to more easily replicate your results. You may use the following BibTeX:
@misc{gleave2022imitation,
author = {Gleave, Adam and Taufeeque, Mohammad and Rocamonde, Juan and Jenner, Erik and Wang, Steven H. and Toyer, Sam and Ernestus, Maximilian and Belrose, Nora and Emmons, Scott and Russell, Stuart},
title = {imitation: Clean Imitation Learning Implementations},
year = {2022},
howPublished = {arXiv:2211.11972v1 [cs.LG]},
archivePrefix = {arXiv},
eprint = {2211.11972},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2211.11972},
}
What is imitation?#
imitation
is an open-source library providing high-quality, reliable and modular implementations of seven reward and imitation learning algorithms, built on modern backends like PyTorch and Stable Baselines3. It includes implementations of Behavioral Cloning (BC), DAgger, Generative Adversarial Imitation Learning (GAIL), Adversarial Inverse Reinforcement Learning (AIRL), Reward Learning through Preference Comparisons, Maximum Causal Entropy Inverse Reinforcement Learning (MCE IRL), and Density-based reward modeling. The algorithms follow a consistent interface, making it simple to train and compare a range of algorithms.
A key use case of imitation
is as an experimental baseline. Small implementation details in imitation learning algorithms can have significant impacts
on performance, which can lead to spurious positive results if a weak experimental baseline is used. To address this challenge, imitation
’s algorithms have been carefully benchmarked and compared to prior implementations. The codebase is statically type-checked and over 90% of it is covered by automated tests.
In addition to providing reliable baselines, imitation
aims to simplify the process of developing novel reward and imitation learning algorithms. Its implementations are modular: users can freely change the reward or policy network architecture, RL algorithm and optimizer without touching the codebase itself. Algorithms can be extended by subclassing and overriding relevant methods. imitation
also provides utility methods to handle common tasks to support the development of entirely novel algorithms.
Our goal for imitation
is to make it easier to use, develop, and compare imitation and reward learning algorithms. The library is in active development, and we welcome contributions and feedback.
Check out our recommended
First Steps for an overview of how to use the library. We also have tutorials, such as Train an Agent using Behavior Cloning, that provide detailed examples of specific algorithms. If you are interested in helping develop imitation
then we suggest you refer to the Developer Guide as well as more specific guidelines for Contributing.
Installation#
Prerequisites#
Python 3.8+
pip (it helps to make sure this is up-to-date: pip install -U pip)
(on ARM64 Macs) you need to set environment variables due to a bug in grpcio:
export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1
(Optional) OpenGL (to render gym environments)
(Optional) FFmpeg (to encode videos of renders)
Installation from PyPI#
To install the latest PyPI release, simply run:
pip install imitation
Installation from source#
Installation from source is useful if you wish to contribute to the development of imitation
, or if you need features that have not yet been made available in a stable release:
git clone http://github.com/HumanCompatibleAI/imitation
cd imitation
pip install -e .
There are also a number of dependencies used for running tests and building the documentation, which can be installed with:
pip install -e ".[dev]"
First Steps#
Imitation can be used in two main ways: through its command-line interface (CLI) or Python API. The CLI allows you to quickly train and test algorithms and policies directly from the command line. The Python API provides greater flexibility and extensibility, and allows you to inter-operate with your existing Python environment.
CLI Quickstart#
We provide several CLI scripts as front-ends to the algorithms implemented in imitation
.
These use Sacred for configuration and replicability.
For information on how to configure Sacred CLI options, see the Sacred docs.
#!/usr/bin/env bash
# Train PPO agent on pendulum and collect expert demonstrations. Tensorboard logs saved in quickstart/rl/
python -m imitation.scripts.train_rl with pendulum environment.fast policy_evaluation.fast rl.fast fast logging.log_dir=quickstart/rl/
# Train GAIL from demonstrations. Tensorboard logs saved in output/ (default log directory).
python -m imitation.scripts.train_adversarial gail with pendulum environment.fast demonstrations.fast policy_evaluation.fast rl.fast fast demonstrations.path=quickstart/rl/rollouts/final.npz demonstrations.source=local
# Train AIRL from demonstrations. Tensorboard logs saved in output/ (default log directory).
python -m imitation.scripts.train_adversarial airl with pendulum environment.fast demonstrations.fast policy_evaluation.fast rl.fast fast demonstrations.path=quickstart/rl/rollouts/final.npz demonstrations.source=local
Note
Remove the fast options from the commands above to allow the training runs to run to completion.
Tip
python -m imitation.scripts.train_rl print_config
will list Sacred script options.
These configuration options are also documented in each script’s docstrings.
Python Interface Quickstart#
Here’s an example script that downloads a pretrained CartPole expert, samples demonstrations from it, and trains a policy on them using Behavioral Cloning. You will need to pip install seals or pip install imitation[test] to run this.
"""This is a simple example demonstrating how to clone the behavior of an expert.
Refer to the jupyter notebooks for more detailed examples of how to use the algorithms.
"""
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy
from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
rng = np.random.default_rng(0)
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],  # for computing rollouts
)


def train_expert():
    # note: use `download_expert` instead to download a pretrained, competent expert
    print("Training an expert.")
    expert = PPO(
        policy=MlpPolicy,
        env=env,
        seed=0,
        batch_size=64,
        ent_coef=0.0,
        learning_rate=0.0003,
        n_epochs=10,
        n_steps=64,
    )
    expert.learn(1_000)  # Note: change this to 100_000 to train a decent expert.
    return expert


def download_expert():
    print("Downloading a pretrained expert.")
    expert = load_policy(
        "ppo-huggingface",
        organization="HumanCompatibleAI",
        env_name="seals-CartPole-v0",
        venv=env,
    )
    return expert


def sample_expert_transitions():
    # expert = train_expert()  # uncomment to train your own expert
    expert = download_expert()

    print("Sampling expert transitions.")
    rollouts = rollout.rollout(
        expert,
        env,
        rollout.make_sample_until(min_timesteps=None, min_episodes=50),
        rng=rng,
    )
    return rollout.flatten_trajectories(rollouts)


transitions = sample_expert_transitions()
bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
)

evaluation_env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    env_make_kwargs={"render_mode": "human"},  # for rendering
)

print("Evaluating the untrained policy.")
reward, _ = evaluate_policy(
    bc_trainer.policy,  # type: ignore[arg-type]
    evaluation_env,
    n_eval_episodes=3,
    render=True,  # comment out to speed up
)
print(f"Reward before training: {reward}")

print("Training a policy using Behavior Cloning")
bc_trainer.train(n_epochs=1)

print("Evaluating the trained policy.")
reward, _ = evaluate_policy(
    bc_trainer.policy,  # type: ignore[arg-type]
    evaluation_env,
    n_eval_episodes=3,
    render=True,  # comment out to speed up
)
print(f"Reward after training: {reward}")
Command Line Interface#
Many features of the core library are accessible via the command line interface built using the Sacred package.
Sacred is used to configure and run the algorithms. It is centered around the concept of experiments which are composed of reusable ingredients. Each experiment and each ingredient has its own configuration namespace. Named configurations are used to specify a coherent set of configuration values. It is recommended to at least read the Sacred documentation about the command line interface.
The scripts package contains a number of sacred experiments that either execute algorithms or perform utility tasks.
The most important ingredients for imitation learning are the environment, expert, demonstrations, and rl ingredients, which appear as configuration namespaces throughout the examples below.
Usage Examples#
Here we demonstrate some usage examples for the command line interface. You can always find out all the configurable values by running:
python -m imitation.scripts.<script> print_config
Run BC on the CartPole-v1 environment with a pre-trained PPO policy as expert#
Note
Here the cartpole environment is specified via a named configuration.
python -m imitation.scripts.train_imitation bc with \
cartpole \
demonstrations.n_expert_demos=50 \
bc.train_kwargs.n_batches=2000 \
expert.policy_type=ppo \
expert.loader_kwargs.path=tests/testdata/expert_models/cartpole_0/policies/final/model.zip
50 expert demonstrations are sampled from the PPO policy that is included in the testdata folder. 2000 batches are enough to train a good policy.
Run DAgger on the CartPole-v0 environment with a random policy as expert#
python -m imitation.scripts.train_imitation dagger with \
cartpole \
dagger.total_timesteps=2000 \
demonstrations.n_expert_demos=10 \
expert.policy_type=random
This will not produce any meaningful results, since a random policy is not a good expert.
Run AIRL on the MountainCar-v0 environment with an expert from the HuggingFace model hub#
python -m imitation.scripts.train_adversarial airl with \
seals_mountain_car \
total_timesteps=5000 \
expert.policy_type=ppo-huggingface \
demonstrations.n_expert_demos=500
Note
The small number of total timesteps is only for demonstration purposes and will not produce a good policy.
Run GAIL on the seals/Swimmer-v0 environment#
Here we do not use the named configuration for the seals environment, but instead specify the gym_id directly.
The seals:
prefix ensures that the seals package is imported and the environment is registered.
Note
The Swimmer environment needs mujoco_py to be installed.
python -m imitation.scripts.train_adversarial gail with \
environment.gym_id="seals:seals/Swimmer-v0" \
total_timesteps=5000 \
demonstrations.n_expert_demos=50
Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts#
First, train an expert and save the demonstrations. By default, this will use PPO
and train for 1M time steps.
We can set the number of time steps to train for by setting total_timesteps
.
After training the expert, we generate rollouts using the expert policy and save them to disk.
We can set a minimum number of episodes or time steps to be saved by setting one of rollout_save_n_episodes
or
rollout_save_n_timesteps
. Note that the number of episodes or time steps saved may be slightly larger than the
specified number.
By default the demonstrations are saved in <log_dir>/rollouts/final
(where for this script by default <log_dir>
is output/train_rl/<environment>/<timestamp>
).
However, we can pass an explicit path as logging directory.
python -m imitation.scripts.train_rl with seals_cartpole \
total_timesteps=40000 \
logging.log_dir=output/ppo/seals_cartpole/trained \
rollout_save_n_episodes=50
Instead of training a new expert, we can also load a pre-trained expert policy and generate rollouts from it.
This can be achieved using the eval_policy
script.
Note that the rollout_save_path is relative to the log_dir
of the imitation script.
python -m imitation.scripts.eval_policy with seals_cartpole \
expert.policy_type=ppo-huggingface \
eval_n_episodes=50 \
logging.log_dir=output/ppo/seals_cartpole/loaded \
rollout_save_path=rollouts/final
Now we can run the imitation script (in this case DAgger) and pass the path to the demonstrations we just generated:
python -m imitation.scripts.train_imitation dagger with \
seals_cartpole \
dagger.total_timesteps=2000 \
demonstrations.source=local \
demonstrations.path=output/ppo/seals_cartpole/loaded/rollouts/final
Visualise saved policies#
We can use the eval_policy
script to visualise and render a saved policy.
Here we are looking at a policy saved by a previous training run.
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
environment.num_vec=1 \
render=True \
environment.gym_id='Pendulum-v1'
Comparing algorithms’ performance#
Let’s use the CLI to compare the performance of different algorithms.
First, let’s train an expert on the CartPole-v1
environment.
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/expert \
total_timesteps=10000
Now let’s train a weaker agent.
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/non_expert \
total_timesteps=1000 # simply training less
We can evaluate each policy using the eval_policy
script.
For the expert:
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/expert
which will return something like
INFO - eval_policy - Result: {
'n_traj': 74,
'monitor_return_len': 74,
'return_min': 26.0,
'return_mean': 154.21621621621622,
'return_std': 79.94377589657559,
'return_max': 500.0,
'len_min': 26,
'len_mean': 154.21621621621622,
'len_std': 79.94377589657559,
'len_max': 500,
'monitor_return_min': 26.0,
'monitor_return_mean': 154.21621621621622,
'monitor_return_std': 79.94377589657559,
'monitor_return_max': 500.0
}
INFO - eval_policy - Completed after 0:00:12
For the non-expert:
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/non_expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/non_expert
INFO - eval_policy - Result: {
'n_traj': 355,
'monitor_return_len': 355,
'return_min': 8.0,
'return_mean': 28.92676056338028,
'return_std': 15.686012049373561,
'return_max': 104.0,
'len_min': 8,
'len_mean': 28.92676056338028,
'len_std': 15.686012049373561,
'len_max': 104,
'monitor_return_min': 8.0,
'monitor_return_mean': 28.92676056338028,
'monitor_return_std': 15.686012049373561,
'monitor_return_max': 104.0
}
INFO - eval_policy - Completed after 0:00:17
This will save the monitor CSVs (one for each vectorised env, controlled by environment.num_vec).
The monitor CSVs follow the naming convention mon*.monitor.csv
.
We can load these CSV files with pandas
and use the imitation.testing.reward_improvement
module to compare the performances of the two policies.
from pathlib import Path
import pandas as pd
from imitation.testing.reward_improvement import is_significant_reward_improvement
expert_monitor = pd.concat(
    [
        pd.read_csv(f, skiprows=1)
        for f in Path("./output/train_rl/CartPole-v1/expert/monitor").glob(
            "mon*.monitor.csv"
        )
    ]
)
non_expert_monitor = pd.concat(
    [
        pd.read_csv(f, skiprows=1)
        for f in Path("./output/train_rl/CartPole-v1/non_expert/monitor").glob(
            "mon*.monitor.csv"
        )
    ]
)

if is_significant_reward_improvement(non_expert_monitor["r"], expert_monitor["r"], 0.05):
    print("The expert improved over the non-expert with >95% probability")
else:
    print("No significant (p=0.05) reward improvement of expert over non-expert")
Algorithm Scripts#
Call the algorithm scripts like this:
python -m imitation.scripts.<script> [command] with <named_config> <config_values>
| algorithm | script | command |
|---|---|---|
| BC | train_imitation | bc |
| DAgger | train_imitation | dagger |
| AIRL | train_adversarial | airl |
| GAIL | train_adversarial | gail |
| Preference Comparison | train_preference_comparisons | |
| MCE IRL | none | |
| Density Based Reward Estimation | none | |
Utility Scripts#
Call the utility scripts like this:
python -m imitation.scripts.<script>
| Functionality | Script |
|---|---|
| Reinforcement Learning | train_rl |
| Evaluating a Policy | eval_policy |
| Parallel Execution of Algorithm Scripts | parallel |
| Converting Trajectory Formats | |
| Analyzing Experimental Results | |
Output Directories#
The results of the script runs are stored in the following directory structure:
output
├── <algo>
│ └── <environment>
│ └── <timestamp>
│ ├── log
│ ├── monitor
│ └── sacred -> ../../../sacred/<script_name>/1
└── sacred
└── <script_name>
├── 1
└── _sources
It contains the final model, tensorboard logs, sacred logs and the sacred source files.
Experts#
The algorithms in the imitation library are all about learning from some kind of expert. In many cases this expert is a piece of software itself. The imitation library natively supports experts trained using the stable-baselines3 reinforcement learning library.
For example, BC and DAgger can learn from an expert policy and the command line interface of AIRL/GAIL allows one to specify an expert to sample demonstrations from.
In the First Steps tutorial, we first train an expert policy using the stable-baselines3 library and then imitate its behavior using Behavioral Cloning (BC). In practice, you may want to load a pre-trained policy for performance reasons.
Loading a policy from a file#
The Python interface provides a load_policy()
function, which takes a policy_type, a VecEnv, and any extra kwargs for the corresponding policy loader.
import numpy as np
from imitation.policies.serialize import load_policy
from imitation.util import util
venv = util.make_vec_env("your-env", n_envs=4, rng=np.random.default_rng())
local_policy = load_policy("ppo", venv, path="path/to/model.zip")
To load a policy from disk, use either ppo or sac as the policy type. The path is specified by path in the loader_kwargs and it should either point to a zip file containing the policy or a directory containing a model.zip file that was created by stable-baselines3.
In the command line interface the expert.policy_type and expert.loader_kwargs parameters define the expert policy to load. For example, to train AIRL on a PPO expert, you would use the following command:
python -m imitation.scripts.train_adversarial airl \
with expert.policy_type=ppo expert.loader_kwargs.path="path/to/model.zip"
Loading a policy from HuggingFace#
HuggingFace is a popular repository for pre-trained models.
To load a stable-baselines3 policy from HuggingFace, use either ppo-huggingface or sac-huggingface as the policy type. By default, the policies are loaded from the HumanCompatibleAI organization, but you can override this by setting the organization parameter in the loader_kwargs. When using the Python API, you also have to specify the environment name as env_name.
import numpy as np
from imitation.policies.serialize import load_policy
from imitation.util import util
venv = util.make_vec_env("your-env", n_envs=4, rng=np.random.default_rng())
remote_policy = load_policy(
    "ppo-huggingface",
    organization="your-org",
    env_name="your-env",
    venv=venv,
)
In the command line interface, the env_name is automatically injected into the loader_kwargs and does not need to be defined explicitly. In this example, to train AIRL on a PPO expert that was loaded from your-org on HuggingFace:
python -m imitation.scripts.train_adversarial airl \
with expert.policy_type=ppo-huggingface expert.loader_kwargs.organization=your-org
Uploading a policy to HuggingFace#
The huggingface-sb3 package provides utilities to push your models to HuggingFace and load them from there. Make sure to use the naming scheme helpers as described in the readme. Otherwise, the loader will not be able to find your model in the repository.
For a convenient high-level interface to train RL models and upload them to HuggingFace, we recommend using the rl-baselines3-zoo.
Custom expert types#
If you want to use a custom expert type, you can write a corresponding factory
function according to PolicyLoaderFn()
and then
register it at the policy_registry
.
For example:
from imitation.policies.serialize import policy_registry
from stable_baselines3.common import policies
def my_policy_loader(venv, some_param: int) -> policies.BasePolicy:
    # load your policy here
    return policy
policy_registry.register("my-policy", my_policy_loader)
Then, you can use my-policy as the policy_type in the command line interface or the Python API:
python -m imitation.scripts.train_adversarial airl \
with expert.policy_type=my-policy expert.loader_kwargs.some_param=42
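The same custom policy type can also be used from the Python API. A minimal sketch (assuming venv is an existing VecEnv and my_policy_loader has been registered as above):
from imitation.policies.serialize import load_policy

# "my-policy" resolves to my_policy_loader via the policy_registry;
# extra keyword arguments are forwarded to the loader.
expert = load_policy("my-policy", venv, some_param=42)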
Trajectories#
For imitation learning we need trajectories.
Trajectories are sequences of observations and actions and sometimes rewards, which are generated by an agent
interacting with an environment.
They are also called rollouts or episodes.
Some are generated by experts and serve as demonstrations,
others are generated by the agent and serve as training data for a discriminator.
In this library they are stored in a Trajectory
dataclass:
@dataclasses.dataclass(frozen=True)
class Trajectory:
    obs: np.ndarray
    """Observations, shape (trajectory_len + 1, ) + observation_shape."""

    acts: np.ndarray
    """Actions, shape (trajectory_len, ) + action_shape."""

    infos: Optional[np.ndarray]
    """An array of info dicts, shape (trajectory_len, )."""

    terminal: bool
    """Does this trajectory (fragment) end in a terminal state?"""
The info dictionaries are optional and can contain arbitrary information.
Look at the Trajectory
class as well as the
gymnasium documentation for more details.
TrajectoryWithRew
is a subclass of
Trajectory
and has another
rews
field,
which is an array of rewards of shape (trajectory_len, ).
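For example, the fields of a single TrajectoryWithRew can be inspected directly (a sketch; traj is a hypothetical variable standing for any trajectory produced by the rollout helpers described below):
# Inspecting the fields of a trajectory `traj` (hypothetical variable).
print(traj.obs.shape)   # (trajectory_len + 1, *observation_shape)
print(traj.acts.shape)  # (trajectory_len, *action_shape)
print(traj.rews.shape)  # (trajectory_len,), only present on TrajectoryWithRew
print(traj.terminal)    # True if the episode ended in a terminal state
assert len(traj.obs) == len(traj.acts) + 1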
Usually, they are passed around as sequences of trajectories.
Some algorithms do not need as much information about the ordering of states, actions and rewards. Rather than using trajectories, these algorithms can make use of individual
Transitions
(flattened
trajectories).
Generating Trajectories#
To generate trajectories from a given policy, you can use the rollout helper:
import numpy as np
import imitation.data.rollout as rollout

your_trajectories = rollout.rollout(
    your_policy,
    your_env,
    sample_until=rollout.make_sample_until(min_episodes=10),
    rng=np.random.default_rng(),
    unwrap=False,
)
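If an algorithm only needs individual transitions rather than whole episodes, the trajectories generated above can be flattened into a Transitions object. A minimal sketch, reusing your_trajectories from the snippet above:
import imitation.data.rollout as rollout

# Flatten episode structure into individual (obs, act, next_obs, done) samples.
transitions = rollout.flatten_trajectories(your_trajectories)
print(f"{len(your_trajectories)} trajectories -> {len(transitions)} transitions")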
Storing/Loading Trajectories#
Trajectories can be stored on disk or uploaded to the HuggingFace Dataset Hub.
This will store the sequence of trajectories into a directory at your_path as a HuggingFace Dataset:
from imitation.data import serialize
serialize.save(your_path, your_trajectories)
In the same way you can load trajectories from a HuggingFace Dataset:
from imitation.data import serialize
your_trajectories = serialize.load(your_path)
Note that some older, now deprecated, trajectory formats are supported by this loader
,
but not by the saver
.
Reward Networks#
The goal of both inverse reinforcement learning (IRL) algorithms (e.g. AIRL, GAIL) and preference comparison is to discover a reward function. In imitation learning, these discovered rewards are parameterized by reward networks.
Reward Network API#
Reward networks need to support two separate but equally important modes of operation. First, these networks need to produce a reward that can be differentiated and used for training the reward network. These rewards are provided by the forward
method. Second, these networks need to produce a reward that can be used for training policies. These rewards are provided by the predict_processed
method, which applies additional post-processing that is unhelpful during reward network training.
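As a rough illustration of these two modes, the sketch below builds a small reward network and queries it both ways. This is only a sketch: it assumes a simple CartPole environment and uses the preprocess helper on the RewardNet base class to convert NumPy arrays into tensors.
import numpy as np
import gymnasium as gym
from imitation.rewards.reward_nets import BasicRewardNet

env = gym.make("CartPole-v1")
reward_net = BasicRewardNet(env.observation_space, env.action_space)

# Build a tiny fake batch of two transitions from a single reset observation.
obs, _ = env.reset(seed=0)
state = np.stack([obs, obs])
action = np.array([0, 1])
next_state = np.stack([obs, obs])
done = np.zeros(2, dtype=bool)

# Differentiable rewards (torch tensors), used when training the reward network itself.
state_th, action_th, next_state_th, done_th = reward_net.preprocess(state, action, next_state, done)
train_reward = reward_net(state_th, action_th, next_state_th, done_th)  # calls forward()

# Post-processed rewards (plain NumPy, no gradients), used when training a policy.
policy_reward = reward_net.predict_processed(state, action, next_state, done)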
Reward Network Architecture#
In imitation, reward networks are instances of torch.nn.Module. Out of the box, imitation provides a few reward network architectures, such as the multi-layer perceptron BasicRewardNet and the convolutional CNNRewardNet. To implement your own custom reward network, you can subclass RewardNet.
from imitation.rewards.reward_nets import RewardNet
import torch as th
class MyRewardNet(RewardNet):
    def __init__(self, observation_space, action_space):
        super().__init__(observation_space, action_space)
        # initialize your custom reward network here

    def forward(
        self,
        state: th.Tensor,  # (batch_size, *obs_shape)
        action: th.Tensor,  # (batch_size, *action_shape)
        next_state: th.Tensor,  # (batch_size, *obs_shape)
        done: th.Tensor,  # (batch_size,)
    ) -> th.Tensor:
        # implement your custom reward network here
        return th.zeros_like(done)  # (batch_size,)
Replace an Environment’s Reward with a Reward Network#
In order to use a reward network to train a policy, we need to integrate it into an environment. This is done by wrapping the environment in a RewardVecEnvWrapper
. This wrapper replaces the environment’s reward function with the reward network’s function.
from imitation.util import util
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.rewards.reward_nets import BasicRewardNet
reward_net = BasicRewardNet(obs_space, action_space)
venv = util.make_vec_env("Pendulum-v1", n_envs=3, rng=rng)
venv = RewardVecEnvWrapper(venv, reward_net.predict_processed)
Reward Network Wrappers#
Imitation learning algorithms should converge to a reward function that will theoretically induce the optimal or soft-optimal policy. However, these reward functions may not always be well suited for training RL agents, or we may want to modify them to encourage exploration, for instance.
There are two types of wrapper:
ForwardWrapper allows for direct modification of the results of the reward network’s forward method. It is used during the learning of the reward network and thus must be differentiable. These wrappers are always applied first and thus take effect regardless of whether you call forward, predict or predict_processed. They are used for applying transformations like potential shaping (see ShapedRewardNet).
PredictProcessedWrapper modifies the predict_processed call to the reward network. This type of wrapper is designed to modify the reward only when it is being used to train/evaluate a policy, not when we are taking gradients on it, so it does not have to be differentiable.
The most commonly used wrapper is NormalizedRewardNet, a predict-processed wrapper. This class uses a normalization layer to standardize the output of the reward function using its running mean and variance, which is useful for stabilizing training. When a reward network is saved, its wrappers are saved along with it, so that the normalization fit during reward learning can be used during future policy learning or evaluation.
from imitation.rewards.reward_nets import NormalizedRewardNet
from imitation.util.networks import RunningNorm
train_reward_net = NormalizedRewardNet(
    reward_net,
    normalize_output_layer=RunningNorm,
)
Note
The reward normalization wrapper does _not_ function identically to Stable Baselines3’s VecNormalize environment wrapper. First, it does not normalize the observations. Second, unlike VecNormalize, it scales and centers the reward using the base reward’s mean and variance, whereas VecNormalize scales the reward down using a running estimate of the _return_.
By default, the normalization wrapper updates the normalization on each call to predict_processed
. This behavior can be altered as shown below.
from functools import partial
eval_rew_fn = partial(reward_net.predict_processed, update_stats=False)
Serializing and Deserializing Reward Networks#
Reward networks, wrappers included, are serialized simply by calling th.save(reward_net, path)
.
However, when evaluating reward networks, we may or may not want to include the wrappers it was trained with. To load the reward network just as it was saved, wrappers included, we can simply call th.load(path)
. When using a learned reward network to train or evaluate a policy, we can select whether or not to include the reward network wrappers and convert it into a function using the load_reward
utility. For example, we might want to remove or keep the reward normalization fit during training in the evaluation phase.
import torch as th
from imitation.rewards.serialize import load_reward
from imitation.rewards.reward_nets import NormalizedRewardNet
th.save(train_reward_net, path)
train_reward_net = th.load(path)
# We can also load the reward network as a reward function for use in evaluation
eval_rew_fn_normalized = load_reward(reward_type="RewardNet_normalized", reward_path=path, venv=venv)
eval_rew_fn_unnormalized = load_reward(reward_type="RewardNet_unnormalized", reward_path=path, venv=venv)
# By default the normalization statistics are frozen when loading for evaluation or retraining; pass update_stats=True to keep updating them.
rew_fn_normalized = load_reward(reward_type="RewardNet_normalized", reward_path=path, venv=venv, update_stats=True)
Limitations on Horizon Length#
Warning
Variable Horizon Environments Considered Harmful
Reinforcement learning (RL) algorithms are commonly trained and evaluated in variable horizon environments.
In these environments, the episode ends when some termination condition is reached (rather than after a fixed number of steps).
This typically corresponds to success, such as reaching the top of the mountain in MountainCar
, or to failure, such as the pole falling down in CartPole
.
A variable horizon will tend to speed up RL training, by increasing the proportion of samples where the agent’s actions still have a meaningful impact on the reward, pruning out states that are already a foregone conclusion.
However, termination conditions must be carefully hand-designed for each environment.
Their inclusion therefore provides a significant source of information about the reward.
Evaluating reward and imitation learning algorithms in variable-horizon environments can therefore be deeply misleading.
In fact, reward learning in commonly used variable horizon environments such as MountainCar
and CartPole
can be solved by learning a single bit: the sign of the reward.
Of course, an algorithm being able to learn a single bit predicts very little about its performance on real-world tasks, which do not usually come with such an informative termination condition.
To make matters worse, some algorithms have a strong inductive bias towards a particular sign. Indeed, Figure 5 of Kostrikov et al (2021) shows that GAIL is able to reach a third of expert performance even without seeing any expert demonstrations. Consequently, algorithms that happen to have an inductive bias aligned with the task (e.g. positive reward bias for environments where longer episodes are better) may outperform unbiased algorithms on certain environments. Conversely, algorithms with a misaligned inductive bias will perform worse than an unbiased algorithm. This may lead to illusory discrepancies between algorithms, or even different implementations of the same algorithm.
Kostrikov et al (2021) introduces a way to correct for this bias. However, this does not solve the problem of information leakage. Rather, it merely ensures that different algorithms are all able to equally exploit the information leak provided by the termination condition.
In light of this issue, we would strongly recommend users evaluate imitation
and other reward or imitation learning algorithms only in fixed-horizon environments.
This is a common, though unfortunately not ubiquitous, practice in reward learning papers.
For example, Christiano et al (2017) use fixed horizon environments because:
Removing variable length episodes leaves the agent with only the information encoded in the environment itself; human feedback provides its only guidance about what it ought to do.
Many environments, like HalfCheetah
, are naturally fixed-horizon.
Moreover, most variable-horizon tasks can be easily converted into fixed-horizon tasks.
Our sister project seals provides fixed-horizon versions of many commonly used MuJoCo continuous control tasks, as well as mitigating other potential pitfalls in reward learning evaluation.
Given the serious issues with evaluation and training in variable horizon tasks, imitation will by default throw an error when training AIRL, GAIL or DRLHP in variable horizon tasks. If you have read this document and understand the problems that variable horizon tasks can cause but still want to train in variable horizon settings, you can override this safety check by setting allow_variable_horizon=True. Note this check is not applied for BC or DAgger, which operate on individual transitions (not episodes) and so cannot exploit the information leak.
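For example, the flag can be passed straight to the algorithm constructor. This is only a sketch; the other arguments are as in the GAIL example elsewhere in this documentation:
from imitation.algorithms.adversarial.gail import GAIL

# Sketch: construct GAIL on a variable-horizon environment anyway.
gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
    allow_variable_horizon=True,  # overrides the safety check described above
)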
Usage with allow_variable_horizon=True
is not officially supported, and we will not optimize imitation
algorithms
to perform well in this situation, as it would not represent real progress. Examples of situations where setting this
flag may nonetheless be appropriate include:
Investigating the bias introduced by variable horizon tasks – e.g. comparing variable to fixed horizon tasks.
For unit tests to verify algorithms continue to run on variable horizon environments.
Where the termination condition is trivial (e.g. has the robot fallen over?) and the target behaviour is complex (e.g. solve a Rubik’s cube). In this case, while the termination condition still helps reward and imitation learning, the problem remains highly non-trivial even with this information side-channel. However, the existence of this side-channel should of course be prominently disclosed.
See this GitHub issue for further discussion.
Non-Support for Infinite Length Horizons#
At the moment, we do not support infinite-length horizons. Many of the imitation algorithms, especially those relying on RL, do not easily port over to infinite-horizon setups. Similarly, much of the logging and reward calculation logic assumes the existence of a finite horizon. Although we may explore workarounds in the future, this is not a feature that we can currently support.
Benchmarking imitation#
The imitation library is benchmarked by running the algorithms BC, DAgger, AIRL and GAIL on five different environments from the seals environment suite, each with 10 different random seeds. You will find the benchmark results in the release artifacts, e.g. for the v1.0 release here.
Running a Single Benchmark#
To run a single benchmark from the command line, you may use:
python -m imitation.scripts.<train_script> <algo> with <algo>_<env>
There are two different train_scripts: train_imitation and train_adversarial, each running different algorithms:
| train_script | algo |
|---|---|
| train_imitation | bc, dagger |
| train_adversarial | gail, airl |
There are five environment configurations for which we have tuned hyperparameters:
| environment |
|---|
| seals_ant |
| seals_half_cheetah |
| seals_hopper |
| seals_swimmer |
| seals_walker |
If you want to run the same benchmark from a python script, you can use the following code:
...
from imitation.scripts.<train_script> import <train_script>_ex
<train_script>_ex.run(command_name="<algo>", named_configs=["<algo>_<env>"])
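For instance, a concrete instantiation of the template above might look like the following sketch (it assumes the gail_seals_half_cheetah named config from the tuned hyperparameters described below):
from imitation.scripts.train_adversarial import train_adversarial_ex

# Run GAIL with the hyperparameters tuned for seals/HalfCheetah.
run = train_adversarial_ex.run(
    command_name="gail",
    named_configs=["gail_seals_half_cheetah"],
)
print(run.result)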
Inputs#
The tuned hyperparameters can be found in src/imitation/scripts/config/tuned_hps
.
For v0.4.0, they correspond to the hyperparameters used in the paper
imitation: Clean Imitation Learning Implementations.
You may be able to get reasonable performance by using hyperparameters tuned for a similar environment.
The experts and expert demonstrations are loaded from the HuggingFace model hub and are grouped under the HumanCompatibleAI Organization.
Outputs#
The training scripts are sacred experiments which place their output in an output folder structured like this:
output
├── airl
│ └── seals-Swimmer-v1
│ └── 20231012_121226_c5c0e4
│ └── sacred -> ../../../sacred/train_adversarial/2
├── dagger
│ └── seals-CartPole-v0
│ └── 20230927_095917_c29dc2
│ └── sacred -> ../../../sacred/train_imitation/1
└── sacred
├── train_adversarial
│ ├── 1
│ ├── 2
│ ├── 3
│ ├── 4
│ ├── ...
│ └── _sources
└── train_imitation
├── 1
└── _sources
In the sacred folder all runs are grouped by the training script, and each run gets a folder named after its run id. That run folder contains:
a config.json file with the hyperparameters used for that run
a run.json file with run information, including the final score and expert score
a cout.txt file with the stdout of the run
Additionally, there are run folders grouped by algorithm and environment. They contain further log files and model checkpoints as well as a symlink to the corresponding sacred run folder.
Important entries in the json files are:
run.json
command: the name of the algorithm
result.imit_stats.monitor_return_mean: the score of a run
result.expert_stats.monitor_return_mean: the score of the expert policy that was used for a run
config.json
environment.gym_id: the environment name of the run
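For example, these entries can be read back from a single run folder with plain json (a sketch; the run folder path below is hypothetical):
import json
from pathlib import Path

run_dir = Path("output/sacred/train_adversarial/2")  # hypothetical run folder
run_info = json.loads((run_dir / "run.json").read_text())
config = json.loads((run_dir / "config.json").read_text())

print(config["environment"]["gym_id"])                            # environment of the run
print(run_info["result"]["imit_stats"]["monitor_return_mean"])    # score of the run
print(run_info["result"]["expert_stats"]["monitor_return_mean"])  # score of the expert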
Running the Complete Benchmark Suite#
To execute the entire benchmarking suite with 10 seeds for each configuration,
you can utilize the run_all_benchmarks.sh
script.
This script will consecutively run all configurations.
To optimize the process, consider parallelization options.
You can either send all commands to GNU Parallel, use SLURM by invoking run_all_benchmarks_on_slurm.sh, or split the lines across multiple scripts and run them on multiple machines manually.
Generating Benchmark Summaries#
There are scripts to summarize all runs in a folder in a CSV file or in a markdown file. For the CSV, run:
python sacred_output_to_csv.py output/sacred > summary.csv
This generates a csv file like this:
algo, env, score, expert_score
gail, seals/Walker2d-v1, 2298.883520464286, 2502.8930135576925
gail, seals/Swimmer-v1, 287.33667667857145, 295.40472964423077
airl, seals/Walker2d-v1, 310.4065185178571, 2502.8930135576925
...
For a more comprehensive summary that includes aggregate statistics such as mean, standard deviation, IQM (Inter Quartile Mean) with confidence intervals, as recommended by the rliable library, use the following command:
python sacred_output_to_markdown_summary.py output/sacred --output summary.md
This will produce a markdown summary file named summary.md
.
Hint:
If you have multiple output folders, because you ran different parts of the
benchmark on different machines, you can copy the output folders into a common root
folder.
The above scripts will search all nested directories for folders with
a run.json
and a config.json
file.
For example, calling python sacred_output_to_csv.py benchmark_runs/ > summary.csv
on an output folder structured like this:
benchmark_runs
├── first_batch
│ ├── 1
│ ├── 2
│ ├── 3
│ ├── ...
└── second_batch
├── 1
├── 2
├── 3
├── ...
will aggregate all runs from both first_batch
and second_batch
into a single
csv file.
Comparing an Algorithm against the Benchmark Runs#
If you modified one of the existing algorithms or implemented a new one, you might want to compare it to the benchmark runs to see if there is a significant improvement or not.
If your algorithm has the same file output format as described above, you can use the
compute_probability_of_improvement.py
script to do the comparison.
It uses the “Probability of Improvement” metric as recommended by the
rliable library.
python compute_probability_of_improvement.py <your_runs_dir> <baseline_runs_dir> --baseline-algo <algo>
where:
your_runs_dir is the directory containing the runs for your algorithm
baseline_runs_dir is the directory containing runs for a known algorithm. Hint: you do not need to re-run our benchmarks. We provide our run folders as release artifacts.
algo is the algorithm you want to compare against
If your_runs_dir
contains runs for more than one algorithm, you will have to
disambiguate using the --algo
option.
Tuning Hyperparameters#
The hyperparameters of any algorithm in imitation can be tuned using src/imitation/scripts/tuning.py.
The benchmarking hyperparameter configs were generated by tuning the hyperparameters using the search space defined in scripts/config/tuning.py.
The tuning script proceeds in two phases:
Tune the hyperparameters using the search space provided.
Re-evaluate the best hyperparameter config found in the first phase based on the maximum mean return on a separate set of seeds. Report the mean and standard deviation of these trials.
To use it with the default search space:
python -m imitation.scripts.tuning with <algo> 'parallel_run_config.base_named_configs=["<env>"]'
In this command:
<algo> provides the default search space and settings for the specific algorithm, which is defined in scripts/config/tuning.py
<env> sets the environment to tune the algorithm in. The environments are defined in the algorithm-specific scripts/config/train_[adversarial|imitation|preference_comparisons|rl].py files. For the already tuned environments, use the <algo>_<env> named configs here.
See the documentation of scripts/tuning.py
and scripts/parallel.py
for many other arguments that can be
provided through the command line to change the tuning behavior.
Benchmark Summary#
This is a summary of the sacred runs in benchmark_runs
generated by sacred_output_to_markdown_summary.py
.
Scores#
The scores are normalized based on the performance of a random agent as the baseline and the expert as the maximum possible score as explained in this blog post:
(score - random_score) / (expert_score - random_score)
Aggregate scores and confidence intervals are computed using the rliable library.
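Equivalently, as a trivial helper (the numbers below are made up purely for illustration):
def normalized_score(score: float, random_score: float, expert_score: float) -> float:
    """Scale scores so a random agent maps to 0.0 and the expert maps to 1.0."""
    return (score - random_score) / (expert_score - random_score)

# Hypothetical example values:
print(normalized_score(score=150.0, random_score=20.0, expert_score=500.0))  # ~0.27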
AIRL#
| Environment | Score (mean/std) | Normalized Score (mean/std) | N |
|---|---|---|---|
| seals/Ant-v1 | 2485.889 / 533.471 | 0.981 / 0.184 | 10 |
| seals/HalfCheetah-v1 | 938.450 / 804.871 | 0.627 / 0.412 | 10 |
| seals/Hopper-v1 | 183.780 / 93.295 | 0.921 / 0.373 | 10 |
| seals/Swimmer-v1 | 286.699 / 7.763 | 0.970 / 0.027 | 10 |
| seals/Walker2d-v1 | 1154.921 / 659.564 | 0.461 / 0.264 | 10 |
Aggregate Normalized scores#
| Metric | Value | 95% CI |
|---|---|---|
| Mean | 0.792 | [0.709, 0.792] |
| IQM | 0.918 | [0.871, 0.974] |
BC#
| Environment | Score (mean/std) | Normalized Score (mean/std) | N |
|---|---|---|---|
| seals/Ant-v1 | 2090.551 / 180.340 | 0.844 / 0.062 | 10 |
| seals/HalfCheetah-v1 | 1516.476 / 37.487 | 0.923 / 0.019 | 10 |
| seals/Hopper-v1 | 204.271 / 0.609 | 1.003 / 0.002 | 10 |
| seals/Swimmer-v1 | 276.242 / 9.328 | 0.935 / 0.032 | 10 |
| seals/Walker2d-v1 | 2393.254 / 37.641 | 0.956 / 0.015 | 10 |
Aggregate Normalized scores#
| Metric | Value | 95% CI |
|---|---|---|
| Mean | 0.932 | [0.922, 0.932] |
| IQM | 0.941 | [0.941, 0.949] |
DAGGER#
| Environment | Score (mean/std) | Normalized Score (mean/std) | N |
|---|---|---|---|
| seals/Ant-v1 | 2302.527 / 108.315 | 0.957 / 0.052 | 10 |
| seals/HalfCheetah-v1 | 1615.004 / 8.262 | 1.017 / 0.008 | 10 |
| seals/Hopper-v1 | 204.789 / 1.599 | 1.011 / 0.012 | 10 |
| seals/Swimmer-v1 | 283.776 / 6.524 | 0.988 / 0.024 | 10 |
| seals/Walker2d-v1 | 2419.748 / 52.215 | 1.002 / 0.026 | 10 |
Aggregate Normalized scores#
| Metric | Value | 95% CI |
|---|---|---|
| Mean | 0.995 | [0.987, 0.998] |
| IQM | 1.004 | [1.003, 1.008] |
GAIL#
| Environment | Score (mean/std) | Normalized Score (mean/std) | N |
|---|---|---|---|
| seals/Ant-v1 | 2527.566 / 148.034 | 0.995 / 0.051 | 10 |
| seals/HalfCheetah-v1 | 1595.129 / 37.374 | 0.963 / 0.019 | 10 |
| seals/Hopper-v1 | 187.105 / 14.298 | 0.935 / 0.057 | 10 |
| seals/Swimmer-v1 | 249.949 / 74.295 | 0.845 / 0.254 | 10 |
| seals/Walker2d-v1 | 2399.196 / 89.949 | 0.959 / 0.036 | 10 |
Aggregate Normalized scores#
| Metric | Value | 95% CI |
|---|---|---|
| Mean | 0.939 | [0.900, 0.944] |
| IQM | 0.957 | [0.965, 0.970] |
Behavioral Cloning (BC)#
Behavioral cloning directly learns a policy by using supervised learning on observation-action pairs from expert demonstrations. It is a simple approach to learning a policy, but the policy often generalizes poorly and does not recover well from errors.
Alternatives to behavioral cloning include DAgger (similar but gathers on-policy demonstrations) and GAIL/AIRL (more robust approaches to learning from demonstrations).
Example#
Detailed example notebook: Train an Agent using Behavior Cloning
import numpy as np
import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
rng = np.random.default_rng(0)
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    n_envs=1,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],  # for computing rollouts
)
expert = load_policy(
    "ppo-huggingface",
    organization="HumanCompatibleAI",
    env_name="seals-CartPole-v0",
    venv=env,
)
rollouts = rollout.rollout(
    expert,
    env,
    rollout.make_sample_until(min_timesteps=None, min_episodes=50),
    rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)
bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
)
bc_trainer.train(n_epochs=1)
reward, _ = evaluate_policy(bc_trainer.policy, env, 10)
print("Reward:", reward)
API#
- class imitation.algorithms.bc.BC(*, observation_space, action_space, rng, policy=None, demonstrations=None, batch_size=32, minibatch_size=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, ent_weight=0.001, l2_weight=0.0, device='auto', custom_logger=None)[source]
Bases:
DemonstrationAlgorithm
Behavioral cloning (BC).
Recovers a policy via supervised learning from observation-action pairs.
- __init__(*, observation_space, action_space, rng, policy=None, demonstrations=None, batch_size=32, minibatch_size=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, ent_weight=0.001, l2_weight=0.0, device='auto', custom_logger=None)[source]
Builds BC.
- Parameters
observation_space (Space) – the observation space of the environment.
action_space (Space) – the action space of the environment.
rng (Generator) – the random state to use for the random number generator.
policy (Optional[ActorCriticPolicy]) – a Stable Baselines3 policy; if unspecified, defaults to FeedForward32Policy.
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).
batch_size (int) – The number of samples in each batch of expert data.
minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.
optimizer_cls (Type[Optimizer]) – optimiser to use for supervised training.
optimizer_kwargs (Optional[Mapping[str, Any]]) – keyword arguments, excluding learning rate and weight decay, for optimiser construction.
ent_weight (float) – scaling applied to the policy’s entropy regularization.
l2_weight (float) – scaling applied to the policy’s L2 regularization.
device (Union[str, device]) – name/identity of device to place policy on.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- Raises
ValueError – If weight_decay is specified in optimizer_kwargs (use the parameter l2_weight instead), or if the batch size is not a multiple of the minibatch size.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property policy: ActorCriticPolicy
Returns a policy imitating the demonstration data.
- Return type
ActorCriticPolicy
- set_demonstrations(demonstrations)[source]
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(*, n_epochs=None, n_batches=None, on_epoch_end=None, on_batch_end=None, log_interval=500, log_rollouts_venv=None, log_rollouts_n_episodes=5, progress_bar=True, reset_tensorboard=False)[source]
Train with supervised learning for some number of epochs.
Here an ‘epoch’ is just a complete pass through the expert data loader, as set by self.set_expert_data_loader(). Note that when you specify n_batches smaller than the number of batches in an epoch, the on_epoch_end callback will never be called.
- Parameters
n_epochs (Optional[int]) – Number of complete passes made through expert data before ending training. Provide exactly one of n_epochs and n_batches.
n_batches (Optional[int]) – Number of batches loaded from dataset before ending training. Provide exactly one of n_epochs and n_batches.
on_epoch_end (Optional[Callable[[], None]]) – Optional callback with no parameters to run at the end of each epoch.
on_batch_end (Optional[Callable[[], None]]) – Optional callback with no parameters to run at the end of each batch.
log_interval (int) – Log stats after every log_interval batches.
log_rollouts_venv (Optional[VecEnv]) – If not None, then this VecEnv (whose observation and actions spaces must match self.observation_space and self.action_space) is used to generate rollout stats, including average return and average episode length. If None, then no rollouts are generated.
log_rollouts_n_episodes (int) – Number of rollouts to generate when calculating rollout stats. Non-positive number disables rollouts.
progress_bar (bool) – If True, then show a progress bar during training.
reset_tensorboard (bool) – If True, then start plotting to Tensorboard from x=0 even if .train() logged to Tensorboard previously. Has no practical effect if .train() is being called for the first time.
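A small usage sketch of these options (assuming bc_trainer was constructed as in the example above, and venv_eval is a hypothetical evaluation VecEnv):
batch_counter = []

def on_batch_end() -> None:
    # Called by BC after every batch; here we simply count batches.
    batch_counter.append(1)

bc_trainer.train(
    n_batches=500,
    on_batch_end=on_batch_end,
    log_interval=100,
    log_rollouts_venv=venv_eval,  # optional: periodically log rollout stats
    log_rollouts_n_episodes=3,
)
print(len(batch_counter))  # 500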
Generative Adversarial Imitation Learning (GAIL)#
GAIL learns a policy by simultaneously training it with a discriminator that aims to distinguish expert trajectories against trajectories from the learned policy.
Note
GAIL paper: Generative Adversarial Imitation Learning
Example#
Detailed example notebook: Train an Agent using Generative Adversarial Imitation Learning
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy
from imitation.algorithms.adversarial.gail import GAIL
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
SEED = 42
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=np.random.default_rng(SEED),
    n_envs=8,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],  # to compute rollouts
)
expert = load_policy(
    "ppo-huggingface",
    organization="HumanCompatibleAI",
    env_name="seals-CartPole-v0",
    venv=env,
)
rollouts = rollout.rollout(
    expert,
    env,
    rollout.make_sample_until(min_timesteps=None, min_episodes=60),
    rng=np.random.default_rng(SEED),
)

learner = PPO(
    env=env,
    policy=MlpPolicy,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0004,
    gamma=0.95,
    n_epochs=5,
    seed=SEED,
)
reward_net = BasicRewardNet(
    observation_space=env.observation_space,
    action_space=env.action_space,
    normalize_input_layer=RunningNorm,
)
gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=512,
    n_disc_updates_per_round=8,
    venv=env,
    gen_algo=learner,
    reward_net=reward_net,
)

# evaluate the learner before training
env.seed(SEED)
learner_rewards_before_training, _ = evaluate_policy(
    learner, env, 100, return_episode_rewards=True,
)

# train the learner and evaluate again
gail_trainer.train(20000)  # Train for 800_000 steps to match expert.
env.seed(SEED)
learner_rewards_after_training, _ = evaluate_policy(
    learner, env, 100, return_episode_rewards=True,
)
print("mean reward after training:", np.mean(learner_rewards_after_training))
print("mean reward before training:", np.mean(learner_rewards_before_training))
API#
- class imitation.algorithms.adversarial.gail.GAIL(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]
Bases:
AdversarialTrainer
Generative Adversarial Imitation Learning (GAIL).
- __init__(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]
Generative Adversarial Imitation Learning.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).
demo_batch_size (int) – The number of samples in each batch of expert data. The discriminator batch size is twice this number because each discriminator batch contains a generator sample for every expert sample.
venv (VecEnv) – The vectorized environment to train in.
gen_algo (BaseAlgorithm) – The generator RL algorithm that is trained to maximize discriminator confusion. Environment and logger will be set to venv and custom_logger.
reward_net (RewardNet) – a Torch module that takes an observation, action and next observation tensor as input, then computes the logits. Used as the GAIL discriminator.
**kwargs – Passed through to AdversarialTrainer.__init__.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- logits_expert_is_high(state, action, next_state, done, log_policy_act_prob=None)[source]
Compute the discriminator’s logits for each state-action sample.
- Parameters
state (Tensor) – The state of the environment at the time of the action.
action (Tensor) – The action taken by the expert or generator.
next_state (Tensor) – The state of the environment after the action.
done (Tensor) – whether a terminal state (as defined under the MDP of the task) has been reached.
log_policy_act_prob (Optional[Tensor]) – The log probability of the action taken by the generator, \(\log{P(a|s)}\).
- Return type
Tensor
- Returns
The logits of the discriminator for each state-action sample.
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- property reward_test: RewardNet
Reward used to train policy at “test” time after adversarial training.
- Return type
RewardNet
- set_demonstrations(demonstrations)
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(total_timesteps, callback=None)
Alternates between training the generator and discriminator.
Every “round” consists of a call to train_gen(self.gen_train_timesteps), a call to train_disc, and finally a call to callback(round).
Training ends once an additional “round” would cause the number of transitions sampled from the environment to exceed total_timesteps.
- Parameters
total_timesteps (int) – An upper bound on the number of transitions to sample from the environment during training.
callback (Optional[Callable[[int], None]]) – A function called at the end of every round which takes in a single argument, the round number. Round numbers are in range(total_timesteps // self.gen_train_timesteps).
- Return type
None
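For concreteness, the round-based loop above can be driven with a callback. A minimal sketch, reusing the gail_trainer constructed in the example earlier in this section (the callback name here is illustrative):
# Minimal sketch: log the round number at the end of every generator/discriminator round.
def log_round(round_num: int) -> None:
    print(f"finished adversarial round {round_num}")
gail_trainer.train(total_timesteps=20_000, callback=log_round)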
- train_disc(*, expert_samples=None, gen_samples=None)
Perform a single discriminator update, optionally using provided samples.
- Parameters
expert_samples (Optional[Mapping]) – Transition samples from the expert in dictionary form. If provided, must contain keys corresponding to every field of the Transitions dataclass except “infos”. All corresponding values can be either NumPy arrays or Tensors. Extra keys are ignored. Must contain self.demo_batch_size samples. If this argument is not provided, then self.demo_batch_size expert samples from self.demo_data_loader are used by default.
gen_samples (Optional[Mapping]) – Transition samples from the generator policy in same dictionary form as expert_samples. If provided, must contain exactly self.demo_batch_size samples. If not provided, then take len(expert_samples) samples from the generator replay buffer.
- Return type
Mapping[str, float]
- Returns
Statistics for discriminator (e.g. loss, accuracy).
- train_gen(total_timesteps=None, learn_kwargs=None)
Trains the generator to maximize the discriminator loss.
After the end of training populates the generator replay buffer (used in discriminator training) with self.disc_batch_size transitions.
- Parameters
total_timesteps (Optional[int]) – The number of transitions to sample from self.venv_train during training. By default, self.gen_train_timesteps.
learn_kwargs (Optional[Mapping]) – kwargs for the Stable Baselines RLModel.learn() method.
- Return type
None
- venv: VecEnv
The original vectorized environment.
- venv_train: VecEnv
Like self.venv, but wrapped with train reward unless in debug mode.
If debug_use_ground_truth=True was passed into the initializer then self.venv_train is the same as self.venv.
- venv_wrapped: VecEnvWrapper
- class imitation.algorithms.adversarial.common.AdversarialTrainer(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, demo_minibatch_size=None, n_disc_updates_per_round=2, log_dir='output/', disc_opt_cls=<class 'torch.optim.adam.Adam'>, disc_opt_kwargs=None, gen_train_timesteps=None, gen_replay_buffer_capacity=None, custom_logger=None, init_tensorboard=False, init_tensorboard_graph=False, debug_use_ground_truth=False, allow_variable_horizon=False)[source]
Bases: DemonstrationAlgorithm[Transitions]
Base class for adversarial imitation learning algorithms like GAIL and AIRL.
- __init__(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, demo_minibatch_size=None, n_disc_updates_per_round=2, log_dir='output/', disc_opt_cls=<class 'torch.optim.adam.Adam'>, disc_opt_kwargs=None, gen_train_timesteps=None, gen_replay_buffer_capacity=None, custom_logger=None, init_tensorboard=False, init_tensorboard_graph=False, debug_use_ground_truth=False, allow_variable_horizon=False)[source]
Builds AdversarialTrainer.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).
demo_batch_size (int) – The number of samples in each batch of expert data. The discriminator batch size is twice this number because each discriminator batch contains a generator sample for every expert sample.
venv (VecEnv) – The vectorized environment to train in.
gen_algo (BaseAlgorithm) – The generator RL algorithm that is trained to maximize discriminator confusion. Environment and logger will be set to venv and custom_logger.
reward_net (RewardNet) – a Torch module that takes observation, action and next observation tensors as input and computes a reward signal.
demo_minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until the entire batch is processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of demo_batch_size. Optional, defaults to demo_batch_size.
n_disc_updates_per_round (int) – The number of discriminator updates after each round of generator updates in AdversarialTrainer.learn().
log_dir (Union[str, bytes, PathLike]) – Directory to store TensorBoard logs, plots, etc. in.
disc_opt_cls (Type[Optimizer]) – The optimizer for discriminator training.
disc_opt_kwargs (Optional[Mapping]) – Parameters for discriminator training.
gen_train_timesteps (Optional[int]) – The number of steps to train the generator policy for each iteration. If None, then defaults to the batch size (for on-policy) or number of environments (for off-policy).
gen_replay_buffer_capacity (Optional[int]) – The capacity of the generator replay buffer (the number of obs-action-obs samples from the generator that can be stored). By default this is equal to gen_train_timesteps, meaning that we sample only from the most recent batch of generator samples.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
init_tensorboard (bool) – If True, makes various discriminator TensorBoard summaries.
init_tensorboard_graph (bool) – If both this and init_tensorboard are True, then write a TensorBoard graph summary to disk.
debug_use_ground_truth (bool) – If True, use the ground truth reward for self.train_env. This disables the reward wrapping that would normally replace the environment reward with the learned reward. This is useful for sanity checking that the policy training is functional.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
- Raises
ValueError – if the batch size is not a multiple of the minibatch size.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- abstract logits_expert_is_high(state, action, next_state, done, log_policy_act_prob=None)[source]
Compute the discriminator’s logits for each state-action sample.
A high value corresponds to predicting expert, and a low value corresponds to predicting generator.
- Parameters
state (Tensor) – state at time t, of shape (batch_size,) + state_shape.
action (Tensor) – action taken at time t, of shape (batch_size,) + action_shape.
next_state (Tensor) – state at time t+1, of shape (batch_size,) + state_shape.
done (Tensor) – binary episode completion flag after action at time t, of shape (batch_size,).
log_policy_act_prob (Optional[Tensor]) – log probability of generator policy taking action at time t.
- Return type
Tensor
- Returns
Discriminator logits of shape (batch_size,). A high output indicates an expert-like transition.
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- abstract property reward_test: RewardNet
Reward used to train policy at “test” time after adversarial training.
- Return type
RewardNet
- abstract property reward_train: RewardNet
Reward used to train generator policy.
- Return type
RewardNet
- set_demonstrations(demonstrations)[source]
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(total_timesteps, callback=None)[source]
Alternates between training the generator and discriminator.
Every “round” consists of a call to train_gen(self.gen_train_timesteps), a call to train_disc, and finally a call to callback(round).
Training ends once an additional “round” would cause the number of transitions sampled from the environment to exceed total_timesteps.
- Parameters
total_timesteps (int) – An upper bound on the number of transitions to sample from the environment during training.
callback (Optional[Callable[[int], None]]) – A function called at the end of every round which takes in a single argument, the round number. Round numbers are in range(total_timesteps // self.gen_train_timesteps).
- Return type
None
- train_disc(*, expert_samples=None, gen_samples=None)[source]
Perform a single discriminator update, optionally using provided samples.
- Parameters
expert_samples (Optional[Mapping]) – Transition samples from the expert in dictionary form. If provided, must contain keys corresponding to every field of the Transitions dataclass except “infos”. All corresponding values can be either NumPy arrays or Tensors. Extra keys are ignored. Must contain self.demo_batch_size samples. If this argument is not provided, then self.demo_batch_size expert samples from self.demo_data_loader are used by default.
gen_samples (Optional[Mapping]) – Transition samples from the generator policy in same dictionary form as expert_samples. If provided, must contain exactly self.demo_batch_size samples. If not provided, then take len(expert_samples) samples from the generator replay buffer.
- Return type
Mapping[str, float]
- Returns
Statistics for discriminator (e.g. loss, accuracy).
- train_gen(total_timesteps=None, learn_kwargs=None)[source]
Trains the generator to maximize the discriminator loss.
After the end of training populates the generator replay buffer (used in discriminator training) with self.disc_batch_size transitions.
- Parameters
total_timesteps (Optional[int]) – The number of transitions to sample from self.venv_train during training. By default, self.gen_train_timesteps.
learn_kwargs (Optional[Mapping]) – kwargs for the Stable Baselines RLModel.learn() method.
- Return type
None
- venv: VecEnv
The original vectorized environment.
- venv_train: VecEnv
Like self.venv, but wrapped with train reward unless in debug mode.
If debug_use_ground_truth=True was passed into the initializer then self.venv_train is the same as self.venv.
- venv_wrapped: VecEnvWrapper
Adversarial Inverse Reinforcement Learning (AIRL)#
AIRL, similar to GAIL, adversarially trains a policy against a discriminator that aims to distinguish the expert demonstrations from the learned policy. Unlike GAIL, AIRL recovers a reward function that is more generalizable to changes in environment dynamics.
The expert policy must be stochastic.
Example#
Detailed example notebook: Train an Agent using Adversarial Inverse Reinforcement Learning
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy
from imitation.algorithms.adversarial.airl import AIRL
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.rewards.reward_nets import BasicShapedRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
SEED = 42
env = make_vec_env(
"seals:seals/CartPole-v0",
rng=np.random.default_rng(SEED),
n_envs=8,
post_wrappers=[lambda env, _: RolloutInfoWrapper(env)], # to compute rollouts
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals-CartPole-v0",
venv=env,
)
rollouts = rollout.rollout(
expert,
env,
rollout.make_sample_until(min_episodes=60),
rng=np.random.default_rng(SEED),
)
learner = PPO(
env=env,
policy=MlpPolicy,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0005,
gamma=0.95,
clip_range=0.1,
vf_coef=0.1,
n_epochs=5,
seed=SEED,
)
reward_net = BasicShapedRewardNet(
observation_space=env.observation_space,
action_space=env.action_space,
normalize_input_layer=RunningNorm,
)
airl_trainer = AIRL(
demonstrations=rollouts,
demo_batch_size=2048,
gen_replay_buffer_capacity=512,
n_disc_updates_per_round=16,
venv=env,
gen_algo=learner,
reward_net=reward_net,
)
env.seed(SEED)
learner_rewards_before_training, _ = evaluate_policy(
learner, env, 100, return_episode_rewards=True,
)
airl_trainer.train(20000) # Train for 2_000_000 steps to match expert.
env.seed(SEED)
learner_rewards_after_training, _ = evaluate_policy(
learner, env, 100, return_episode_rewards=True,
)
print("mean reward after training:", np.mean(learner_rewards_after_training))
print("mean reward before training:", np.mean(learner_rewards_before_training))
API#
- class imitation.algorithms.adversarial.airl.AIRL(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]
Bases: AdversarialTrainer
Adversarial Inverse Reinforcement Learning (AIRL).
- __init__(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]
Builds an AIRL trainer.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).
demo_batch_size (int) – The number of samples in each batch of expert data. The discriminator batch size is twice this number because each discriminator batch contains a generator sample for every expert sample.
venv (VecEnv) – The vectorized environment to train in.
gen_algo (BaseAlgorithm) – The generator RL algorithm that is trained to maximize discriminator confusion. Environment and logger will be set to venv and custom_logger.
reward_net (RewardNet) – Reward network; used as part of the AIRL discriminator.
**kwargs – Passed through to AdversarialTrainer.__init__.
- Raises
TypeError – If gen_algo.policy does not have an evaluate_actions attribute (present in ActorCriticPolicy), needed to compute log-probability of actions.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- logits_expert_is_high(state, action, next_state, done, log_policy_act_prob=None)[source]
Compute the discriminator’s logits for each state-action sample.
In Fu’s AIRL paper (https://arxiv.org/pdf/1710.11248.pdf), the discriminator output was given as
\[D_{\theta}(s,a) = \frac{ \exp{r_{\theta}(s,a)} } { \exp{r_{\theta}(s,a)} + \pi(a|s) }\]
with a high value corresponding to the expert and a low value corresponding to the generator.
In other words, the discriminator output is the probability that the action is taken by the expert rather than the generator.
The logit of the above is given as
\[\operatorname{logit}(D_{\theta}(s,a)) = r_{\theta}(s,a) - \log{ \pi(a|s) }\]
which is what is returned by this function.
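As a quick illustration (not the method’s internal code; the tensor names below are ours), the logit is simply the elementwise difference between the reward network output and the generator’s log action probability:
import torch
reward_output = torch.tensor([0.5, -1.2, 2.0])          # r_theta(s, a), shape (batch_size,)
log_policy_act_prob = torch.tensor([-0.7, -0.1, -2.3])  # log pi(a|s), shape (batch_size,)
logits = reward_output - log_policy_act_prob            # logit(D_theta(s, a))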
- Parameters
state (Tensor) – The state of the environment at the time of the action.
action (Tensor) – The action taken by the expert or generator.
next_state (Tensor) – The state of the environment after the action.
done (Tensor) – whether a terminal state (as defined under the MDP of the task) has been reached.
log_policy_act_prob (Optional[Tensor]) – The log probability of the action taken by the generator, \(\log{ \pi(a|s) }\).
- Return type
Tensor
- Returns
The logits of the discriminator for each state-action sample.
- Raises
TypeError – If log_policy_act_prob is None.
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- property reward_test: RewardNet
Returns the unshaped version of reward network used for testing.
- Return type
RewardNet
- set_demonstrations(demonstrations)
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(total_timesteps, callback=None)
Alternates between training the generator and discriminator.
Every “round” consists of a call to train_gen(self.gen_train_timesteps), a call to train_disc, and finally a call to callback(round).
Training ends once an additional “round” would cause the number of transitions sampled from the environment to exceed total_timesteps.
- Parameters
total_timesteps (int) – An upper bound on the number of transitions to sample from the environment during training.
callback (Optional[Callable[[int], None]]) – A function called at the end of every round which takes in a single argument, the round number. Round numbers are in range(total_timesteps // self.gen_train_timesteps).
- Return type
None
- train_disc(*, expert_samples=None, gen_samples=None)
Perform a single discriminator update, optionally using provided samples.
- Parameters
expert_samples (Optional[Mapping]) – Transition samples from the expert in dictionary form. If provided, must contain keys corresponding to every field of the Transitions dataclass except “infos”. All corresponding values can be either NumPy arrays or Tensors. Extra keys are ignored. Must contain self.demo_batch_size samples. If this argument is not provided, then self.demo_batch_size expert samples from self.demo_data_loader are used by default.
gen_samples (Optional[Mapping]) – Transition samples from the generator policy in same dictionary form as expert_samples. If provided, must contain exactly self.demo_batch_size samples. If not provided, then take len(expert_samples) samples from the generator replay buffer.
- Return type
Mapping[str, float]
- Returns
Statistics for discriminator (e.g. loss, accuracy).
- train_gen(total_timesteps=None, learn_kwargs=None)
Trains the generator to maximize the discriminator loss.
After the end of training populates the generator replay buffer (used in discriminator training) with self.disc_batch_size transitions.
- Parameters
total_timesteps (Optional[int]) – The number of transitions to sample from self.venv_train during training. By default, self.gen_train_timesteps.
learn_kwargs (Optional[Mapping]) – kwargs for the Stable Baselines RLModel.learn() method.
- Return type
None
- venv: VecEnv
The original vectorized environment.
- venv_train: VecEnv
Like self.venv, but wrapped with train reward unless in debug mode.
If debug_use_ground_truth=True was passed into the initializer then self.venv_train is the same as self.venv.
- venv_wrapped: VecEnvWrapper
DAgger#
DAgger (Dataset Aggregation) iteratively trains a policy using supervised learning on a dataset of observation-action pairs from expert demonstrations (like behavioral cloning), runs the policy to gather observations, queries the expert for good actions on those observations, and adds the newly labeled observations to the dataset. DAgger improves on behavioral cloning by training on a dataset that better resembles the observations the trained policy is likely to encounter, but it requires querying the expert online.
Note
DAgger paper: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Example#
Detailed example notebook: Train an Agent using the DAgger Algorithm
import tempfile
import numpy as np
import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
rng = np.random.default_rng(0)
env = make_vec_env(
"seals:seals/CartPole-v0",
rng=rng,
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals-CartPole-v0",
venv=env,
)
bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
rng=rng,
)
with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
print(tmpdir)
dagger_trainer = SimpleDAggerTrainer(
venv=env,
scratch_dir=tmpdir,
expert_policy=expert,
bc_trainer=bc_trainer,
rng=rng,
)
dagger_trainer.train(8_000)
reward, _ = evaluate_policy(dagger_trainer.policy, env, 10)
print("Reward:", reward)
API#
- class imitation.algorithms.dagger.InteractiveTrajectoryCollector(venv, get_robot_acts, beta, save_dir, rng)[source]
Bases: VecEnvWrapper
DAgger VecEnvWrapper for querying and saving expert actions.
Every call to .step(actions) accepts and saves expert actions to self.save_dir, but only forwards expert actions to the wrapped VecEnv with probability self.beta. With probability 1 - self.beta, a “robot” action (i.e. an action from the imitation policy) is forwarded instead.
Demonstrations are saved as TrajectoryWithRew to self.save_dir at the end of every episode.
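The per-environment mixing rule can be pictured with the following sketch (a conceptual illustration, not the class’s actual implementation; all names are ours):
import numpy as np
rng = np.random.default_rng(0)
def mix_actions(expert_actions: np.ndarray, robot_actions: np.ndarray, beta: float) -> np.ndarray:
    # Independently for each environment, keep the expert action with probability beta;
    # otherwise substitute the robot (imitation policy) action.
    use_expert = rng.random(len(expert_actions)) < beta
    return np.where(use_expert, expert_actions, robot_actions)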
- __init__(venv, get_robot_acts, beta, save_dir, rng)[source]
Builds InteractiveTrajectoryCollector.
- Parameters
venv (VecEnv) – vectorized environment to sample trajectories from.
get_robot_acts (Callable[[ndarray], ndarray]) – get robot actions that can be substituted for human actions. Takes a vector of observations as input & returns a vector of actions.
beta (float) – fraction of the time to use action given to .step() instead of robot action. The choice of robot or human action is independently randomized for each individual Env at every timestep.
save_dir (Union[str, bytes, PathLike]) – directory to save collected trajectories in.
rng (Generator) – random state for random number generation.
- close()
Clean up the environment’s resources.
- Return type
None
- env_is_wrapped(wrapper_class, indices=None)
Check if environments are wrapped with a given wrapper.
- Parameters
wrapper_class – The wrapper class to look for.
indices (Union[None, int, Iterable[int]]) – Indices of envs to check.
- Return type
List[bool]
- Returns
True if the env is wrapped, False otherwise, for each env queried.
- env_method(method_name, *method_args, indices=None, **method_kwargs)
Call instance methods of vectorized environments.
- Parameters
method_name (
str
) – The name of the environment method to invoke.indices (
Union
[None
,int
,Iterable
[int
]]) – Indices of envs whose method to callmethod_args – Any positional arguments to provide in the call
method_kwargs – Any keyword arguments to provide in the call
- Return type
List
[Any
]- Returns
List of items returned by the environment’s method call
- get_attr(attr_name, indices=None)
Return attribute from vectorized environment.
- Parameters
attr_name (str) – The name of the attribute whose value to return.
indices (Union[None, int, Iterable[int]]) – Indices of envs to get attribute from.
- Return type
List[Any]
- Returns
List of values of ‘attr_name’ in all environments
- get_images()
Return RGB images from each environment when available
- Return type
Sequence[Optional[ndarray]]
- getattr_depth_check(name, already_found)
See base class.
- Return type
Optional[str]
- Returns
name of module whose attribute is being shadowed, if any.
- getattr_recursive(name)
Recursively check wrappers to find attribute.
- Parameters
name (str) – name of attribute to look for.
- Return type
Any
- Returns
attribute
- render(mode=None)
Gym environment rendering
- Parameters
mode (Optional[str]) – the rendering type.
- Return type
Optional[ndarray]
- reset()[source]
Resets the environment.
- Returns
first observation of a new trajectory.
- Return type
obs
- reset_infos: List[Dict[str, Any]]
- seed(seed=None)[source]
Set the seed for the DAgger random number generator and wrapped VecEnv.
The DAgger RNG is used along with self.beta to determine whether the expert or robot action is forwarded to the wrapped VecEnv.
- Parameters
seed (Optional[int]) – The random seed. May be None for completely random seeding.
- Return type
List[Optional[int]]
- Returns
A list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when seeded.
- set_attr(attr_name, value, indices=None)
Set attribute inside vectorized environments.
- Parameters
attr_name (str) – The name of attribute to assign new value.
value (Any) – Value to assign to attr_name.
indices (Union[None, int, Iterable[int]]) – Indices of envs to assign value.
- Return type
None
- set_options(options=None)
Set environment options for all environments. If a dict is passed instead of a list, the same options will be used for all environments. WARNING: Those options will only be passed to the environment at the next reset.
- Parameters
options (Union[List[Dict], Dict, None]) – A dictionary of environment options to pass to each environment at the next reset.
- Return type
None
- step(actions)
Step the environments with the given action
- Parameters
actions (ndarray) – the action.
- Return type
Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
- Returns
observation, reward, done, information
- step_async(actions)[source]
Steps with a 1 - beta chance of using self.get_robot_acts instead.
DAgger needs to be able to inject imitation policy actions randomly at some subset of time steps. This method has a self.beta chance of keeping the actions passed in as an argument, and a 1 - self.beta chance of forwarding a “robot” (i.e. imitation policy) action generated by self.get_robot_acts instead.
At the end of every episode, a TrajectoryWithRew is saved to self.save_dir, where every saved action is the expert action, regardless of whether the robot action was used during that timestep.
- Parameters
actions (ndarray) – The intended demonstrator/expert actions for the current state. These will be executed with probability self.beta. Otherwise, a “robot” (typically a BC policy) action will be sampled and executed instead via self.get_robot_acts.
- Return type
None
- step_wait()[source]
Returns observation, reward, etc after previous step_async() call.
Stores the transition, and saves trajectory as demo once complete.
- Return type
Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
- Returns
Observation, reward, dones (is terminal?) and info dict.
- traj_accum: Optional[TrajectoryAccumulator]
- property unwrapped: VecEnv
- Return type
VecEnv
- class imitation.algorithms.dagger.DAggerTrainer(*, venv, scratch_dir, rng, beta_schedule=None, bc_trainer, custom_logger=None)[source]
Bases: BaseImitationAlgorithm
DAgger training class with low-level API suitable for interactive human feedback.
In essence, this is just BC with some helpers for incrementally resuming training and interpolating between demonstrator/learnt policies. Interaction proceeds in “rounds” in which the demonstrator first provides a fresh set of demonstrations, and then an underlying BC is invoked to fine-tune the policy on the entire set of demonstrations collected in all rounds so far. Demonstrations and policy/trainer checkpoints are stored in a directory with the following structure:
scratch-dir-name/
    checkpoint-001.pt
    checkpoint-002.pt
    …
    checkpoint-XYZ.pt
    checkpoint-latest.pt
    demos/
        round-000/
            demos_round_000_000.npz
            demos_round_000_001.npz
            …
        round-001/
            demos_round_001_000.npz
            …
        …
        round-XYZ/
            …
- DEFAULT_N_EPOCHS: int = 4
The default number of BC training epochs in extend_and_update.
- __init__(*, venv, scratch_dir, rng, beta_schedule=None, bc_trainer, custom_logger=None)[source]
Builds DAggerTrainer.
- Parameters
venv (VecEnv) – Vectorized training environment.
scratch_dir (Union[str, bytes, PathLike]) – Directory to use to store intermediate training information (e.g. for resuming training).
rng (Generator) – random state for random number generation.
beta_schedule (Optional[Callable[[int], float]]) – Provides a value of beta (the probability of taking expert action in any given state) at each round of training. If None, then linear_beta_schedule will be used instead.
bc_trainer (BC) – A BC instance used to train the underlying policy.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property batch_size: int
- Return type
int
- create_trajectory_collector()[source]
Create trajectory collector to extend current round’s demonstration set.
- Return type
InteractiveTrajectoryCollector
- Returns
A collector configured with the appropriate beta, imitator policy, etc. for the current round. Refer to the documentation for InteractiveTrajectoryCollector to see how to use this.
- extend_and_update(bc_train_kwargs=None)[source]
Extend internal batch of data and train BC.
Specifically, this method will load new transitions (if necessary), train the model for a while, and advance the round counter. If there are no fresh demonstrations in the demonstration directory for the current round, then this will raise a NeedsDemosException instead of training or advancing the round counter. In that case, the user should call .create_trajectory_collector() and use the returned InteractiveTrajectoryCollector to produce a new set of demonstrations for the current interaction round.
- Parameters
bc_train_kwargs (Optional[Mapping[str, Any]]) – Keyword arguments for calling BC.train(). If the log_rollouts_venv key is not provided, then it is set to self.venv by default. If neither of the n_epochs and n_batches keys are provided, then n_epochs is set to self.DEFAULT_N_EPOCHS.
- Return type
int
- Returns
New round number after advancing the round counter.
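The round structure described above is easiest to see as a loop. The following rough sketch assumes a DAggerTrainer instance dagger_trainer built with the same venv, scratch_dir, rng and bc_trainer as in the example earlier in this section, plus the expert policy from that example; it is only an outline of the interaction pattern, not a replacement for SimpleDAggerTrainer.train():
for _ in range(4):  # a few interaction rounds
    collector = dagger_trainer.create_trajectory_collector()
    obs = collector.reset()
    for _ in range(500):  # gather demonstrations for this round
        # The collector forwards the expert action with probability beta and
        # otherwise substitutes the current imitation policy's action.
        expert_actions, _ = expert.predict(obs, deterministic=True)
        obs, _, _, _ = collector.step(expert_actions)
    dagger_trainer.extend_and_update()  # fit BC on all demonstrations collected so far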
- property logger: HierarchicalLogger
Returns logger for this object.
- Return type
HierarchicalLogger
- property policy: BasePolicy
- Return type
BasePolicy
- save_trainer()[source]
Create a snapshot of trainer in the scratch/working directory.
The created snapshot can be reloaded with reconstruct_trainer(). In addition to saving one copy of the policy in the trainer snapshot, this method saves a second copy of the policy in its own file. Having a second copy of the policy is convenient because it can be loaded on its own and passed to evaluation routines for other algorithms.
- Returns
checkpoint_path: a path to one of the created DAggerTrainer checkpoints.
policy_path: a path to one of the created DAggerTrainer policies.
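For checkpointing, a hedged sketch: save_trainer returns the checkpoint and policy paths, and the reconstruct_trainer function referenced above can rebuild the trainer from the scratch directory (the exact reconstruct_trainer signature is an assumption here and should be checked against the module reference):
from imitation.algorithms import dagger
checkpoint_path, policy_path = dagger_trainer.save_trainer()
# Assumed call: reconstruct_trainer(scratch_dir, venv); verify against the API reference.
restored_trainer = dagger.reconstruct_trainer(dagger_trainer.scratch_dir, venv=env)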
- class imitation.algorithms.dagger.SimpleDAggerTrainer(*, venv, scratch_dir, expert_policy, rng, expert_trajs=None, **dagger_trainer_kwargs)[source]
Bases: DAggerTrainer
Simpler subclass of DAggerTrainer for training with synthetic feedback.
- DEFAULT_N_EPOCHS: int = 4
The default number of BC training epochs in extend_and_update.
- __init__(*, venv, scratch_dir, expert_policy, rng, expert_trajs=None, **dagger_trainer_kwargs)[source]
Builds SimpleDAggerTrainer.
- Parameters
venv (VecEnv) – Vectorized training environment. Note that when the robot action is randomly injected (in accordance with the beta_schedule argument), every individual environment will get a robot action simultaneously for that timestep.
scratch_dir (Union[str, bytes, PathLike]) – Directory to use to store intermediate training information (e.g. for resuming training).
expert_policy (BasePolicy) – The expert policy used to generate synthetic demonstrations.
rng (Generator) – Random state to use for the random number generator.
expert_trajs (Optional[Sequence[Trajectory]]) – Optional starting dataset that is inserted into the round 0 dataset.
dagger_trainer_kwargs – Other keyword arguments passed to the superclass initializer DAggerTrainer.__init__.
- Raises
ValueError – The observation or action space does not match between venv and expert_policy.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property batch_size: int
- Return type
int
- create_trajectory_collector()
Create trajectory collector to extend current round’s demonstration set.
- Return type
InteractiveTrajectoryCollector
- Returns
A collector configured with the appropriate beta, imitator policy, etc. for the current round. Refer to the documentation for InteractiveTrajectoryCollector to see how to use this.
- extend_and_update(bc_train_kwargs=None)
Extend internal batch of data and train BC.
Specifically, this method will load new transitions (if necessary), train the model for a while, and advance the round counter. If there are no fresh demonstrations in the demonstration directory for the current round, then this will raise a NeedsDemosException instead of training or advancing the round counter. In that case, the user should call .create_trajectory_collector() and use the returned InteractiveTrajectoryCollector to produce a new set of demonstrations for the current interaction round.
- Parameters
bc_train_kwargs (Optional[Mapping[str, Any]]) – Keyword arguments for calling BC.train(). If the log_rollouts_venv key is not provided, then it is set to self.venv by default. If neither of the n_epochs and n_batches keys are provided, then n_epochs is set to self.DEFAULT_N_EPOCHS.
- Return type
int
- Returns
New round number after advancing the round counter.
- property logger: HierarchicalLogger
Returns logger for this object.
- Return type
HierarchicalLogger
- property policy: BasePolicy
- Return type
BasePolicy
- save_trainer()
Create a snapshot of trainer in the scratch/working directory.
The created snapshot can be reloaded with reconstruct_trainer(). In addition to saving one copy of the policy in the trainer snapshot, this method saves a second copy of the policy in its own file. Having a second copy of the policy is convenient because it can be loaded on its own and passed to evaluation routines for other algorithms.
- Returns
checkpoint_path: a path to one of the created DAggerTrainer checkpoints.
policy_path: a path to one of the created DAggerTrainer policies.
- train(total_timesteps, *, rollout_round_min_episodes=3, rollout_round_min_timesteps=500, bc_train_kwargs=None)[source]
Train the DAgger agent.
The agent is trained in “rounds” where each round consists of a dataset aggregation step followed by BC update step.
During a dataset aggregation step, self.expert_policy is used to perform rollouts in the environment but there is a 1 - beta chance (beta is determined from the round number and self.beta_schedule) that the DAgger agent’s action is used instead. Regardless of whether the DAgger agent’s action is used during the rollout, the expert action and corresponding observation are always appended to the dataset. The number of environment steps in the dataset aggregation stage is determined by the rollout_round_min* arguments.
During a BC update step, BC.train() is called to update the DAgger agent on all data collected so far.
- Parameters
total_timesteps (int) – The number of timesteps to train inside the environment. In practice this is a lower bound, because the number of timesteps is rounded up to finish the minimum number of episodes or timesteps in the last DAgger training round, and the environment timesteps are executed in multiples of self.venv.num_envs.
rollout_round_min_episodes (int) – The number of episodes that must be completed before a dataset aggregation step ends.
rollout_round_min_timesteps (int) – The number of environment timesteps that must be completed before a dataset aggregation step ends. Note that any round will always train for at least self.batch_size timesteps, because otherwise BC could fail to receive any batches.
bc_train_kwargs (Optional[dict]) – Keyword arguments for calling BC.train(). If the log_rollouts_venv key is not provided, then it is set to self.venv by default. If neither of the n_epochs and n_batches keys are provided, then n_epochs is set to self.DEFAULT_N_EPOCHS.
- Return type
None
Density-Based Reward Modeling#
Density-based reward modeling is an inverse reinforcement learning (IRL) technique that assigns higher rewards to states or state-action pairs that occur more frequently in an expert’s demonstrations. This variant utilizes kernel density estimation to model the underlying distribution of expert demonstrations. It assigns rewards to states or state-action pairs based on their estimated log-likelihood under the distribution of expert demonstrations.
The key intuition behind this method is to incentivize the agent to take actions that resemble the expert’s actions in similar states.
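To make that intuition concrete, here is a minimal standalone sketch of the idea, using scikit-learn directly rather than the library’s DensityAlgorithm (the expert data below is a random placeholder):
import numpy as np
from sklearn.neighbors import KernelDensity
# Placeholder for flattened expert (state, action) pairs, shape (n_samples, n_features).
expert_state_actions = np.random.default_rng(0).normal(size=(1000, 5))
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(expert_state_actions)
def density_reward(state_action: np.ndarray) -> float:
    # Higher reward for (s, a) pairs that are more likely under the expert distribution.
    return float(kde.score_samples(state_action.reshape(1, -1))[0])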
While this approach is relatively simple, it does have several drawbacks:
It assumes that the expert demonstrations are representative of the expert’s behavior, which may not always be true.
It does not provide an interpretable reward function.
The kernel density estimation is not well-suited for high-dimensional state-action spaces.
Example#
Detailed example notebook: Learning a Reward Function using Kernel Density
import pprint
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy
from imitation.algorithms import density as db
from imitation.data import serialize
from imitation.util import util
rng = np.random.default_rng(0)
env = util.make_vec_env("Pendulum-v1", rng=rng, n_envs=2)
rollouts = serialize.load("../tests/testdata/expert_models/pendulum_0/rollouts/final.npz")
imitation_trainer = PPO(
ActorCriticPolicy, env, learning_rate=3e-4, gamma=0.95, ent_coef=1e-4, n_steps=2048
)
density_trainer = db.DensityAlgorithm(
venv=env,
rng=rng,
demonstrations=rollouts,
rl_algo=imitation_trainer,
density_type=db.DensityType.STATE_ACTION_DENSITY,
is_stationary=True,
kernel="gaussian",
kernel_bandwidth=0.4,
standardise_inputs=True,
)
density_trainer.train()
def print_stats(density_trainer, n_trajectories):
stats = density_trainer.test_policy(n_trajectories=n_trajectories)
print("True reward function stats:")
pprint.pprint(stats)
stats_im = density_trainer.test_policy(true_reward=False, n_trajectories=n_trajectories)
print("Imitation reward function stats:")
pprint.pprint(stats_im)
print("Stats before training:")
print_stats(density_trainer, 1)
density_trainer.train_policy(100) # Train for 1_000_000 steps to approach expert performance.
print("Stats after training:")
print_stats(density_trainer, 1)
API#
- class imitation.algorithms.density.DensityAlgorithm(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]
Bases: DemonstrationAlgorithm
Learns a reward function based on density modeling.
Specifically, it constructs a non-parametric estimate of p(s), p(s,a), p(s,s’) and then computes a reward using the log of these probabilities.
- __init__(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]
Builds DensityAlgorithm.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – expert demonstration trajectories.
density_type (DensityType) – type of density to train on: single state, state-action pairs, or state-state pairs.
kernel (str) – kernel to use for density estimation with sklearn.KernelDensity.
kernel_bandwidth (float) – bandwidth of kernel. If standardise_inputs is true and you are using a Gaussian kernel, then it probably makes sense to set this somewhere between 0.1 and 1.
venv (VecEnv) – The environment to learn a reward model in. We don’t actually need any environment interaction to fit the reward model, but we use this to extract the observation and action space, and to train the RL algorithm rl_algo (if specified).
rng (Generator) – random state for sampling from demonstrations.
rl_algo (Optional[BaseAlgorithm]) – An RL algorithm to train on the resulting reward model (optional).
is_stationary (bool) – if True, share same density models for all timesteps; if False, use a different density model for each timestep. A non-stationary model is particularly likely to be useful when using STATE_DENSITY, to encourage the agent to imitate entire trajectories, not just a few states that have high frequency in the demonstration dataset. If non-stationary, demonstrations must be trajectories, not transitions (which do not contain timesteps).
standardise_inputs (bool) – if True, then the inputs to the reward model will be standardised to have zero mean and unit variance over the demonstration trajectories. Otherwise, inputs will be passed to the reward model with their ordinary scale.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- buffering_wrapper: BufferingWrapper
- density_type: DensityType
- is_stationary: bool
- kernel: str
- kernel_bandwidth: float
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- rl_algo: Optional[BaseAlgorithm]
- set_demonstrations(demonstrations)[source]
Sets the demonstration data.
- Return type
None
- standardise: bool
- test_policy(*, n_trajectories=10, true_reward=True)[source]
Test current imitation policy on environment & give some rollout stats.
- Parameters
n_trajectories (int) – number of rolled-out trajectories.
true_reward (bool) – should this use ground truth reward from underlying environment (True), or imitation reward (False)?
- Returns
Rollout statistics collected by imitation.data.rollout.rollout_stats().
- Return type
dict
- train()[source]
Fits the density model to demonstration data self.transitions.
- Return type
None
- train_policy(n_timesteps=1000000, **kwargs)[source]
Train the imitation policy for a given number of timesteps.
- Parameters
n_timesteps (int) – number of timesteps to train the policy for.
kwargs (dict) – extra arguments that will be passed to the learn() method of the imitation RL model. Refer to Stable Baselines docs for details.
- Return type
None
- transitions: Dict[Optional[int], ndarray]
- venv: VecEnv
- venv_wrapped: RewardVecEnvWrapper
- wrapper_callback: WrappedRewardCallback
Maximum Causal Entropy Inverse Reinforcement Learning (MCE IRL)#
Implements Modeling Interaction via the Principle of Maximum Causal Entropy.
Example#
Detailed example notebook: Learn a Reward Function using Maximum Conditional Entropy Inverse Reinforcement Learning
from functools import partial
from seals import base_envs
from seals.diagnostics.cliff_world import CliffWorldEnv
import numpy as np
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.algorithms.mce_irl import (
MCEIRL,
mce_occupancy_measures,
mce_partition_fh,
)
from imitation.data import rollout
from imitation.rewards import reward_nets
rng = np.random.default_rng(0)
env_creator = partial(CliffWorldEnv, height=4, horizon=8, width=7, use_xy_obs=True)
env_single = env_creator()
state_env_creator = lambda: base_envs.ExposePOMDPStateWrapper(env_creator())
# This is just a vectorized environment because `generate_trajectories` expects one
state_venv = DummyVecEnv([state_env_creator] * 4)
_, _, pi = mce_partition_fh(env_single)
_, om = mce_occupancy_measures(env_single, pi=pi)
reward_net = reward_nets.BasicRewardNet(
env_single.observation_space,
env_single.action_space,
hid_sizes=[256],
use_action=False,
use_done=False,
use_next_state=False,
)
# training on analytically computed occupancy measures
mce_irl = MCEIRL(
om,
env_single,
reward_net,
log_interval=250,
optimizer_kwargs={"lr": 0.01},
rng=rng,
)
occ_measure = mce_irl.train()
imitation_trajs = rollout.generate_trajectories(
policy=mce_irl.policy,
venv=state_venv,
sample_until=rollout.make_min_timesteps(5000),
rng=rng,
)
print("Imitation stats: ", rollout.rollout_stats(imitation_trajs))
API#
- class imitation.algorithms.mce_irl.MCEIRL(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)[source]
Bases: DemonstrationAlgorithm[TransitionsMinimal]
Tabular MCE IRL.
Reward is a function of observations, but policy is a function of states.
The “observations” effectively exist just to let MCE IRL learn a reward in a reasonable feature space, giving a helpful inductive bias, e.g. that similar states have similar reward.
Since we are performing planning to compute the policy, there is no need for function approximation in the policy.
- __init__(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)[source]
Creates MCE IRL.
- Parameters
demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Can be a sequence of trajectories, or transitions, an iterable over mappings that represent a batch of transitions, or a state occupancy measure. The demonstrations must have observations one-hot coded unless demonstrations is a state-occupancy measure.
env (TabularModelPOMDP) – a tabular MDP.
rng (Generator) – random state used for sampling from policy.
reward_net (RewardNet) – a neural network that computes rewards for the supplied observations.
optimizer_cls (Type[Optimizer]) – optimizer to use for supervised training.
optimizer_kwargs (Optional[Mapping[str, Any]]) – keyword arguments for optimizer construction.
discount (float) – the discount factor to use when computing occupancy measure. If not 1.0 (undiscounted), then demonstrations must either be a (discounted) state-occupancy measure, or trajectories. Transitions are not allowed as we cannot discount them appropriately without knowing the timestep they were drawn from.
linf_eps (float) – optimisation terminates if the \(\ell_\infty\) distance between the demonstrator’s state occupancy measure and the state occupancy measure for the current reward falls below this value.
grad_l2_eps (float) – optimisation also terminates if the \(\ell_2\) norm of the MCE IRL gradient falls below this value.
log_interval (Optional[int]) – how often to log current loss stats (using logging). None to disable.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- Raises
ValueError – if the env horizon is not finite (or an integer).
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- demo_state_om: Optional[ndarray]
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- set_demonstrations(demonstrations)[source]
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(max_iter=1000)[source]
Runs MCE IRL.
- Parameters
max_iter (int) – The maximum number of iterations to train for. May terminate earlier if self.linf_eps or self.grad_l2_eps thresholds are reached.
- Return type
ndarray
- Returns
State occupancy measure for the final reward function. self.reward_net and self.optimizer will be updated in-place during optimisation.
- class imitation.algorithms.base.DemonstrationAlgorithm(*, demonstrations, custom_logger=None, allow_variable_horizon=False)[source]
Bases: BaseImitationAlgorithm, Generic[TransitionKind]
An algorithm that learns from demonstration: BC, IRL, etc.
- __init__(*, demonstrations, custom_logger=None, allow_variable_horizon=False)[source]
Creates an algorithm that learns from demonstrations.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html before overriding this.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- abstract property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- abstract set_demonstrations(demonstrations)[source]
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
Preference Comparisons#
The preference comparison algorithm learns a reward function from preferences between pairs of trajectories. The comparisons are modeled as being generated from a Bradley-Terry (or Boltzmann rational) model, in which the probability of preferring trajectory A over trajectory B grows with the difference between their returns under the learned reward. In other words, the difference in returns forms the logit of a binary classification problem, and accordingly the reward function is trained with a cross-entropy loss to predict the observed preferences.
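Concretely, writing $R(\tau) = \sum_t r_\theta(s_t, a_t)$ for the return of a trajectory fragment under the learned reward $r_\theta$, the model assigns $P(A \succ B) = \frac{\exp R(A)}{\exp R(A) + \exp R(B)} = \sigma\bigl(R(A) - R(B)\bigr)$, and the reward network is trained with the binary cross-entropy loss $-\bigl[p \log P(A \succ B) + (1 - p) \log P(B \succ A)\bigr]$, where $p \in [0, 1]$ is the preference label for the pair.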
Note
Our implementation is based on the Deep Reinforcement Learning from Human Preferences algorithm.
An ensemble of reward networks can also be trained instead of a single network. The disagreement between the ensemble members’ preference predictions can then be used to actively select which preference queries to ask.
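To illustrate the idea (a toy sketch of the principle, not imitation’s internal implementation), one can score each candidate fragment pair by how much hypothetical ensemble members disagree about which fragment is preferred, and query the most contentious pairs first:
import torch as th
def preference_probs(returns_a, returns_b):
    # Bradley-Terry probability that fragment A is preferred over fragment B.
    return th.sigmoid(returns_a - returns_b)
def rank_pairs_by_disagreement(ensemble_returns_a, ensemble_returns_b):
    # Both inputs: (n_members, n_pairs) tensors of predicted fragment returns,
    # one row per ensemble member. Higher variance means more disagreement.
    probs = preference_probs(ensemble_returns_a, ensemble_returns_b)
    disagreement = probs.var(dim=0)
    return th.argsort(disagreement, descending=True)
# Toy usage: 3 ensemble members scoring 5 candidate fragment pairs.
returns_a, returns_b = th.randn(3, 5), th.randn(3, 5)
print(rank_pairs_by_disagreement(returns_a, returns_b))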
Example#
You can copy this example to train PPO on Pendulum using a reward model trained on 200 synthetic preference comparisons. For a more detailed example, refer to Learning a Reward Function using Preference Comparisons.
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy
from imitation.algorithms import preference_comparisons
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
rng = np.random.default_rng(0)
venv = make_vec_env("Pendulum-v1", rng=rng)
reward_net = BasicRewardNet(
venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm,
)
fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
preference_model=preference_model,
loss=preference_comparisons.CrossEntropyRewardLoss(),
epochs=10,
rng=rng,
)
agent = PPO(
policy=FeedForward32Policy,
policy_kwargs=dict(
features_extractor_class=NormalizeFeaturesExtractor,
features_extractor_kwargs=dict(normalize_class=RunningNorm),
),
env=venv,
n_steps=2048 // venv.num_envs,
clip_range=0.1,
ent_coef=0.01,
gae_lambda=0.95,
n_epochs=10,
gamma=0.97,
learning_rate=2e-3,
)
trajectory_generator = preference_comparisons.AgentTrainer(
algorithm=agent,
reward_fn=reward_net,
venv=venv,
exploration_frac=0.05,
rng=rng,
)
pref_comparisons = preference_comparisons.PreferenceComparisons(
trajectory_generator,
reward_net,
num_iterations=5, # Set to 60 for better performance
fragmenter=fragmenter,
preference_gatherer=gatherer,
reward_trainer=reward_trainer,
initial_epoch_multiplier=4,
initial_comparison_frac=0.1,
query_schedule="hyperbolic",
)
pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)
n_eval_episodes = 10
reward_mean, reward_std = evaluate_policy(agent.policy, venv, n_eval_episodes)
reward_stderr = reward_std/np.sqrt(n_eval_episodes)
print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")
API#
- class imitation.algorithms.preference_comparisons.PreferenceComparisons(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]
Bases:
BaseImitationAlgorithm
Main interface for reward learning using preference comparisons.
- __init__(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]
Initialize the preference comparison trainer.
The loggers of all subcomponents are overridden with the logger used by this class.
- Parameters
trajectory_generator (TrajectoryGenerator) – generates trajectories while optionally training an RL agent on the learned reward function (can also be a sampler from a static dataset of trajectories though).
reward_model (RewardNet) – a RewardNet instance to be used for learning the reward.
num_iterations (int) – number of times to train the agent against the reward model and then train the reward model against newly gathered preferences.
fragmenter (Optional[Fragmenter]) – takes in a set of trajectories and returns pairs of fragments for which preferences will be gathered. These fragments could be random, or they could be selected more deliberately (active learning). Default is a random fragmenter.
preference_gatherer (Optional[PreferenceGatherer]) – how to get preferences between trajectory fragments. Default (and currently the only option) is to use synthetic preferences based on ground-truth rewards. Human preferences could be implemented here in the future.
reward_trainer (Optional[RewardTrainer]) – trains the reward model based on pairs of fragments and associated preferences. Default is to use the preference model and loss function from DRLHP.
comparison_queue_size (Optional[int]) – the maximum number of comparisons to keep in the queue for training the reward model. If None, the queue will grow without bound as new comparisons are added.
fragment_length (int) – number of timesteps per fragment that is used to elicit preferences.
transition_oversampling (float) – factor by which to oversample transitions before creating fragments. Since fragments are sampled with replacement, this is usually chosen > 1 to avoid having the same transition in too many fragments.
initial_comparison_frac (float) – fraction of the total_comparisons argument to train() that will be sampled before the rest of training begins (using a randomly initialized agent). This can be used to pretrain the reward model before the agent is trained on the learned reward, to help avoid irreversibly learning a bad policy from an untrained reward. Note that there will often be some additional pretraining comparisons since comparisons_per_iteration won’t exactly divide the total number of comparisons. How many such comparisons there are depends discontinuously on total_comparisons and comparisons_per_iteration.
initial_epoch_multiplier (float) – before agent training begins, train the reward model for this many more epochs than usual (on fragments sampled from a random agent).
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
rng (Optional[Generator]) – random number generator to use for initializing subcomponents such as fragmenter. Only used when default components are used; if you instantiate your own fragmenter, preference gatherer, etc., you are responsible for seeding them!
query_schedule (Union[str, Callable[[float], float]]) – one of (“constant”, “hyperbolic”, “inverse_quadratic”), or a function that takes in a float between 0 and 1 inclusive, representing a fraction of the total number of timesteps elapsed up to some time T, and returns a potentially unnormalized probability indicating the fraction of total_comparisons that should be queried at that iteration. This function will be called num_iterations times in __init__() with values from np.linspace(0, 1, num_iterations) as input. The outputs will be normalized to sum to 1 and then used to apportion the comparisons among the num_iterations iterations.
- Raises
ValueError – if query_schedule is not a valid string or callable.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- train(total_timesteps, total_comparisons, callback=None)[source]
Train the reward model and the policy if applicable.
- Parameters
total_timesteps (int) – number of environment interaction steps.
total_comparisons (int) – number of preferences to gather in total.
callback (Optional[Callable[[int], None]]) – callback functions called at the end of each iteration.
- Return type
Mapping[str, Any]
- Returns
A dictionary with final metrics such as loss and accuracy of the reward model.
- class imitation.algorithms.base.BaseImitationAlgorithm(*, custom_logger=None, allow_variable_horizon=False)[source]
Bases: ABC
Base class for all imitation learning algorithms.
- __init__(*, custom_logger=None, allow_variable_horizon=False)[source]
Creates an imitation learning algorithm.
- Parameters
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html before overriding this.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
Soft Q Imitation Learning (SQIL)#
Soft Q Imitation Learning (SQIL) learns to imitate a policy from demonstrations by using the DQN algorithm with modified rewards. During each policy update, half of the batch is sampled from the demonstrations and half from the environment. Expert transitions are assigned a reward of 1, while transitions collected from the environment are assigned a reward of 0. This encourages the policy to imitate the demonstrations and, at the same time, to avoid states not seen in the demonstrations.
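As a conceptual illustration of the reward relabeling (a toy sketch, not the library’s internal replay-buffer code), each training batch can be thought of as half expert transitions labeled with reward 1 and half environment transitions labeled with reward 0:
import numpy as np
def make_sqil_batch(expert_obs, expert_acts, env_obs, env_acts, batch_size, rng):
    # Half of the batch comes from the demonstrations (reward 1),
    # half from transitions collected in the environment (reward 0).
    half = batch_size // 2
    e_idx = rng.integers(len(expert_obs), size=half)
    g_idx = rng.integers(len(env_obs), size=half)
    obs = np.concatenate([expert_obs[e_idx], env_obs[g_idx]])
    acts = np.concatenate([expert_acts[e_idx], env_acts[g_idx]])
    rewards = np.concatenate([np.ones(half), np.zeros(half)])
    return obs, acts, rewards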
Note
This implementation is based on the DQN implementation in Stable Baselines 3, which does not implement soft Q-learning and therefore does not support continuous actions. As a result, this implementation only supports discrete actions, and the name “soft” Q-learning can be misleading.
Example#
Detailed example notebook: Train an Agent using Soft Q Imitation Learning
import datasets
import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.algorithms import sqil
from imitation.data import huggingface_utils
# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-CartPole-v1")
rollouts = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])
sqil_trainer = sqil.SQIL(
venv=DummyVecEnv([lambda: gym.make("CartPole-v1")]),
demonstrations=rollouts,
policy="MlpPolicy",
)
# Hint: set to 1_000_000 to match the expert performance.
sqil_trainer.train(total_timesteps=1_000)
reward, _ = evaluate_policy(sqil_trainer.policy, sqil_trainer.venv, 10)
print("Reward:", reward)
API#
- class imitation.algorithms.sqil.SQIL(*, venv, demonstrations, policy, custom_logger=None, rl_algo_class=<class 'stable_baselines3.dqn.dqn.DQN'>, rl_kwargs=None)[source]
Bases: DemonstrationAlgorithm[Transitions]
Soft Q Imitation Learning (SQIL).
Trains a policy via DQN-style Q-learning, replacing half the buffer with expert demonstrations and adjusting the rewards.
- __init__(*, venv, demonstrations, policy, custom_logger=None, rl_algo_class=<class 'stable_baselines3.dqn.dqn.DQN'>, rl_kwargs=None)[source]
Builds SQIL.
- Parameters
venv (VecEnv) – The vectorized environment to train on.
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations to use for training.
policy (Union[str, Type[BasePolicy]]) – The policy model to use (SB3).
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
rl_algo_class (Type[OffPolicyAlgorithm]) – Off-policy RL algorithm to use.
rl_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the RL algorithm constructor.
- Raises
ValueError – if rl_kwargs includes a key replay_buffer_class or replay_buffer_kwargs.
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- expert_buffer: ReplayBuffer
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- set_demonstrations(demonstrations)[source]
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(*, total_timesteps, tb_log_name='SQIL', **kwargs)[source]
Train an Agent using Behavior Cloning#
Behavior cloning is the most naive approach to imitation learning: we take the transitions from trajectories generated by some expert and use them as training samples for a new policy. The method has well-known drawbacks (for example, errors compound once the learner drifts away from the expert’s state distribution) and often fails on harder tasks. However, in this example, where we train an agent for the seals/CartPole-v0 environment, it works well.
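At its core, BC is supervised learning: the policy $\pi_\theta$ is trained to minimize the negative log-likelihood of the expert’s actions, $-\mathbb{E}_{(s, a) \sim \mathrm{expert}}[\log \pi_\theta(a \mid s)]$. Judging by the training output further down (the neglogp, ent_loss and l2_loss columns), imitation’s BC loss combines this term with entropy and L2 regularization terms.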
Note that we use a variant of the CartPole environment from the seals package, which has fixed episode durations. Read more about why we do this here.
First we need some kind of expert in CartPole so we can sample some expert trajectories. For convenience we just download one from the HuggingFace model hub.
If you want to train an expert yourself, have a look at the training documentation of RL Baselines3 Zoo.
import numpy as np
import gymnasium as gym
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
from imitation.data.wrappers import RolloutInfoWrapper
env = make_vec_env(
"seals:seals/CartPole-v0",
rng=np.random.default_rng(),
post_wrappers=[
lambda env, _: RolloutInfoWrapper(env)
], # needed for computing rollouts later
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals/CartPole-v0",
venv=env,
)
Let’s quickly check if the expert is any good. It should usually be able to reach a reward of 500, which is the maximum achievable value.
from stable_baselines3.common.evaluation import evaluate_policy
reward, _ = evaluate_policy(expert, env, 10)
print(reward)
500.0
Now we can use the expert to sample some trajectories.
We flatten them right away since we are only interested in the individual transitions for behavior cloning.
imitation comes with a number of helper functions that make collecting those transitions easy. First we collect 50 episode rollouts, then we flatten them to just the transitions that we need for training.
Note that the rollout function requires a vectorized environment and needs the RolloutInfoWrapper around each of the environments. This is why we passed the post_wrappers argument to make_vec_env above.
from imitation.data import rollout
rng = np.random.default_rng()
rollouts = rollout.rollout(
expert,
env,
rollout.make_sample_until(min_timesteps=None, min_episodes=50),
rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)
Let’s have a quick look at what we just generated using those library functions:
print(
f"""The `rollout` function generated a list of {len(rollouts)} {type(rollouts[0])}.
After flattening, this list is turned into a {type(transitions)} object containing {len(transitions)} transitions.
The transitions object contains arrays for: {', '.join(transitions.__dict__.keys())}."
"""
)
The `rollout` function generated a list of 56 <class 'imitation.data.types.TrajectoryWithRew'>.
After flattening, this list is turned into a <class 'imitation.data.types.Transitions'> object containing 28000 transitions.
The transitions object contains arrays for: obs, acts, infos, next_obs, dones."
After we collected our transitions, it’s time to set up our behavior cloning algorithm.
from imitation.algorithms import bc
bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
demonstrations=transitions,
rng=rng,
)
As you can see, the untrained policy only achieves poor rewards:
reward_before_training, _ = evaluate_policy(bc_trainer.policy, env, 10)
print(f"Reward before training: {reward_before_training}")
Reward before training: 8.2
After training, we can match the rewards of the expert (500):
bc_trainer.train(n_epochs=1)
reward_after_training, _ = evaluate_policy(bc_trainer.policy, env, 10)
print(f"Reward after training: {reward_after_training}")
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.693 |
| neglogp | 0.694 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 500 |
| ent_loss | -0.000329 |
| entropy | 0.329 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 93.6 |
| loss | 0.266 |
| neglogp | 0.266 |
| prob_true_act | 0.811 |
| samples_so_far | 16032 |
---------------------------------
Reward after training: 500.0
Train an Agent using the DAgger Algorithm#
The DAgger algorithm is an extension of behavior cloning. In behavior cloning, the training trajectories are recorded directly from an expert. In DAgger, the learner generates the trajectories but an expert corrects the actions with the optimal actions in each of the visited states. This ensures that the state distribution of the training data matches that of the learner’s current policy.
First we need an expert to learn from. For convenience we download one from the HuggingFace model hub.
import numpy as np
import gymnasium as gym
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
env = make_vec_env(
"seals:seals/CartPole-v0",
rng=np.random.default_rng(),
n_envs=1,
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals/CartPole-v0",
venv=env,
)
Then we can construct a DAgger trainer and use it to train the policy on the CartPole environment.
import tempfile
from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer
bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
rng=np.random.default_rng(),
)
with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
print(tmpdir)
dagger_trainer = SimpleDAggerTrainer(
venv=env,
scratch_dir=tmpdir,
expert_policy=expert,
bc_trainer=bc_trainer,
rng=np.random.default_rng(),
)
dagger_trainer.train(2000)
/tmp/dagger_example_gqkwws0y
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.692 |
| neglogp | 0.692 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 23 |
| return_mean | 16.8 |
| return_min | 9 |
| return_std | 4.75 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000356 |
| entropy | 0.356 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 86.6 |
| loss | 0.269 |
| neglogp | 0.269 |
| prob_true_act | 0.797 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 96 |
| return_mean | 72.6 |
| return_min | 47 |
| return_std | 17.3 |
---------------------------------
Finally, the evaluation shows that we actually trained a policy that solves the environment (500 is the maximum reward).
from stable_baselines3.common.evaluation import evaluate_policy
reward, _ = evaluate_policy(dagger_trainer.policy, env, 20)
print(reward)
500.0
Train an Agent using Generative Adversarial Imitation Learning#
The idea of generative adversarial imitation learning is to train a discriminator network to distinguish between expert trajectories and learner trajectories. The learner is trained using a standard reinforcement learning algorithm such as PPO and is rewarded for producing trajectories that the discriminator mistakes for expert trajectories.
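For reference, in the original GAIL formulation the discriminator is trained to tell generator samples from expert samples, and the policy’s surrogate reward is a monotone transform of the discriminator output (in the original paper, $r(s, a) = -\log D(s, a)$, with $D$ the estimated probability that the pair was generated by the learner), so the learner is rewarded precisely when the discriminator believes its behaviour is expert-like. The exact reward shaping used by imitation’s implementation may differ in detail; the key point is that the reward comes from the discriminator rather than from the environment.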
As usual, we first need an expert. Again, we download one from the HuggingFace model hub for convenience.
Note that we use a variant of the CartPole environment from the seals package, which has fixed episode durations. Read more about why we do this here.
import numpy as np
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
from imitation.data.wrappers import RolloutInfoWrapper
SEED = 42
env = make_vec_env(
"seals:seals/CartPole-v0",
rng=np.random.default_rng(SEED),
n_envs=8,
post_wrappers=[
lambda env, _: RolloutInfoWrapper(env)
], # needed for computing rollouts later
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals/CartPole-v0",
venv=env,
)
We generate some expert trajectories that the discriminator needs to distinguish from the learner’s trajectories.
from imitation.data import rollout
rollouts = rollout.rollout(
expert,
env,
rollout.make_sample_until(min_timesteps=None, min_episodes=60),
rng=np.random.default_rng(SEED),
)
Now we are ready to set up our GAIL trainer.
Note that the reward_net is actually the network of the discriminator.
We evaluate the learner before and after training so we can see if it made any progress.
First we construct a GAIL trainer …
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
learner = PPO(
env=env,
policy=MlpPolicy,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0004,
gamma=0.95,
n_epochs=5,
seed=SEED,
)
reward_net = BasicRewardNet(
observation_space=env.observation_space,
action_space=env.action_space,
normalize_input_layer=RunningNorm,
)
gail_trainer = GAIL(
demonstrations=rollouts,
demo_batch_size=1024,
gen_replay_buffer_capacity=512,
n_disc_updates_per_round=8,
venv=env,
gen_algo=learner,
reward_net=reward_net,
)
… then we evaluate it before training …
env.seed(SEED)
learner_rewards_before_training, _ = evaluate_policy(
learner, env, 100, return_episode_rewards=True
)
… and train it …
gail_trainer.train(200_000)
------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 29.8 |
| gen/time/fps | 4208 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 16384 |
------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.696 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.694 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.693 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.69 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.688 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.686 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.684 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.683 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.689 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 29.8 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.64e+04 |
| gen/train/approx_kl | 0.00905 |
| gen/train/clip_fraction | 0.0295 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.0301 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.127 |
| gen/train/n_updates | 5 |
| gen/train/policy_gradient_loss | -0.0015 |
| gen/train/value_loss | 4.43 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 31.9 |
| gen/rollout/ep_rew_wrapped_mean | 268 |
| gen/time/fps | 4212 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 32768 |
| gen/train/approx_kl | 0.009048736 |
| gen/train/clip_fraction | 0.0295 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.0301 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.127 |
| gen/train/n_updates | 5 |
| gen/train/policy_gradient_loss | -0.0015 |
| gen/train/value_loss | 4.43 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.685 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.684 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.683 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.682 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.68 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.68 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.679 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.678 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.691 |
| disc/disc_loss | 0.681 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 31.9 |
| gen/rollout/ep_rew_wrapped_mean | 268 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 3.28e+04 |
| gen/train/approx_kl | 0.0102 |
| gen/train/clip_fraction | 0.133 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.841 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0145 |
| gen/train/n_updates | 10 |
| gen/train/policy_gradient_loss | -0.00786 |
| gen/train/value_loss | 0.248 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 34.1 |
| gen/rollout/ep_rew_wrapped_mean | 275 |
| gen/time/fps | 4213 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 49152 |
| gen/train/approx_kl | 0.010180451 |
| gen/train/clip_fraction | 0.133 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.841 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0145 |
| gen/train/n_updates | 10 |
| gen/train/policy_gradient_loss | -0.00786 |
| gen/train/value_loss | 0.248 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.672 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.671 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.67 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.668 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.667 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.668 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.689 |
| disc/disc_loss | 0.664 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.689 |
| disc/disc_loss | 0.661 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.69 |
| disc/disc_loss | 0.668 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 34.1 |
| gen/rollout/ep_rew_wrapped_mean | 275 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 4.92e+04 |
| gen/train/approx_kl | 0.0153 |
| gen/train/clip_fraction | 0.195 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.673 |
| gen/train/explained_variance | 0.815 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0246 |
| gen/train/n_updates | 15 |
| gen/train/policy_gradient_loss | -0.0135 |
| gen/train/value_loss | 0.0463 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 37.8 |
| gen/rollout/ep_rew_wrapped_mean | 277 |
| gen/time/fps | 4207 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 65536 |
| gen/train/approx_kl | 0.015265099 |
| gen/train/clip_fraction | 0.195 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.673 |
| gen/train/explained_variance | 0.815 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0246 |
| gen/train/n_updates | 15 |
| gen/train/policy_gradient_loss | -0.0135 |
| gen/train/value_loss | 0.0463 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.652 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.646 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.646 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.685 |
| disc/disc_loss | 0.64 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.685 |
| disc/disc_loss | 0.638 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.684 |
| disc/disc_loss | 0.634 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.628 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.625 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.685 |
| disc/disc_loss | 0.639 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 37.8 |
| gen/rollout/ep_rew_wrapped_mean | 277 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 6.55e+04 |
| gen/train/approx_kl | 0.0161 |
| gen/train/clip_fraction | 0.215 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.654 |
| gen/train/explained_variance | 0.892 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0168 |
| gen/train/n_updates | 20 |
| gen/train/policy_gradient_loss | -0.0195 |
| gen/train/value_loss | 0.0173 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.4 |
| gen/rollout/ep_rew_wrapped_mean | 284 |
| gen/time/fps | 4206 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 81920 |
| gen/train/approx_kl | 0.016116062 |
| gen/train/clip_fraction | 0.215 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.654 |
| gen/train/explained_variance | 0.892 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0168 |
| gen/train/n_updates | 20 |
| gen/train/policy_gradient_loss | -0.0195 |
| gen/train/value_loss | 0.0173 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.689 |
| disc/disc_loss | 0.659 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.689 |
| disc/disc_loss | 0.659 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.688 |
| disc/disc_loss | 0.655 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.651 |
| disc/disc_proportion_expert_pred | 0 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.5 |
| disc/disc_acc_expert | 0.000977 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.688 |
| disc/disc_loss | 0.652 |
| disc/disc_proportion_expert_pred | 0.000488 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.573 |
| disc/disc_acc_expert | 0.146 |
| disc/disc_acc_gen | 1 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.647 |
| disc/disc_proportion_expert_pred | 0.0728 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.684 |
| disc/disc_acc_expert | 0.374 |
| disc/disc_acc_gen | 0.993 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.647 |
| disc/disc_proportion_expert_pred | 0.19 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.708 |
| disc/disc_acc_expert | 0.434 |
| disc/disc_acc_gen | 0.983 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.642 |
| disc/disc_proportion_expert_pred | 0.225 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.558 |
| disc/disc_acc_expert | 0.119 |
| disc/disc_acc_gen | 0.997 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.652 |
| disc/disc_proportion_expert_pred | 0.0611 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.4 |
| gen/rollout/ep_rew_wrapped_mean | 284 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 8.19e+04 |
| gen/train/approx_kl | 0.0112 |
| gen/train/clip_fraction | 0.129 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.634 |
| gen/train/explained_variance | 0.871 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.00102 |
| gen/train/n_updates | 25 |
| gen/train/policy_gradient_loss | -0.00957 |
| gen/train/value_loss | 0.0103 |
--------------------------------------------------
---------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.8 |
| gen/rollout/ep_rew_wrapped_mean | 288 |
| gen/time/fps | 4212 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 98304 |
| gen/train/approx_kl | 0.01118237 |
| gen/train/clip_fraction | 0.129 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.634 |
| gen/train/explained_variance | 0.871 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.00102 |
| gen/train/n_updates | 25 |
| gen/train/policy_gradient_loss | -0.00957 |
| gen/train/value_loss | 0.0103 |
---------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.732 |
| disc/disc_acc_expert | 0.468 |
| disc/disc_acc_gen | 0.997 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.639 |
| disc/disc_proportion_expert_pred | 0.235 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.718 |
| disc/disc_acc_expert | 0.442 |
| disc/disc_acc_gen | 0.994 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.637 |
| disc/disc_proportion_expert_pred | 0.224 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.736 |
| disc/disc_acc_expert | 0.476 |
| disc/disc_acc_gen | 0.996 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.638 |
| disc/disc_proportion_expert_pred | 0.24 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.734 |
| disc/disc_acc_expert | 0.472 |
| disc/disc_acc_gen | 0.996 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.635 |
| disc/disc_proportion_expert_pred | 0.238 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.714 |
| disc/disc_acc_expert | 0.44 |
| disc/disc_acc_gen | 0.987 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.633 |
| disc/disc_proportion_expert_pred | 0.227 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.746 |
| disc/disc_acc_expert | 0.504 |
| disc/disc_acc_gen | 0.988 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.632 |
| disc/disc_proportion_expert_pred | 0.258 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.819 |
| disc/disc_acc_expert | 0.657 |
| disc/disc_acc_gen | 0.981 |
| disc/disc_entropy | 0.686 |
| disc/disc_loss | 0.631 |
| disc/disc_proportion_expert_pred | 0.338 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.856 |
| disc/disc_acc_expert | 0.733 |
| disc/disc_acc_gen | 0.979 |
| disc/disc_entropy | 0.685 |
| disc/disc_loss | 0.627 |
| disc/disc_proportion_expert_pred | 0.377 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.757 |
| disc/disc_acc_expert | 0.524 |
| disc/disc_acc_gen | 0.99 |
| disc/disc_entropy | 0.687 |
| disc/disc_loss | 0.634 |
| disc/disc_proportion_expert_pred | 0.267 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.8 |
| gen/rollout/ep_rew_wrapped_mean | 288 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 9.83e+04 |
| gen/train/approx_kl | 0.00629 |
| gen/train/clip_fraction | 0.0466 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.635 |
| gen/train/explained_variance | 0.873 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0116 |
| gen/train/n_updates | 30 |
| gen/train/policy_gradient_loss | -0.00363 |
| gen/train/value_loss | 0.0126 |
--------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39.6 |
| gen/rollout/ep_rew_wrapped_mean | 287 |
| gen/time/fps | 4207 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 114688 |
| gen/train/approx_kl | 0.0062911767 |
| gen/train/clip_fraction | 0.0466 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.635 |
| gen/train/explained_variance | 0.873 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0116 |
| gen/train/n_updates | 30 |
| gen/train/policy_gradient_loss | -0.00363 |
| gen/train/value_loss | 0.0126 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.852 |
| disc/disc_acc_expert | 0.735 |
| disc/disc_acc_gen | 0.969 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.62 |
| disc/disc_proportion_expert_pred | 0.383 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.89 |
| disc/disc_acc_expert | 0.812 |
| disc/disc_acc_gen | 0.968 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.616 |
| disc/disc_proportion_expert_pred | 0.422 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.919 |
| disc/disc_acc_expert | 0.875 |
| disc/disc_acc_gen | 0.964 |
| disc/disc_entropy | 0.681 |
| disc/disc_loss | 0.614 |
| disc/disc_proportion_expert_pred | 0.456 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.95 |
| disc/disc_acc_expert | 0.933 |
| disc/disc_acc_gen | 0.967 |
| disc/disc_entropy | 0.681 |
| disc/disc_loss | 0.61 |
| disc/disc_proportion_expert_pred | 0.483 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.963 |
| disc/disc_acc_expert | 0.955 |
| disc/disc_acc_gen | 0.971 |
| disc/disc_entropy | 0.681 |
| disc/disc_loss | 0.608 |
| disc/disc_proportion_expert_pred | 0.492 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.962 |
| disc/disc_acc_expert | 0.962 |
| disc/disc_acc_gen | 0.963 |
| disc/disc_entropy | 0.679 |
| disc/disc_loss | 0.603 |
| disc/disc_proportion_expert_pred | 0.5 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.967 |
| disc/disc_acc_expert | 0.97 |
| disc/disc_acc_gen | 0.964 |
| disc/disc_entropy | 0.679 |
| disc/disc_loss | 0.602 |
| disc/disc_proportion_expert_pred | 0.503 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.958 |
| disc/disc_acc_expert | 0.977 |
| disc/disc_acc_gen | 0.94 |
| disc/disc_entropy | 0.678 |
| disc/disc_loss | 0.597 |
| disc/disc_proportion_expert_pred | 0.518 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.933 |
| disc/disc_acc_expert | 0.902 |
| disc/disc_acc_gen | 0.963 |
| disc/disc_entropy | 0.681 |
| disc/disc_loss | 0.609 |
| disc/disc_proportion_expert_pred | 0.47 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 7 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39.6 |
| gen/rollout/ep_rew_wrapped_mean | 287 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.15e+05 |
| gen/train/approx_kl | 0.0087 |
| gen/train/clip_fraction | 0.0778 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.629 |
| gen/train/explained_variance | 0.928 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0141 |
| gen/train/n_updates | 35 |
| gen/train/policy_gradient_loss | -0.00673 |
| gen/train/value_loss | 0.0171 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39 |
| gen/rollout/ep_rew_wrapped_mean | 282 |
| gen/time/fps | 4209 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 131072 |
| gen/train/approx_kl | 0.008696594 |
| gen/train/clip_fraction | 0.0778 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.629 |
| gen/train/explained_variance | 0.928 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0141 |
| gen/train/n_updates | 35 |
| gen/train/policy_gradient_loss | -0.00673 |
| gen/train/value_loss | 0.0171 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.964 |
| disc/disc_acc_expert | 0.981 |
| disc/disc_acc_gen | 0.946 |
| disc/disc_entropy | 0.671 |
| disc/disc_loss | 0.572 |
| disc/disc_proportion_expert_pred | 0.518 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.977 |
| disc/disc_acc_expert | 0.992 |
| disc/disc_acc_gen | 0.962 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.563 |
| disc/disc_proportion_expert_pred | 0.515 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.969 |
| disc/disc_acc_expert | 0.994 |
| disc/disc_acc_gen | 0.943 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.562 |
| disc/disc_proportion_expert_pred | 0.525 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.97 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.941 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.557 |
| disc/disc_proportion_expert_pred | 0.528 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.97 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.939 |
| disc/disc_entropy | 0.665 |
| disc/disc_loss | 0.553 |
| disc/disc_proportion_expert_pred | 0.53 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.972 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.943 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.552 |
| disc/disc_proportion_expert_pred | 0.528 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.969 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.938 |
| disc/disc_entropy | 0.663 |
| disc/disc_loss | 0.543 |
| disc/disc_proportion_expert_pred | 0.531 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.976 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.952 |
| disc/disc_entropy | 0.661 |
| disc/disc_loss | 0.539 |
| disc/disc_proportion_expert_pred | 0.524 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.971 |
| disc/disc_acc_expert | 0.996 |
| disc/disc_acc_gen | 0.946 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.555 |
| disc/disc_proportion_expert_pred | 0.525 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 8 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39 |
| gen/rollout/ep_rew_wrapped_mean | 282 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.31e+05 |
| gen/train/approx_kl | 0.00855 |
| gen/train/clip_fraction | 0.0715 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.624 |
| gen/train/explained_variance | 0.922 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.00161 |
| gen/train/n_updates | 40 |
| gen/train/policy_gradient_loss | -0.00499 |
| gen/train/value_loss | 0.0237 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39.6 |
| gen/rollout/ep_rew_wrapped_mean | 271 |
| gen/time/fps | 4214 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 147456 |
| gen/train/approx_kl | 0.008551636 |
| gen/train/clip_fraction | 0.0715 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.624 |
| gen/train/explained_variance | 0.922 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.00161 |
| gen/train/n_updates | 40 |
| gen/train/policy_gradient_loss | -0.00499 |
| gen/train/value_loss | 0.0237 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.96 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.922 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.571 |
| disc/disc_proportion_expert_pred | 0.538 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.956 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.912 |
| disc/disc_entropy | 0.672 |
| disc/disc_loss | 0.567 |
| disc/disc_proportion_expert_pred | 0.544 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.962 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.924 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.56 |
| disc/disc_proportion_expert_pred | 0.538 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.966 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.932 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.558 |
| disc/disc_proportion_expert_pred | 0.534 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.954 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.907 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.553 |
| disc/disc_proportion_expert_pred | 0.546 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.951 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.902 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.551 |
| disc/disc_proportion_expert_pred | 0.549 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.96 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.921 |
| disc/disc_entropy | 0.662 |
| disc/disc_loss | 0.539 |
| disc/disc_proportion_expert_pred | 0.54 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.958 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.915 |
| disc/disc_entropy | 0.661 |
| disc/disc_loss | 0.535 |
| disc/disc_proportion_expert_pred | 0.542 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.958 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.917 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.554 |
| disc/disc_proportion_expert_pred | 0.541 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 9 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 39.6 |
| gen/rollout/ep_rew_wrapped_mean | 271 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.47e+05 |
| gen/train/approx_kl | 0.00591 |
| gen/train/clip_fraction | 0.0515 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.613 |
| gen/train/explained_variance | 0.935 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.00763 |
| gen/train/n_updates | 45 |
| gen/train/policy_gradient_loss | -0.00313 |
| gen/train/value_loss | 0.0288 |
--------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 44.7 |
| gen/rollout/ep_rew_wrapped_mean | 259 |
| gen/time/fps | 4213 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 163840 |
| gen/train/approx_kl | 0.0059148837 |
| gen/train/clip_fraction | 0.0515 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.613 |
| gen/train/explained_variance | 0.935 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.00763 |
| gen/train/n_updates | 45 |
| gen/train/policy_gradient_loss | -0.00313 |
| gen/train/value_loss | 0.0288 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.951 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.901 |
| disc/disc_entropy | 0.643 |
| disc/disc_loss | 0.495 |
| disc/disc_proportion_expert_pred | 0.549 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.948 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.896 |
| disc/disc_entropy | 0.638 |
| disc/disc_loss | 0.485 |
| disc/disc_proportion_expert_pred | 0.552 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.955 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.909 |
| disc/disc_entropy | 0.636 |
| disc/disc_loss | 0.481 |
| disc/disc_proportion_expert_pred | 0.545 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.95 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.899 |
| disc/disc_entropy | 0.633 |
| disc/disc_loss | 0.477 |
| disc/disc_proportion_expert_pred | 0.55 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.945 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.89 |
| disc/disc_entropy | 0.633 |
| disc/disc_loss | 0.475 |
| disc/disc_proportion_expert_pred | 0.555 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.942 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.885 |
| disc/disc_entropy | 0.628 |
| disc/disc_loss | 0.468 |
| disc/disc_proportion_expert_pred | 0.558 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.948 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.896 |
| disc/disc_entropy | 0.624 |
| disc/disc_loss | 0.461 |
| disc/disc_proportion_expert_pred | 0.552 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.95 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.899 |
| disc/disc_entropy | 0.615 |
| disc/disc_loss | 0.446 |
| disc/disc_proportion_expert_pred | 0.55 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.948 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.897 |
| disc/disc_entropy | 0.631 |
| disc/disc_loss | 0.474 |
| disc/disc_proportion_expert_pred | 0.552 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 10 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 44.7 |
| gen/rollout/ep_rew_wrapped_mean | 259 |
| gen/time/fps | 4.21e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.64e+05 |
| gen/train/approx_kl | 0.00881 |
| gen/train/clip_fraction | 0.0822 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.596 |
| gen/train/explained_variance | 0.942 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0335 |
| gen/train/n_updates | 50 |
| gen/train/policy_gradient_loss | -0.00478 |
| gen/train/value_loss | 0.0465 |
--------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 50.9 |
| gen/rollout/ep_rew_wrapped_mean | 243 |
| gen/time/fps | 4194 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 180224 |
| gen/train/approx_kl | 0.0088136345 |
| gen/train/clip_fraction | 0.0822 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.596 |
| gen/train/explained_variance | 0.942 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | -0.0335 |
| gen/train/n_updates | 50 |
| gen/train/policy_gradient_loss | -0.00478 |
| gen/train/value_loss | 0.0465 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.788 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.575 |
| disc/disc_entropy | 0.642 |
| disc/disc_loss | 0.543 |
| disc/disc_proportion_expert_pred | 0.712 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.791 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.581 |
| disc/disc_entropy | 0.637 |
| disc/disc_loss | 0.536 |
| disc/disc_proportion_expert_pred | 0.709 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.794 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.588 |
| disc/disc_entropy | 0.632 |
| disc/disc_loss | 0.526 |
| disc/disc_proportion_expert_pred | 0.706 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.781 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.562 |
| disc/disc_entropy | 0.632 |
| disc/disc_loss | 0.533 |
| disc/disc_proportion_expert_pred | 0.719 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.788 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.575 |
| disc/disc_entropy | 0.628 |
| disc/disc_loss | 0.524 |
| disc/disc_proportion_expert_pred | 0.712 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.783 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.566 |
| disc/disc_entropy | 0.624 |
| disc/disc_loss | 0.524 |
| disc/disc_proportion_expert_pred | 0.717 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.787 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.573 |
| disc/disc_entropy | 0.622 |
| disc/disc_loss | 0.519 |
| disc/disc_proportion_expert_pred | 0.713 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.775 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.551 |
| disc/disc_entropy | 0.621 |
| disc/disc_loss | 0.524 |
| disc/disc_proportion_expert_pred | 0.725 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.786 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.572 |
| disc/disc_entropy | 0.63 |
| disc/disc_loss | 0.529 |
| disc/disc_proportion_expert_pred | 0.714 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 11 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 50.9 |
| gen/rollout/ep_rew_wrapped_mean | 243 |
| gen/time/fps | 4.19e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.8e+05 |
| gen/train/approx_kl | 0.00988 |
| gen/train/clip_fraction | 0.117 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.597 |
| gen/train/explained_variance | 0.95 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0114 |
| gen/train/n_updates | 55 |
| gen/train/policy_gradient_loss | -0.00606 |
| gen/train/value_loss | 0.0522 |
--------------------------------------------------
----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 56.4 |
| gen/rollout/ep_rew_wrapped_mean | 229 |
| gen/time/fps | 4199 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 196608 |
| gen/train/approx_kl | 0.009878516 |
| gen/train/clip_fraction | 0.117 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.597 |
| gen/train/explained_variance | 0.95 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.0114 |
| gen/train/n_updates | 55 |
| gen/train/policy_gradient_loss | -0.00606 |
| gen/train/value_loss | 0.0522 |
----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.583 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.165 |
| disc/disc_entropy | 0.671 |
| disc/disc_loss | 0.659 |
| disc/disc_proportion_expert_pred | 0.917 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.588 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.177 |
| disc/disc_entropy | 0.672 |
| disc/disc_loss | 0.653 |
| disc/disc_proportion_expert_pred | 0.912 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.588 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.176 |
| disc/disc_entropy | 0.671 |
| disc/disc_loss | 0.653 |
| disc/disc_proportion_expert_pred | 0.912 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.582 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.163 |
| disc/disc_entropy | 0.672 |
| disc/disc_loss | 0.653 |
| disc/disc_proportion_expert_pred | 0.918 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.596 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.192 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.65 |
| disc/disc_proportion_expert_pred | 0.904 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.596 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.192 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.646 |
| disc/disc_proportion_expert_pred | 0.904 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.61 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.221 |
| disc/disc_entropy | 0.676 |
| disc/disc_loss | 0.645 |
| disc/disc_proportion_expert_pred | 0.89 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.604 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.208 |
| disc/disc_entropy | 0.675 |
| disc/disc_loss | 0.643 |
| disc/disc_proportion_expert_pred | 0.896 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.593 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.187 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.65 |
| disc/disc_proportion_expert_pred | 0.907 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 12 |
| disc/n_expert | 1.02e+03 |
| disc/n_generated | 1.02e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 56.4 |
| gen/rollout/ep_rew_wrapped_mean | 229 |
| gen/time/fps | 4.2e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 3 |
| gen/time/total_timesteps | 1.97e+05 |
| gen/train/approx_kl | 0.0124 |
| gen/train/clip_fraction | 0.148 |
| gen/train/clip_range | 0.2 |
| gen/train/entropy_loss | -0.586 |
| gen/train/explained_variance | 0.968 |
| gen/train/learning_rate | 0.0004 |
| gen/train/loss | 0.000133 |
| gen/train/n_updates | 60 |
| gen/train/policy_gradient_loss | -0.00875 |
| gen/train/value_loss | 0.0555 |
--------------------------------------------------
… and finally evaluate it again.
env.seed(SEED)
learner_rewards_after_training, _ = evaluate_policy(
learner, env, 100, return_episode_rewards=True
)
We can see that an untrained policy performs poorly. With enough training, GAIL matches the expert return of 500; the short run above has not converged yet, so the learner's return is still well below that:
print(
"Rewards before training:",
np.mean(learner_rewards_before_training),
"+/-",
np.std(learner_rewards_before_training),
)
print(
"Rewards after training:",
np.mean(learner_rewards_after_training),
"+/-",
np.std(learner_rewards_after_training),
)
Rewards before training: 102.6 +/- 24.11514047232568
Rewards after training: 49.76 +/- 16.98535840069323
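If you want the learner to actually reach expert-level return, the simplest fix is to train for more environment steps and evaluate again. A minimal sketch, assuming the GAIL trainer constructed earlier in this tutorial is bound to a variable named gail_trainer (a hypothetical name if yours differs) and reusing the learner, env, SEED and evaluate_policy objects from above:
# Continue adversarial training for additional environment steps, then re-evaluate.
gail_trainer.train(800_000)  # more generator/discriminator rounds
env.seed(SEED)
learner_rewards_more_training, _ = evaluate_policy(
    learner, env, 100, return_episode_rewards=True
)
print(
    "Rewards after more training:",
    np.mean(learner_rewards_more_training),
    "+/-",
    np.std(learner_rewards_more_training),
)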
Train an Agent using Adversarial Inverse Reinforcement Learning#
As usual, we first need an expert. Again, we download one from the HuggingFace model hub for convenience.
Note that we now use a variant of the CartPole environment from the seals package, which has fixed episode durations. Read more about why we do this here.
import numpy as np
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env
from imitation.data.wrappers import RolloutInfoWrapper
SEED = 42
FAST = True
if FAST:
N_RL_TRAIN_STEPS = 100_000
else:
N_RL_TRAIN_STEPS = 2_000_000
venv = make_vec_env(
"seals:seals/CartPole-v0",
rng=np.random.default_rng(SEED),
n_envs=8,
post_wrappers=[
lambda env, _: RolloutInfoWrapper(env)
], # needed for computing rollouts later
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name="seals/CartPole-v0",
venv=venv,
)
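Before generating demonstrations, it is worth a quick sanity check that the downloaded expert actually solves the task. A minimal sketch, assuming the expert and venv objects defined above; on seals/CartPole-v0 a good expert should score close to 500:
from stable_baselines3.common.evaluation import evaluate_policy
# Roll out the expert for a few episodes and report its mean return.
expert_rewards, _ = evaluate_policy(expert, venv, 10, return_episode_rewards=True)
print("Expert mean reward:", np.mean(expert_rewards))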
We generate some expert trajectories that the discriminator needs to distinguish from the learner’s trajectories.
from imitation.data import rollout
rollouts = rollout.rollout(
expert,
venv,
rollout.make_sample_until(min_timesteps=None, min_episodes=60),
rng=np.random.default_rng(SEED),
)
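To get a feel for the demonstration data, you can summarise the collected trajectories. A minimal sketch, assuming the rollout_stats helper in imitation.data.rollout and its n_traj / return_mean / len_mean keys:
# Summarise the expert demonstrations: episode count, mean return and mean length.
stats = rollout.rollout_stats(rollouts)
print(
    f"{stats['n_traj']} trajectories,"
    f" mean return {stats['return_mean']:.1f},"
    f" mean length {stats['len_mean']:.0f}"
)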
Now we are ready to set up our AIRL trainer.
Note that the reward_net
is actually the network of the discriminator.
We evaluate the learner before and after training so we can see if it made any progress.
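To see why the reward network doubles as the discriminator, recall the AIRL parameterization (Fu et al., 2018): the discriminator is built from the learned reward/shaping function f_theta and the current generator policy pi,
\[ D_\theta(s, a) = \frac{\exp f_\theta(s, a)}{\exp f_\theta(s, a) + \pi(a \mid s)} \]
(written here for state–action inputs; the shaped reward net above additionally conditions on the next state). Every discriminator update is therefore an update of the reward_net parameters, and the generator's reward signal is derived from f_theta.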
from imitation.algorithms.adversarial.airl import AIRL
from imitation.rewards.reward_nets import BasicShapedRewardNet
from imitation.util.networks import RunningNorm
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
learner = PPO(
env=venv,
policy=MlpPolicy,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0005,
gamma=0.95,
clip_range=0.1,
vf_coef=0.1,
n_epochs=5,
seed=SEED,
)
reward_net = BasicShapedRewardNet(
observation_space=venv.observation_space,
action_space=venv.action_space,
normalize_input_layer=RunningNorm,
)
airl_trainer = AIRL(
demonstrations=rollouts,
demo_batch_size=2048,
gen_replay_buffer_capacity=512,
n_disc_updates_per_round=16,
venv=venv,
gen_algo=learner,
reward_net=reward_net,
)
venv.seed(SEED)
learner_rewards_before_training, _ = evaluate_policy(
learner, venv, 100, return_episode_rewards=True
)
airl_trainer.train(N_RL_TRAIN_STEPS)
venv.seed(SEED)
learner_rewards_after_training, _ = evaluate_policy(
learner, venv, 100, return_episode_rewards=True
)
------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 33.1 |
| gen/time/fps | 3535 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 16384 |
------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.581 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.162 |
| disc/disc_entropy | 0.664 |
| disc/disc_loss | 0.676 |
| disc/disc_proportion_expert_pred | 0.919 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.586 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.172 |
| disc/disc_entropy | 0.664 |
| disc/disc_loss | 0.673 |
| disc/disc_proportion_expert_pred | 0.914 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.593 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.186 |
| disc/disc_entropy | 0.665 |
| disc/disc_loss | 0.669 |
| disc/disc_proportion_expert_pred | 0.907 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.591 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.182 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.672 |
| disc/disc_proportion_expert_pred | 0.909 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.598 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.197 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.665 |
| disc/disc_proportion_expert_pred | 0.902 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.606 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.211 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.662 |
| disc/disc_proportion_expert_pred | 0.894 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.605 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.21 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.659 |
| disc/disc_proportion_expert_pred | 0.895 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.598 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.196 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.66 |
| disc/disc_proportion_expert_pred | 0.902 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.613 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.226 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.654 |
| disc/disc_proportion_expert_pred | 0.887 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.623 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.246 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.65 |
| disc/disc_proportion_expert_pred | 0.877 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.617 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.235 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.651 |
| disc/disc_proportion_expert_pred | 0.883 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.632 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.264 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.645 |
| disc/disc_proportion_expert_pred | 0.868 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.629 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.258 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.644 |
| disc/disc_proportion_expert_pred | 0.871 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.643 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.286 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.641 |
| disc/disc_proportion_expert_pred | 0.857 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.646 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.292 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.637 |
| disc/disc_proportion_expert_pred | 0.854 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.653 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.305 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.633 |
| disc/disc_proportion_expert_pred | 0.847 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.613 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.227 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.656 |
| disc/disc_proportion_expert_pred | 0.887 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 1 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 33.1 |
| gen/time/fps | 3.54e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 1.64e+04 |
| gen/train/approx_kl | 0.00136 |
| gen/train/clip_fraction | 0.0238 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.692 |
| gen/train/explained_variance | -0.0116 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 3.17 |
| gen/train/n_updates | 5 |
| gen/train/policy_gradient_loss | 7.75e-06 |
| gen/train/value_loss | 117 |
--------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 34.6 |
| gen/rollout/ep_rew_wrapped_mean | -525 |
| gen/time/fps | 3538 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 32768 |
| gen/train/approx_kl | 0.0013636536 |
| gen/train/clip_fraction | 0.0238 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.692 |
| gen/train/explained_variance | -0.0116 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 3.17 |
| gen/train/n_updates | 5 |
| gen/train/policy_gradient_loss | 7.75e-06 |
| gen/train/value_loss | 117 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.68 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.36 |
| disc/disc_entropy | 0.664 |
| disc/disc_loss | 0.618 |
| disc/disc_proportion_expert_pred | 0.82 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.687 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.375 |
| disc/disc_entropy | 0.664 |
| disc/disc_loss | 0.615 |
| disc/disc_proportion_expert_pred | 0.813 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.684 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.368 |
| disc/disc_entropy | 0.665 |
| disc/disc_loss | 0.615 |
| disc/disc_proportion_expert_pred | 0.816 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.688 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.376 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.617 |
| disc/disc_proportion_expert_pred | 0.812 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.687 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.373 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.616 |
| disc/disc_proportion_expert_pred | 0.813 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.684 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.368 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.619 |
| disc/disc_proportion_expert_pred | 0.816 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.677 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.353 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.62 |
| disc/disc_proportion_expert_pred | 0.823 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.683 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.366 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.619 |
| disc/disc_proportion_expert_pred | 0.817 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.69 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.38 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.614 |
| disc/disc_proportion_expert_pred | 0.81 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.69 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.381 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.615 |
| disc/disc_proportion_expert_pred | 0.81 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.688 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.377 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.617 |
| disc/disc_proportion_expert_pred | 0.812 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.706 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.412 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.61 |
| disc/disc_proportion_expert_pred | 0.794 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.697 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.395 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.613 |
| disc/disc_proportion_expert_pred | 0.803 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.706 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.412 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.608 |
| disc/disc_proportion_expert_pred | 0.794 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.713 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.426 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.607 |
| disc/disc_proportion_expert_pred | 0.787 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.705 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.409 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.605 |
| disc/disc_proportion_expert_pred | 0.795 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
---------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.692 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.383 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.614 |
| disc/disc_proportion_expert_pred | 0.808 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 2 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 34.6 |
| gen/rollout/ep_rew_wrapped_mean | -525 |
| gen/time/fps | 3.54e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 3.28e+04 |
| gen/train/approx_kl | 0.0011 |
| gen/train/clip_fraction | 0.00289 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.691 |
| gen/train/explained_variance | 0.178 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 171 |
| gen/train/n_updates | 10 |
| gen/train/policy_gradient_loss | -7.06e-06 |
| gen/train/value_loss | 4.82e+03 |
---------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 35.4 |
| gen/rollout/ep_rew_wrapped_mean | -1.47e+03 |
| gen/time/fps | 3526 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 49152 |
| gen/train/approx_kl | 0.0010964434 |
| gen/train/clip_fraction | 0.00289 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.691 |
| gen/train/explained_variance | 0.178 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 171 |
| gen/train/n_updates | 10 |
| gen/train/policy_gradient_loss | -7.06e-06 |
| gen/train/value_loss | 4.82e+03 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.707 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.413 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.633 |
| disc/disc_proportion_expert_pred | 0.793 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.711 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.422 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.631 |
| disc/disc_proportion_expert_pred | 0.789 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.721 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.442 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.63 |
| disc/disc_proportion_expert_pred | 0.779 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.719 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.438 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.629 |
| disc/disc_proportion_expert_pred | 0.781 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.726 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.451 |
| disc/disc_entropy | 0.673 |
| disc/disc_loss | 0.626 |
| disc/disc_proportion_expert_pred | 0.774 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.729 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.458 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.623 |
| disc/disc_proportion_expert_pred | 0.771 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.737 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.474 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.62 |
| disc/disc_proportion_expert_pred | 0.763 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.749 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.497 |
| disc/disc_entropy | 0.675 |
| disc/disc_loss | 0.615 |
| disc/disc_proportion_expert_pred | 0.751 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.743 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.485 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.618 |
| disc/disc_proportion_expert_pred | 0.757 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.74 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.479 |
| disc/disc_entropy | 0.675 |
| disc/disc_loss | 0.617 |
| disc/disc_proportion_expert_pred | 0.76 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.752 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.504 |
| disc/disc_entropy | 0.675 |
| disc/disc_loss | 0.611 |
| disc/disc_proportion_expert_pred | 0.748 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.764 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.528 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.609 |
| disc/disc_proportion_expert_pred | 0.736 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.758 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.516 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.61 |
| disc/disc_proportion_expert_pred | 0.742 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.759 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.519 |
| disc/disc_entropy | 0.675 |
| disc/disc_loss | 0.609 |
| disc/disc_proportion_expert_pred | 0.741 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.769 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.537 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.604 |
| disc/disc_proportion_expert_pred | 0.731 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.778 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.555 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.599 |
| disc/disc_proportion_expert_pred | 0.722 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
---------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.741 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.482 |
| disc/disc_entropy | 0.674 |
| disc/disc_loss | 0.618 |
| disc/disc_proportion_expert_pred | 0.759 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 3 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 35.4 |
| gen/rollout/ep_rew_wrapped_mean | -1.47e+03 |
| gen/time/fps | 3.53e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 4.92e+04 |
| gen/train/approx_kl | 0.00162 |
| gen/train/clip_fraction | 0.0488 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.691 |
| gen/train/explained_variance | 0.66 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 89.1 |
| gen/train/n_updates | 15 |
| gen/train/policy_gradient_loss | -0.00034 |
| gen/train/value_loss | 1.37e+03 |
---------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 38.2 |
| gen/rollout/ep_rew_wrapped_mean | -1.52e+03 |
| gen/time/fps | 3447 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 65536 |
| gen/train/approx_kl | 0.0016218722 |
| gen/train/clip_fraction | 0.0488 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.691 |
| gen/train/explained_variance | 0.66 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 89.1 |
| gen/train/n_updates | 15 |
| gen/train/policy_gradient_loss | -0.00034 |
| gen/train/value_loss | 1.37e+03 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.782 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.564 |
| disc/disc_entropy | 0.666 |
| disc/disc_loss | 0.62 |
| disc/disc_proportion_expert_pred | 0.718 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.799 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.599 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.614 |
| disc/disc_proportion_expert_pred | 0.701 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.787 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.574 |
| disc/disc_entropy | 0.667 |
| disc/disc_loss | 0.616 |
| disc/disc_proportion_expert_pred | 0.713 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.789 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.577 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.616 |
| disc/disc_proportion_expert_pred | 0.711 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.79 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.58 |
| disc/disc_entropy | 0.668 |
| disc/disc_loss | 0.612 |
| disc/disc_proportion_expert_pred | 0.71 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.812 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.623 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.604 |
| disc/disc_proportion_expert_pred | 0.688 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.806 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.612 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.6 |
| disc/disc_proportion_expert_pred | 0.694 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.798 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.597 |
| disc/disc_entropy | 0.671 |
| disc/disc_loss | 0.605 |
| disc/disc_proportion_expert_pred | 0.702 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.804 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.609 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.6 |
| disc/disc_proportion_expert_pred | 0.696 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.818 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.636 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.593 |
| disc/disc_proportion_expert_pred | 0.682 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.813 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.626 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.593 |
| disc/disc_proportion_expert_pred | 0.687 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.826 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.651 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.588 |
| disc/disc_proportion_expert_pred | 0.674 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.824 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.647 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.585 |
| disc/disc_proportion_expert_pred | 0.676 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.83 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.66 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.581 |
| disc/disc_proportion_expert_pred | 0.67 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.829 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.658 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.583 |
| disc/disc_proportion_expert_pred | 0.671 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.834 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.668 |
| disc/disc_entropy | 0.67 |
| disc/disc_loss | 0.579 |
| disc/disc_proportion_expert_pred | 0.666 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
---------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.809 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.618 |
| disc/disc_entropy | 0.669 |
| disc/disc_loss | 0.599 |
| disc/disc_proportion_expert_pred | 0.691 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 4 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 38.2 |
| gen/rollout/ep_rew_wrapped_mean | -1.52e+03 |
| gen/time/fps | 3.45e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 6.55e+04 |
| gen/train/approx_kl | 0.00297 |
| gen/train/clip_fraction | 0.146 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.687 |
| gen/train/explained_variance | 0.877 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 4.76 |
| gen/train/n_updates | 20 |
| gen/train/policy_gradient_loss | -0.00277 |
| gen/train/value_loss | 266 |
---------------------------------------------------
-----------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.5 |
| gen/rollout/ep_rew_wrapped_mean | -1.69e+03 |
| gen/time/fps | 3531 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 81920 |
| gen/train/approx_kl | 0.0029702676 |
| gen/train/clip_fraction | 0.146 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.687 |
| gen/train/explained_variance | 0.877 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 4.76 |
| gen/train/n_updates | 20 |
| gen/train/policy_gradient_loss | -0.00277 |
| gen/train/value_loss | 266 |
-----------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.712 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.426 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.646 |
| disc/disc_proportion_expert_pred | 0.786 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.716 |
| disc/disc_acc_expert | 0.996 |
| disc/disc_acc_gen | 0.435 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.647 |
| disc/disc_proportion_expert_pred | 0.781 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.728 |
| disc/disc_acc_expert | 0.999 |
| disc/disc_acc_gen | 0.457 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.644 |
| disc/disc_proportion_expert_pred | 0.771 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.749 |
| disc/disc_acc_expert | 1 |
| disc/disc_acc_gen | 0.498 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.643 |
| disc/disc_proportion_expert_pred | 0.751 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.767 |
| disc/disc_acc_expert | 0.997 |
| disc/disc_acc_gen | 0.537 |
| disc/disc_entropy | 0.684 |
| disc/disc_loss | 0.637 |
| disc/disc_proportion_expert_pred | 0.73 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.797 |
| disc/disc_acc_expert | 0.997 |
| disc/disc_acc_gen | 0.597 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.63 |
| disc/disc_proportion_expert_pred | 0.7 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.807 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.617 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.63 |
| disc/disc_proportion_expert_pred | 0.691 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.844 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.69 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.622 |
| disc/disc_proportion_expert_pred | 0.654 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.845 |
| disc/disc_acc_expert | 0.999 |
| disc/disc_acc_gen | 0.692 |
| disc/disc_entropy | 0.683 |
| disc/disc_loss | 0.619 |
| disc/disc_proportion_expert_pred | 0.653 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.863 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.729 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.614 |
| disc/disc_proportion_expert_pred | 0.635 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.866 |
| disc/disc_acc_expert | 0.999 |
| disc/disc_acc_gen | 0.733 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.611 |
| disc/disc_proportion_expert_pred | 0.633 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.866 |
| disc/disc_acc_expert | 0.999 |
| disc/disc_acc_gen | 0.733 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.609 |
| disc/disc_proportion_expert_pred | 0.633 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.876 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.754 |
| disc/disc_entropy | 0.681 |
| disc/disc_loss | 0.605 |
| disc/disc_proportion_expert_pred | 0.622 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.891 |
| disc/disc_acc_expert | 0.999 |
| disc/disc_acc_gen | 0.784 |
| disc/disc_entropy | 0.68 |
| disc/disc_loss | 0.599 |
| disc/disc_proportion_expert_pred | 0.608 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.893 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.788 |
| disc/disc_entropy | 0.68 |
| disc/disc_loss | 0.598 |
| disc/disc_proportion_expert_pred | 0.605 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.893 |
| disc/disc_acc_expert | 0.997 |
| disc/disc_acc_gen | 0.788 |
| disc/disc_entropy | 0.68 |
| disc/disc_loss | 0.597 |
| disc/disc_proportion_expert_pred | 0.604 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
---------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.82 |
| disc/disc_acc_expert | 0.998 |
| disc/disc_acc_gen | 0.641 |
| disc/disc_entropy | 0.682 |
| disc/disc_loss | 0.622 |
| disc/disc_proportion_expert_pred | 0.678 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 5 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 40.5 |
| gen/rollout/ep_rew_wrapped_mean | -1.69e+03 |
| gen/time/fps | 3.53e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 8.19e+04 |
| gen/train/approx_kl | 0.00237 |
| gen/train/clip_fraction | 0.136 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.799 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 12.1 |
| gen/train/n_updates | 25 |
| gen/train/policy_gradient_loss | -0.00326 |
| gen/train/value_loss | 37.5 |
---------------------------------------------------
---------------------------------------------------
| raw/ | |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 43.8 |
| gen/rollout/ep_rew_wrapped_mean | -1.38e+03 |
| gen/time/fps | 3527 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 98304 |
| gen/train/approx_kl | 0.00236941 |
| gen/train/clip_fraction | 0.136 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.686 |
| gen/train/explained_variance | 0.799 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 12.1 |
| gen/train/n_updates | 25 |
| gen/train/policy_gradient_loss | -0.00326 |
| gen/train/value_loss | 37.5 |
---------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.955 |
| disc/disc_acc_expert | 0.94 |
| disc/disc_acc_gen | 0.971 |
| disc/disc_entropy | 0.646 |
| disc/disc_loss | 0.515 |
| disc/disc_proportion_expert_pred | 0.485 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.952 |
| disc/disc_acc_expert | 0.935 |
| disc/disc_acc_gen | 0.97 |
| disc/disc_entropy | 0.645 |
| disc/disc_loss | 0.514 |
| disc/disc_proportion_expert_pred | 0.483 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.942 |
| disc/disc_acc_expert | 0.917 |
| disc/disc_acc_gen | 0.967 |
| disc/disc_entropy | 0.646 |
| disc/disc_loss | 0.515 |
| disc/disc_proportion_expert_pred | 0.475 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.936 |
| disc/disc_acc_expert | 0.908 |
| disc/disc_acc_gen | 0.964 |
| disc/disc_entropy | 0.644 |
| disc/disc_loss | 0.513 |
| disc/disc_proportion_expert_pred | 0.472 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.935 |
| disc/disc_acc_expert | 0.906 |
| disc/disc_acc_gen | 0.965 |
| disc/disc_entropy | 0.645 |
| disc/disc_loss | 0.513 |
| disc/disc_proportion_expert_pred | 0.47 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.954 |
| disc/disc_acc_expert | 0.936 |
| disc/disc_acc_gen | 0.971 |
| disc/disc_entropy | 0.643 |
| disc/disc_loss | 0.509 |
| disc/disc_proportion_expert_pred | 0.482 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.948 |
| disc/disc_acc_expert | 0.93 |
| disc/disc_acc_gen | 0.966 |
| disc/disc_entropy | 0.642 |
| disc/disc_loss | 0.508 |
| disc/disc_proportion_expert_pred | 0.482 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.956 |
| disc/disc_acc_expert | 0.944 |
| disc/disc_acc_gen | 0.968 |
| disc/disc_entropy | 0.642 |
| disc/disc_loss | 0.506 |
| disc/disc_proportion_expert_pred | 0.488 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.958 |
| disc/disc_acc_expert | 0.957 |
| disc/disc_acc_gen | 0.958 |
| disc/disc_entropy | 0.639 |
| disc/disc_loss | 0.503 |
| disc/disc_proportion_expert_pred | 0.499 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.963 |
| disc/disc_acc_expert | 0.959 |
| disc/disc_acc_gen | 0.967 |
| disc/disc_entropy | 0.639 |
| disc/disc_loss | 0.501 |
| disc/disc_proportion_expert_pred | 0.496 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.963 |
| disc/disc_acc_expert | 0.965 |
| disc/disc_acc_gen | 0.961 |
| disc/disc_entropy | 0.639 |
| disc/disc_loss | 0.5 |
| disc/disc_proportion_expert_pred | 0.502 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.97 |
| disc/disc_acc_expert | 0.973 |
| disc/disc_acc_gen | 0.968 |
| disc/disc_entropy | 0.636 |
| disc/disc_loss | 0.495 |
| disc/disc_proportion_expert_pred | 0.502 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.973 |
| disc/disc_acc_expert | 0.979 |
| disc/disc_acc_gen | 0.966 |
| disc/disc_entropy | 0.637 |
| disc/disc_loss | 0.497 |
| disc/disc_proportion_expert_pred | 0.507 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.97 |
| disc/disc_acc_expert | 0.971 |
| disc/disc_acc_gen | 0.969 |
| disc/disc_entropy | 0.634 |
| disc/disc_loss | 0.493 |
| disc/disc_proportion_expert_pred | 0.501 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.978 |
| disc/disc_acc_expert | 0.988 |
| disc/disc_acc_gen | 0.968 |
| disc/disc_entropy | 0.635 |
| disc/disc_loss | 0.494 |
| disc/disc_proportion_expert_pred | 0.51 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/ | |
| disc/disc_acc | 0.976 |
| disc/disc_acc_expert | 0.984 |
| disc/disc_acc_gen | 0.967 |
| disc/disc_entropy | 0.634 |
| disc/disc_loss | 0.49 |
| disc/disc_proportion_expert_pred | 0.509 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
--------------------------------------------------
---------------------------------------------------
| mean/ | |
| disc/disc_acc | 0.958 |
| disc/disc_acc_expert | 0.95 |
| disc/disc_acc_gen | 0.967 |
| disc/disc_entropy | 0.64 |
| disc/disc_loss | 0.504 |
| disc/disc_proportion_expert_pred | 0.491 |
| disc/disc_proportion_expert_true | 0.5 |
| disc/global_step | 6 |
| disc/n_expert | 2.05e+03 |
| disc/n_generated | 2.05e+03 |
| gen/rollout/ep_len_mean | 500 |
| gen/rollout/ep_rew_mean | 43.8 |
| gen/rollout/ep_rew_wrapped_mean | -1.38e+03 |
| gen/time/fps | 3.53e+03 |
| gen/time/iterations | 1 |
| gen/time/time_elapsed | 4 |
| gen/time/total_timesteps | 9.83e+04 |
| gen/train/approx_kl | 0.002 |
| gen/train/clip_fraction | 0.0806 |
| gen/train/clip_range | 0.1 |
| gen/train/entropy_loss | -0.684 |
| gen/train/explained_variance | 0.47 |
| gen/train/learning_rate | 0.0005 |
| gen/train/loss | 2.66 |
| gen/train/n_updates | 30 |
| gen/train/policy_gradient_loss | -0.00117 |
| gen/train/value_loss | 94.9 |
---------------------------------------------------
The printed rewards below show how far this short training run gets: with the FAST settings the learner may not yet improve on the untrained policy, let alone match the expert. To make the learner match the expert performance (500), set the flag FAST to False in the first cell.
print(
"Rewards before training:",
np.mean(learner_rewards_before_training),
"+/-",
np.std(learner_rewards_before_training),
)
print(
"Rewards after training:",
np.mean(learner_rewards_after_training),
"+/-",
np.std(learner_rewards_after_training),
)
Rewards before training: 102.6 +/- 24.11514047232568
Rewards after training: 43.02 +/- 3.4379645140693347
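For reference, the per-episode reward arrays compared above were collected earlier in the notebook; the sketch below shows one way such arrays can be gathered with SB3's evaluate_policy. It is illustrative only (the variable names mirror the printout above, and learner and venv are assumed to be the trained policy and the evaluation environment with the original reward from earlier cells).
# Illustrative sketch only: collecting per-episode returns with SB3's evaluate_policy.
from stable_baselines3.common.evaluation import evaluate_policy
learner_rewards_after_training, _ = evaluate_policy(
    learner,  # trained policy from the cells above (assumed)
    venv,  # environment with the original reward (assumed)
    n_eval_episodes=100,
    return_episode_rewards=True,  # return a list of episode returns, not just the mean
)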
Learning a Reward Function using Preference Comparisons#
The preference comparisons algorithm learns a reward function from preferences between pairs of trajectory segments.
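Under the hood, the reward model is fit with a Bradley-Terry style model: the probability that one segment is preferred over another is (roughly) a logistic function of the difference between the segments' summed predicted rewards, and the cross-entropy loss used below is the negative log-likelihood of the gathered preferences under this model. The following is a minimal, illustrative sketch of that preference probability, not imitation's internal code.
# Illustrative sketch: Bradley-Terry style preference probability from predicted rewards.
import numpy as np
def preference_probability(rewards_1: np.ndarray, rewards_2: np.ndarray) -> float:
    """P(segment 1 preferred over segment 2) from per-step predicted rewards."""
    return_1, return_2 = rewards_1.sum(), rewards_2.sum()
    # Softmax over the two summed returns, i.e. a logistic of their difference.
    return 1.0 / (1.0 + np.exp(return_2 - return_1))
# A segment with higher predicted return is preferred with higher probability.
print(preference_probability(np.array([1.0, 0.5]), np.array([0.2, 0.1])))  # ~0.77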
To set up the preference comparisons algorithm, we first need to construct several of its internal components:
import random
from imitation.algorithms import preference_comparisons
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
import gymnasium as gym
from stable_baselines3 import PPO
import numpy as np
rng = np.random.default_rng(0)
venv = make_vec_env("Pendulum-v1", rng=rng)
reward_net = BasicRewardNet(
venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)
fragmenter = preference_comparisons.RandomFragmenter(
warning_threshold=0,
rng=rng,
)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
preference_model=preference_model,
loss=preference_comparisons.CrossEntropyRewardLoss(),
epochs=3,
rng=rng,
)
# Several hyperparameters (reward_epochs, ppo_clip_range, ppo_ent_coef,
# ppo_gae_lambda, ppo_n_epochs, discount_factor, use_sde, sde_sample_freq,
# ppo_lr, exploration_frac, num_iterations, initial_comparison_frac,
# initial_epoch_multiplier, query_schedule) used in this example have been
# approximately fine-tuned to reach a reasonable level of performance.
agent = PPO(
policy=FeedForward32Policy,
policy_kwargs=dict(
features_extractor_class=NormalizeFeaturesExtractor,
features_extractor_kwargs=dict(normalize_class=RunningNorm),
),
env=venv,
seed=0,
n_steps=2048 // venv.num_envs,
batch_size=64,
ent_coef=0.01,
learning_rate=2e-3,
clip_range=0.1,
gae_lambda=0.95,
gamma=0.97,
n_epochs=10,
)
trajectory_generator = preference_comparisons.AgentTrainer(
algorithm=agent,
reward_fn=reward_net,
venv=venv,
exploration_frac=0.05,
rng=rng,
)
pref_comparisons = preference_comparisons.PreferenceComparisons(
trajectory_generator,
reward_net,
num_iterations=5, # Set to 60 for better performance
fragmenter=fragmenter,
preference_gatherer=gatherer,
reward_trainer=reward_trainer,
fragment_length=100,
transition_oversampling=1,
initial_comparison_frac=0.1,
allow_variable_horizon=False,
initial_epoch_multiplier=4,
query_schedule="hyperbolic",
)
Then we can start training the reward model. Note that we need to specify the total number of timesteps to train the agent for, as well as the total number of fragment comparisons to gather.
pref_comparisons.train(
total_timesteps=5_000,
total_comparisons=200,
)
Query schedule: [20, 51, 41, 34, 29, 25]
Collecting 40 fragments (4000 transitions)
Requested 3800 transitions but only 0 in buffer. Sampling 3800 additional transitions.
Sampling 200 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 20 comparisons
Training agent for 1000 timesteps
---------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.2e+03 |
| agent/rollout/ep_rew_wrapped_mean | 32.6 |
| agent/time/fps | 3838 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 2048 |
---------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.2e+03 |
| agent/rollout/ep_rew_wrapped_mean | 32.6 |
| agent/time/fps | 3.84e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 2.05e+03 |
| agent/train/approx_kl | 0.00269 |
| agent/train/clip_fraction | 0.114 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.44 |
| agent/train/explained_variance | -0.322 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.13 |
| agent/train/n_updates | 10 |
| agent/train/policy_gradient_loss | -0.00243 |
| agent/train/std | 1.03 |
| agent/train/value_loss | 1.27 |
| preferences/entropy | 0.0307 |
| reward/epoch-0/train/accuracy | 0.15 |
| reward/epoch-0/train/gt_reward_loss | 0.0639 |
| reward/epoch-0/train/loss | 3.79 |
| reward/epoch-1/train/accuracy | 0.2 |
| reward/epoch-1/train/gt_reward_loss | 0.0639 |
| reward/epoch-1/train/loss | 3.43 |
| reward/epoch-10/train/accuracy | 0.85 |
| reward/epoch-10/train/gt_reward_loss | 0.0639 |
| reward/epoch-10/train/loss | 0.25 |
| reward/epoch-11/train/accuracy | 0.85 |
| reward/epoch-11/train/gt_reward_loss | 0.0639 |
| reward/epoch-11/train/loss | 0.227 |
| reward/epoch-2/train/accuracy | 0.3 |
| reward/epoch-2/train/gt_reward_loss | 0.0639 |
| reward/epoch-2/train/loss | 2.58 |
| reward/epoch-3/train/accuracy | 0.35 |
| reward/epoch-3/train/gt_reward_loss | 0.0639 |
| reward/epoch-3/train/loss | 1.98 |
| reward/epoch-4/train/accuracy | 0.35 |
| reward/epoch-4/train/gt_reward_loss | 0.0639 |
| reward/epoch-4/train/loss | 1.39 |
| reward/epoch-5/train/accuracy | 0.55 |
| reward/epoch-5/train/gt_reward_loss | 0.0639 |
| reward/epoch-5/train/loss | 0.9 |
| reward/epoch-6/train/accuracy | 0.75 |
| reward/epoch-6/train/gt_reward_loss | 0.0639 |
| reward/epoch-6/train/loss | 0.601 |
| reward/epoch-7/train/accuracy | 0.75 |
| reward/epoch-7/train/gt_reward_loss | 0.0639 |
| reward/epoch-7/train/loss | 0.436 |
| reward/epoch-8/train/accuracy | 0.75 |
| reward/epoch-8/train/gt_reward_loss | 0.0639 |
| reward/epoch-8/train/loss | 0.343 |
| reward/epoch-9/train/accuracy | 0.8 |
| reward/epoch-9/train/gt_reward_loss | 0.0639 |
| reward/epoch-9/train/loss | 0.286 |
| reward/ | |
| final/train/accuracy | 0.85 |
| final/train/gt_reward_loss | 0.0639 |
| final/train/loss | 0.227 |
------------------------------------------------------
Collecting 102 fragments (10200 transitions)
Requested 9690 transitions but only 1600 in buffer. Sampling 8090 additional transitions.
Sampling 510 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 71 comparisons
Training agent for 1000 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.13e+03 |
| agent/rollout/ep_rew_wrapped_mean | 47 |
| agent/time/fps | 3863 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 4096 |
| agent/train/approx_kl | 0.0026872663 |
| agent/train/clip_fraction | 0.114 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.44 |
| agent/train/explained_variance | -0.322 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.13 |
| agent/train/n_updates | 10 |
| agent/train/policy_gradient_loss | -0.00243 |
| agent/train/std | 1.03 |
| agent/train/value_loss | 1.27 |
-------------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.13e+03 |
| agent/rollout/ep_rew_wrapped_mean | 47 |
| agent/time/fps | 3.86e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 4.1e+03 |
| agent/train/approx_kl | 0.00058 |
| agent/train/clip_fraction | 0.0301 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.46 |
| agent/train/explained_variance | 0.436 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.112 |
| agent/train/n_updates | 20 |
| agent/train/policy_gradient_loss | -0.000273 |
| agent/train/std | 1.05 |
| agent/train/value_loss | 0.588 |
| preferences/entropy | 0.00161 |
| reward/epoch-0/train/accuracy | 0.838 |
| reward/epoch-0/train/gt_reward_loss | 0.0135 |
| reward/epoch-0/train/loss | 0.35 |
| reward/epoch-1/train/accuracy | 0.906 |
| reward/epoch-1/train/gt_reward_loss | 0.0135 |
| reward/epoch-1/train/loss | 0.253 |
| reward/epoch-2/train/accuracy | 0.879 |
| reward/epoch-2/train/gt_reward_loss | 0.0135 |
| reward/epoch-2/train/loss | 0.315 |
| reward/ | |
| final/train/accuracy | 0.879 |
| final/train/gt_reward_loss | 0.0135 |
| final/train/loss | 0.315 |
------------------------------------------------------
Collecting 82 fragments (8200 transitions)
Requested 7790 transitions but only 1600 in buffer. Sampling 6190 additional transitions.
Sampling 410 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 112 comparisons
Training agent for 1000 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.16e+03 |
| agent/rollout/ep_rew_wrapped_mean | 56 |
| agent/time/fps | 3862 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 6144 |
| agent/train/approx_kl | 0.0005802552 |
| agent/train/clip_fraction | 0.0301 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.46 |
| agent/train/explained_variance | 0.436 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.112 |
| agent/train/n_updates | 20 |
| agent/train/policy_gradient_loss | -0.000273 |
| agent/train/std | 1.05 |
| agent/train/value_loss | 0.588 |
-------------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.16e+03 |
| agent/rollout/ep_rew_wrapped_mean | 56 |
| agent/time/fps | 3.86e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 6.14e+03 |
| agent/train/approx_kl | 0.00145 |
| agent/train/clip_fraction | 0.0564 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.47 |
| agent/train/explained_variance | 0.694 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.0207 |
| agent/train/n_updates | 30 |
| agent/train/policy_gradient_loss | -0.00223 |
| agent/train/std | 1.05 |
| agent/train/value_loss | 0.198 |
| preferences/entropy | 0.000825 |
| reward/epoch-0/train/accuracy | 0.914 |
| reward/epoch-0/train/gt_reward_loss | 0.0102 |
| reward/epoch-0/train/loss | 0.201 |
| reward/epoch-1/train/accuracy | 0.938 |
| reward/epoch-1/train/gt_reward_loss | 0.0102 |
| reward/epoch-1/train/loss | 0.148 |
| reward/epoch-2/train/accuracy | 0.945 |
| reward/epoch-2/train/gt_reward_loss | 0.0101 |
| reward/epoch-2/train/loss | 0.126 |
| reward/ | |
| final/train/accuracy | 0.945 |
| final/train/gt_reward_loss | 0.0101 |
| final/train/loss | 0.126 |
------------------------------------------------------
Collecting 68 fragments (6800 transitions)
Requested 6460 transitions but only 1600 in buffer. Sampling 4860 additional transitions.
Sampling 340 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 146 comparisons
Training agent for 1000 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.19e+03 |
| agent/rollout/ep_rew_wrapped_mean | 57.8 |
| agent/time/fps | 3837 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 8192 |
| agent/train/approx_kl | 0.0014491911 |
| agent/train/clip_fraction | 0.0564 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.47 |
| agent/train/explained_variance | 0.694 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.0207 |
| agent/train/n_updates | 30 |
| agent/train/policy_gradient_loss | -0.00223 |
| agent/train/std | 1.05 |
| agent/train/value_loss | 0.198 |
-------------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.19e+03 |
| agent/rollout/ep_rew_wrapped_mean | 57.8 |
| agent/time/fps | 3.84e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 8.19e+03 |
| agent/train/approx_kl | 0.00167 |
| agent/train/clip_fraction | 0.0817 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.47 |
| agent/train/explained_variance | 0.89 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.00424 |
| agent/train/n_updates | 40 |
| agent/train/policy_gradient_loss | -0.00385 |
| agent/train/std | 1.06 |
| agent/train/value_loss | 0.13 |
| preferences/entropy | 0.0186 |
| reward/epoch-0/train/accuracy | 0.947 |
| reward/epoch-0/train/gt_reward_loss | 0.0168 |
| reward/epoch-0/train/loss | 0.13 |
| reward/epoch-1/train/accuracy | 0.958 |
| reward/epoch-1/train/gt_reward_loss | 0.0106 |
| reward/epoch-1/train/loss | 0.13 |
| reward/epoch-2/train/accuracy | 0.958 |
| reward/epoch-2/train/gt_reward_loss | 0.0125 |
| reward/epoch-2/train/loss | 0.12 |
| reward/ | |
| final/train/accuracy | 0.958 |
| final/train/gt_reward_loss | 0.0125 |
| final/train/loss | 0.12 |
------------------------------------------------------
Collecting 58 fragments (5800 transitions)
Requested 5510 transitions but only 1600 in buffer. Sampling 3910 additional transitions.
Sampling 290 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 175 comparisons
Training agent for 1000 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.21e+03 |
| agent/rollout/ep_rew_wrapped_mean | 57.7 |
| agent/time/fps | 3823 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 10240 |
| agent/train/approx_kl | 0.0016703831 |
| agent/train/clip_fraction | 0.0817 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.47 |
| agent/train/explained_variance | 0.89 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.00424 |
| agent/train/n_updates | 40 |
| agent/train/policy_gradient_loss | -0.00385 |
| agent/train/std | 1.06 |
| agent/train/value_loss | 0.13 |
-------------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.21e+03 |
| agent/rollout/ep_rew_wrapped_mean | 57.7 |
| agent/time/fps | 3.82e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 1.02e+04 |
| agent/train/approx_kl | 0.00467 |
| agent/train/clip_fraction | 0.202 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.5 |
| agent/train/explained_variance | 0.946 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.0124 |
| agent/train/n_updates | 50 |
| agent/train/policy_gradient_loss | -0.0108 |
| agent/train/std | 1.08 |
| agent/train/value_loss | 0.129 |
| preferences/entropy | 0.00135 |
| reward/epoch-0/train/accuracy | 0.947 |
| reward/epoch-0/train/gt_reward_loss | 0.00886 |
| reward/epoch-0/train/loss | 0.115 |
| reward/epoch-1/train/accuracy | 0.958 |
| reward/epoch-1/train/gt_reward_loss | 0.00886 |
| reward/epoch-1/train/loss | 0.0979 |
| reward/epoch-2/train/accuracy | 0.969 |
| reward/epoch-2/train/gt_reward_loss | 0.0112 |
| reward/epoch-2/train/loss | 0.102 |
| reward/ | |
| final/train/accuracy | 0.969 |
| final/train/gt_reward_loss | 0.0112 |
| final/train/loss | 0.102 |
------------------------------------------------------
Collecting 50 fragments (5000 transitions)
Requested 4750 transitions but only 1600 in buffer. Sampling 3150 additional transitions.
Sampling 250 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 200 comparisons
Training agent for 1000 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.21e+03 |
| agent/rollout/ep_rew_wrapped_mean | 56.4 |
| agent/time/fps | 3839 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 12288 |
| agent/train/approx_kl | 0.0046693617 |
| agent/train/clip_fraction | 0.202 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.5 |
| agent/train/explained_variance | 0.946 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.0124 |
| agent/train/n_updates | 50 |
| agent/train/policy_gradient_loss | -0.0108 |
| agent/train/std | 1.08 |
| agent/train/value_loss | 0.129 |
-------------------------------------------------------
------------------------------------------------------
| mean/ | |
| agent/rollout/ep_len_mean | 200 |
| agent/rollout/ep_rew_mean | -1.21e+03 |
| agent/rollout/ep_rew_wrapped_mean | 56.4 |
| agent/time/fps | 3.84e+03 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 1.23e+04 |
| agent/train/approx_kl | 0.00137 |
| agent/train/clip_fraction | 0.0687 |
| agent/train/clip_range | 0.1 |
| agent/train/entropy_loss | -1.5 |
| agent/train/explained_variance | 0.971 |
| agent/train/learning_rate | 0.002 |
| agent/train/loss | 0.21 |
| agent/train/n_updates | 60 |
| agent/train/policy_gradient_loss | -0.00218 |
| agent/train/std | 1.09 |
| agent/train/value_loss | 0.144 |
| preferences/entropy | 0.000231 |
| reward/epoch-0/train/accuracy | 0.969 |
| reward/epoch-0/train/gt_reward_loss | 0.00759 |
| reward/epoch-0/train/loss | 0.116 |
| reward/epoch-1/train/accuracy | 0.964 |
| reward/epoch-1/train/gt_reward_loss | 0.00759 |
| reward/epoch-1/train/loss | 0.0973 |
| reward/epoch-2/train/accuracy | 0.955 |
| reward/epoch-2/train/gt_reward_loss | 0.00765 |
| reward/epoch-2/train/loss | 0.114 |
| reward/ | |
| final/train/accuracy | 0.955 |
| final/train/gt_reward_loss | 0.00765 |
| final/train/loss | 0.114 |
------------------------------------------------------
{'reward_loss': 0.11424736359289714, 'reward_accuracy': 0.9553571428571429}
Now that we have trained the reward network using the preference comparisons algorithm, we can wrap our environment with that learned reward.
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict_processed)
Next, we train an agent that only sees the learned reward.
learner = PPO(
seed=0,
policy=FeedForward32Policy,
policy_kwargs=dict(
features_extractor_class=NormalizeFeaturesExtractor,
features_extractor_kwargs=dict(normalize_class=RunningNorm),
),
env=learned_reward_venv,
batch_size=64,
ent_coef=0.01,
n_epochs=10,
n_steps=2048 // learned_reward_venv.num_envs,
clip_range=0.1,
gae_lambda=0.95,
gamma=0.97,
learning_rate=2e-3,
)
learner.learn(1_000) # Note: set to 100_000 to train a proficient expert
<stable_baselines3.ppo.ppo.PPO at 0x7f90e10510d0>
Then we can evaluate it using the original reward.
from stable_baselines3.common.evaluation import evaluate_policy
n_eval_episodes = 10
reward_mean, reward_std = evaluate_policy(learner.policy, venv, n_eval_episodes)
reward_stderr = reward_std / np.sqrt(n_eval_episodes)
print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")
Reward: -1348 +/- 114
Learning a Reward Function using Preference Comparisons on Atari#
In this case, we will use a convolutional neural network for our policy and reward model. We will also shape the learned reward model with the policy’s learned value function, since these shaped rewards are more informative for training: they incentivize the agent to move towards high-value states. In the interest of execution time, we will only do a small amount of training here, much less than in the previous preference comparison notebook. To run this notebook, be sure to install the atari extras, for example by running pip install imitation[atari].
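Shaping with a value-function potential follows the standard potential-based form: the shaped reward is the base reward plus the discounted potential of the next state minus the potential of the current state. Below is a tiny, purely illustrative sketch of that formula with hypothetical numbers; it is not imitation's ShapedRewardNet implementation, which is used later in this notebook.
# Illustrative sketch of potential-based reward shaping (Ng et al., 1999).
def shaped_reward(reward, potential_s, potential_s_next, discount=0.99):
    return reward + discount * potential_s_next - potential_s
# Hypothetical numbers: moving towards a higher-value state adds a bonus.
print(shaped_reward(reward=0.0, potential_s=1.0, potential_s_next=2.0))  # 0.98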
First, we will set up the environment, reward network, et cetera.
import torch as th
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
import numpy as np
from seals.util import AutoResetWrapper
from stable_baselines3 import PPO
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.ppo import CnnPolicy
from imitation.algorithms import preference_comparisons
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.base import NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import CnnRewardNet
device = th.device("cuda" if th.cuda.is_available() else "cpu")
rng = np.random.default_rng()
# Here we ensure that our environment has constant-length episodes by resetting
# it when done, and running until 100 timesteps have elapsed.
# For real training, you will want a much longer time limit.
def constant_length_asteroids(num_steps):
    atari_env = gym.make("AsteroidsNoFrameskip-v4")
    preprocessed_env = AtariWrapper(atari_env)
    endless_env = AutoResetWrapper(preprocessed_env)
    limited_env = TimeLimit(endless_env, max_episode_steps=num_steps)
    return RolloutInfoWrapper(limited_env)
# For real training, you will want a vectorized environment with 8 environments in parallel.
# This can be done by passing in n_envs=8 as an argument to make_vec_env.
# The seed needs to be set to 1 for reproducibility, and also to avoid a win32
# np.random.randint high-bound error.
venv = make_vec_env(constant_length_asteroids, env_kwargs={"num_steps": 100}, seed=1)
venv = VecFrameStack(venv, n_stack=4)
reward_net = CnnRewardNet(
venv.observation_space,
venv.action_space,
).to(device)
fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
preference_model=preference_model,
loss=preference_comparisons.CrossEntropyRewardLoss(),
epochs=3,
rng=rng,
)
agent = PPO(
policy=CnnPolicy,
env=venv,
seed=0,
n_steps=16, # To train on atari well, set this to 128
batch_size=16, # To train on atari well, set this to 256
ent_coef=0.01,
learning_rate=0.00025,
n_epochs=4,
)
trajectory_generator = preference_comparisons.AgentTrainer(
algorithm=agent,
reward_fn=reward_net,
venv=venv,
exploration_frac=0.0,
rng=rng,
)
pref_comparisons = preference_comparisons.PreferenceComparisons(
trajectory_generator,
reward_net,
num_iterations=2,
fragmenter=fragmenter,
preference_gatherer=gatherer,
reward_trainer=reward_trainer,
fragment_length=10,
transition_oversampling=1,
initial_comparison_frac=0.1,
allow_variable_horizon=False,
initial_epoch_multiplier=1,
)
We are now ready to train the reward model.
pref_comparisons.train(
total_timesteps=16,
total_comparisons=15,
)
Query schedule: [1, 9, 5]
Collecting 2 fragments (20 transitions)
Requested 20 transitions but only 0 in buffer. Sampling 20 additional transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 1 comparisons
Training agent for 8 timesteps
---------------------------------------------------
| raw/ | |
| agent/rollout/ep_rew_wrapped_mean | 2.91 |
| agent/time/fps | 154 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 16 |
---------------------------------------------------
-----------------------------------------------------
| mean/ | |
| agent/rollout/ep_rew_wrapped_mean | 2.91 |
| agent/time/fps | 154 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 16 |
| agent/train/approx_kl | 9.21e-05 |
| agent/train/clip_fraction | 0 |
| agent/train/clip_range | 0.2 |
| agent/train/entropy_loss | -2.64 |
| agent/train/explained_variance | -0.0826 |
| agent/train/learning_rate | 0.00025 |
| agent/train/loss | -0.0353 |
| agent/train/n_updates | 4 |
| agent/train/policy_gradient_loss | -0.00515 |
| agent/train/value_loss | 0.00965 |
| preferences/entropy | 0.693 |
| reward/epoch-0/train/accuracy | 1 |
| reward/epoch-0/train/gt_reward_loss | 0.693 |
| reward/epoch-0/train/loss | 0.435 |
| reward/epoch-1/train/accuracy | 1 |
| reward/epoch-1/train/gt_reward_loss | 0.693 |
| reward/epoch-1/train/loss | 0.393 |
| reward/epoch-2/train/accuracy | 1 |
| reward/epoch-2/train/gt_reward_loss | 0.693 |
| reward/epoch-2/train/loss | 0.355 |
| reward/ | |
| final/train/accuracy | 1 |
| final/train/gt_reward_loss | 0.693 |
| final/train/loss | 0.355 |
-----------------------------------------------------
Collecting 18 fragments (180 transitions)
Requested 180 transitions but only 0 in buffer. Sampling 180 additional transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 10 comparisons
Training agent for 8 timesteps
------------------------------------------------------
| raw/ | |
| agent/rollout/ep_rew_wrapped_mean | 3.08 |
| agent/time/fps | 157 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 32 |
| agent/train/approx_kl | 9.20929e-05 |
| agent/train/clip_fraction | 0 |
| agent/train/clip_range | 0.2 |
| agent/train/entropy_loss | -2.64 |
| agent/train/explained_variance | -0.0826 |
| agent/train/learning_rate | 0.00025 |
| agent/train/loss | -0.0353 |
| agent/train/n_updates | 4 |
| agent/train/policy_gradient_loss | -0.00515 |
| agent/train/value_loss | 0.00965 |
------------------------------------------------------
-----------------------------------------------------
| mean/ | |
| agent/rollout/ep_rew_wrapped_mean | 3.08 |
| agent/time/fps | 157 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 32 |
| agent/train/approx_kl | 0.00014 |
| agent/train/clip_fraction | 0 |
| agent/train/clip_range | 0.2 |
| agent/train/entropy_loss | -2.64 |
| agent/train/explained_variance | 0.0552 |
| agent/train/learning_rate | 0.00025 |
| agent/train/loss | -0.0351 |
| agent/train/n_updates | 8 |
| agent/train/policy_gradient_loss | -0.00595 |
| agent/train/value_loss | 0.0197 |
| preferences/entropy | 0.681 |
| reward/epoch-0/train/accuracy | 0.4 |
| reward/epoch-0/train/gt_reward_loss | 0.655 |
| reward/epoch-0/train/loss | 0.776 |
| reward/epoch-1/train/accuracy | 0.4 |
| reward/epoch-1/train/gt_reward_loss | 0.655 |
| reward/epoch-1/train/loss | 0.764 |
| reward/epoch-2/train/accuracy | 0.4 |
| reward/epoch-2/train/gt_reward_loss | 0.655 |
| reward/epoch-2/train/loss | 0.749 |
| reward/ | |
| final/train/accuracy | 0.4 |
| final/train/gt_reward_loss | 0.655 |
| final/train/loss | 0.749 |
-----------------------------------------------------
Collecting 10 fragments (100 transitions)
Requested 100 transitions but only 0 in buffer. Sampling 100 additional transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 15 comparisons
Training agent for 8 timesteps
-------------------------------------------------------
| raw/ | |
| agent/rollout/ep_rew_wrapped_mean | 2.94 |
| agent/time/fps | 163 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 48 |
| agent/train/approx_kl | 0.0001396425 |
| agent/train/clip_fraction | 0 |
| agent/train/clip_range | 0.2 |
| agent/train/entropy_loss | -2.64 |
| agent/train/explained_variance | 0.0552 |
| agent/train/learning_rate | 0.00025 |
| agent/train/loss | -0.0351 |
| agent/train/n_updates | 8 |
| agent/train/policy_gradient_loss | -0.00595 |
| agent/train/value_loss | 0.0197 |
-------------------------------------------------------
-----------------------------------------------------
| mean/ | |
| agent/rollout/ep_rew_wrapped_mean | 2.94 |
| agent/time/fps | 163 |
| agent/time/iterations | 1 |
| agent/time/time_elapsed | 0 |
| agent/time/total_timesteps | 48 |
| agent/train/approx_kl | 0.000128 |
| agent/train/clip_fraction | 0 |
| agent/train/clip_range | 0.2 |
| agent/train/entropy_loss | -2.64 |
| agent/train/explained_variance | -1.45 |
| agent/train/learning_rate | 0.00025 |
| agent/train/loss | -0.0216 |
| agent/train/n_updates | 12 |
| agent/train/policy_gradient_loss | -0.00565 |
| agent/train/value_loss | 0.0346 |
| preferences/entropy | 0.693 |
| reward/epoch-0/train/accuracy | 0.4 |
| reward/epoch-0/train/gt_reward_loss | 0.668 |
| reward/epoch-0/train/loss | 0.719 |
| reward/epoch-1/train/accuracy | 0.467 |
| reward/epoch-1/train/gt_reward_loss | 0.668 |
| reward/epoch-1/train/loss | 0.709 |
| reward/epoch-2/train/accuracy | 0.467 |
| reward/epoch-2/train/gt_reward_loss | 0.668 |
| reward/epoch-2/train/loss | 0.698 |
| reward/ | |
| final/train/accuracy | 0.467 |
| final/train/gt_reward_loss | 0.668 |
| final/train/loss | 0.698 |
-----------------------------------------------------
{'reward_loss': 0.6978113651275635, 'reward_accuracy': 0.46666666865348816}
We can now wrap the environment with the learned reward model, shaped by the policy’s learned value function. Note that if we were training this for real, we would want to normalize the output of the reward net as well as the value function, to ensure their values are on the same scale. To do this, use the NormalizedRewardNet class from src/imitation/rewards/reward_nets.py on reward_net, and modify the potential to add a RunningNorm module from src/imitation/util/networks.py.
from imitation.rewards.reward_nets import ShapedRewardNet, cnn_transpose
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
def value_potential(state):
    state_ = cnn_transpose(state)
    return agent.policy.predict_values(state_)
shaped_reward_net = ShapedRewardNet(
base=reward_net,
potential=value_potential,
discount_factor=0.99,
)
# GOTCHA: When using the NormalizedRewardNet wrapper, you should deactivate updating
# during evaluation by passing update_stats=False to the predict_processed method.
learned_reward_venv = RewardVecEnvWrapper(venv, shaped_reward_net.predict_processed)
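For reference, the first half of that normalization suggestion might look roughly like the sketch below. It is not executed in this tutorial and only wraps the base reward net; the potential would still need its own RunningNorm as described above.
import functools
from imitation.rewards.reward_nets import NormalizedRewardNet
from imitation.util.networks import RunningNorm
# Wrap the base reward net so its outputs are rescaled by running statistics.
normalized_reward_net = NormalizedRewardNet(reward_net, normalize_output_layer=RunningNorm)
# Per the GOTCHA above, freeze the statistics when the net is used for evaluation:
eval_reward_fn = functools.partial(
    normalized_reward_net.predict_processed, update_stats=False
)
# learned_reward_venv = RewardVecEnvWrapper(venv, eval_reward_fn)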
Next, we train an agent that sees only the shaped, learned reward.
learner = PPO(
policy=CnnPolicy,
env=learned_reward_venv,
seed=0,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0003,
n_epochs=10,
n_steps=64,
)
learner.learn(1000)
<stable_baselines3.ppo.ppo.PPO at 0x7f38d5f04d00>
We now evaluate the learner using the original reward.
from stable_baselines3.common.evaluation import evaluate_policy
reward, _ = evaluate_policy(learner.policy, venv, 10)
print(reward)
0.4
Generating rollouts#
When generating rollouts in image environments, be sure to use the agent’s get_env() function rather than the original environment.
The learner re-arranges the observation space so that the channel dimension comes first, and get_env() provides a correctly wrapped environment that applies this re-arrangement.
from imitation.data import rollout
rollouts = rollout.rollout(
learner,
# Note that passing venv instead of agent.get_env()
# here would fail.
learner.get_env(),
rollout.make_sample_until(min_timesteps=None, min_episodes=3),
rng=rng,
)
Learn a Reward Function using Maximum Causal Entropy Inverse Reinforcement Learning#
Here, we’re going to take a tabular environment with a pre-defined reward function, Cliffworld, and solve for the optimal policy. We then generate demonstrations from this policy, and use them to learn an approximation to the true reward function with MCE IRL. Finally, we directly compare the learned reward to the ground-truth reward (which we have access to in this example).
Cliffworld is a POMDP, and its “observations” consist of the (partial) observations proper and the (full) hidden environment state. We use the ExposePOMDPStateWrapper from seals to expose the hidden states as observations, turning it into a fully observable MDP to make computing the optimal policy easy.
from functools import partial
from seals import base_envs
from seals.diagnostics.cliff_world import CliffWorldEnv
from stable_baselines3.common.vec_env import DummyVecEnv
import numpy as np
from imitation.algorithms.mce_irl import (
MCEIRL,
mce_occupancy_measures,
mce_partition_fh,
TabularPolicy,
)
from imitation.data import rollout
from imitation.rewards import reward_nets
env_creator = partial(CliffWorldEnv, height=4, horizon=40, width=7, use_xy_obs=True)
env_single = env_creator()
state_env_creator = lambda: base_envs.ExposePOMDPStateWrapper(env_creator())
# This is just a vectorized environment because `generate_trajectories` expects one
state_venv = DummyVecEnv([state_env_creator] * 4)
Then we derive an expert policy using Bellman backups. We analytically compute the occupancy measures, and also sample some expert trajectories.
_, _, pi = mce_partition_fh(env_single)
_, om = mce_occupancy_measures(env_single, pi=pi)
rng = np.random.default_rng()
expert = TabularPolicy(
state_space=env_single.state_space,
action_space=env_single.action_space,
pi=pi,
rng=rng,
)
expert_trajs = rollout.generate_trajectories(
policy=expert,
venv=state_venv,
sample_until=rollout.make_min_timesteps(5000),
rng=rng,
)
print("Expert stats: ", rollout.rollout_stats(expert_trajs))
Expert stats: {'n_traj': 128, 'return_min': 305.0, 'return_mean': 326.7734375, 'return_std': 7.1241390310404356, 'return_max': 334.0, 'len_min': 40, 'len_mean': 40.0, 'len_std': 0.0, 'len_max': 40}
Training the reward function#
The true reward here is not linear in the reduced feature space (i.e., the \((x,y)\) coordinates). Finding an appropriate linear reward is impossible, but an MLP should Just Work™.
import matplotlib.pyplot as plt
import torch as th
def train_mce_irl(demos, hidden_sizes, lr=0.01, **kwargs):
    reward_net = reward_nets.BasicRewardNet(
        env_single.observation_space,
        env_single.action_space,
        hid_sizes=hidden_sizes,
        use_action=False,
        use_done=False,
        use_next_state=False,
    )
    mce_irl = MCEIRL(
        demos,
        env_single,
        reward_net,
        log_interval=250,
        optimizer_kwargs=dict(lr=lr),
        rng=rng,
    )
    occ_measure = mce_irl.train(**kwargs)

    imitation_trajs = rollout.generate_trajectories(
        policy=mce_irl.policy,
        venv=state_venv,
        sample_until=rollout.make_min_timesteps(5000),
        rng=rng,
    )
    print("Imitation stats: ", rollout.rollout_stats(imitation_trajs))

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    env_single.draw_value_vec(occ_measure)
    plt.title("Occupancy for learned reward")
    plt.xlabel("Gridworld x-coordinate")
    plt.ylabel("Gridworld y-coordinate")
    plt.subplot(1, 2, 2)
    _, true_occ_measure = mce_occupancy_measures(env_single)
    env_single.draw_value_vec(true_occ_measure)
    plt.title("Occupancy for true reward")
    plt.xlabel("Gridworld x-coordinate")
    plt.ylabel("Gridworld y-coordinate")
    plt.show()

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    env_single.draw_value_vec(
        reward_net(th.as_tensor(env_single.observation_matrix), None, None, None)
        .detach()
        .numpy()
    )
    plt.title("Learned reward")
    plt.xlabel("Gridworld x-coordinate")
    plt.ylabel("Gridworld y-coordinate")
    plt.subplot(1, 2, 2)
    env_single.draw_value_vec(env_single.reward_matrix)
    plt.title("True reward")
    plt.xlabel("Gridworld x-coordinate")
    plt.ylabel("Gridworld y-coordinate")
    plt.show()

    return mce_irl
As you can see, a linear reward model cannot fit the data. Even though we’re training the model on analytically computed occupancy measures for the optimal policy, the resulting reward and occupancy frequencies diverge sharply.
train_mce_irl(om, hidden_sizes=[])
--------------------------
| grad_norm | 32.6 |
| iteration | 0 |
| linf_delta | 33.7 |
| weight_norm | 0.79 |
--------------------------
--------------------------
| grad_norm | 4.51 |
| iteration | 250 |
| linf_delta | 20.3 |
| weight_norm | 1.67 |
--------------------------
--------------------------
| grad_norm | 2.64 |
| iteration | 500 |
| linf_delta | 16.7 |
| weight_norm | 2.92 |
--------------------------
--------------------------
| grad_norm | 1.91 |
| iteration | 750 |
| linf_delta | 14.6 |
| weight_norm | 4.05 |
--------------------------
Imitation stats: {'n_traj': 128, 'return_min': -12.0, 'return_mean': 99.7734375, 'return_std': 39.96936616877473, 'return_max': 194.0, 'len_min': 40, 'len_mean': 40.0, 'len_std': 0.0, 'len_max': 40}


<imitation.algorithms.mce_irl.MCEIRL at 0x7fb9f42d0370>
Now, let’s try using a very simple nonlinear reward model: an MLP with a single hidden layer. We first train it on the analytically computed occupancy measures. This should give a very precise result.
train_mce_irl(om, hidden_sizes=[256])
--------------------------
| grad_norm | 71.9 |
| iteration | 0 |
| linf_delta | 29.5 |
| weight_norm | 11.5 |
--------------------------
--------------------------
| grad_norm | 0.356 |
| iteration | 250 |
| linf_delta | 0.189 |
| weight_norm | 17.9 |
--------------------------
--------------------------
| grad_norm | 0.396 |
| iteration | 500 |
| linf_delta | 0.0957 |
| weight_norm | 20.7 |
--------------------------
--------------------------
| grad_norm | 0.179 |
| iteration | 750 |
| linf_delta | 0.0383 |
| weight_norm | 22.4 |
--------------------------
Imitation stats: {'n_traj': 128, 'return_min': 296.0, 'return_mean': 325.234375, 'return_std': 7.899252708919686, 'return_max': 334.0, 'len_min': 40, 'len_mean': 40.0, 'len_std': 0.0, 'len_max': 40}


<imitation.algorithms.mce_irl.MCEIRL at 0x7fb9f42d00d0>
Then we train it on trajectories sampled from the expert. This gives a stochastic approximation to the occupancy measure, so performance is a little worse. Using more expert trajectories should improve performance – try it!
mce_irl_from_trajs = train_mce_irl(expert_trajs[0:10], hidden_sizes=[256])
--------------------------
| grad_norm | 78.4 |
| iteration | 0 |
| linf_delta | 31 |
| weight_norm | 11.3 |
--------------------------
--------------------------
| grad_norm | 3.14 |
| iteration | 250 |
| linf_delta | 0.265 |
| weight_norm | 27.8 |
--------------------------
--------------------------
| grad_norm | 9.24 |
| iteration | 500 |
| linf_delta | 0.254 |
| weight_norm | 72.3 |
--------------------------
--------------------------
| grad_norm | 17.8 |
| iteration | 750 |
| linf_delta | 0.244 |
| weight_norm | 145 |
--------------------------
Imitation stats: {'n_traj': 128, 'return_min': 298.0, 'return_mean': 326.0546875, 'return_std': 8.24650360924821, 'return_max': 334.0, 'len_min': 40, 'len_mean': 40.0, 'len_std': 0.0, 'len_max': 40}


While the learned reward function is quite different from the true reward function, it induces a virtually identical occupancy measure over the states. In particular, states below the top row get almost the same reward as top-row states. This is because in Cliff World, there is an upward-blowing wind which will push the agent toward the top row with probability 0.3 at every timestep.
Even though the agent only gets reward in the top-row squares, and maximum reward in the top right-hand square, the reward model considers it almost as good to end up in one of the squares below the top right-hand corner, since the wind will eventually blow the agent to the goal square.
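If you want a rough quantitative check as well, something like the following sketch compares the learned and true rewards directly. It assumes the trained MCEIRL instance from the cell above exposes its reward network as reward_net, and mirrors the evaluation pattern used inside train_mce_irl.
import numpy as np
import torch as th
# Evaluate the learned reward on every state and compare it to the ground truth.
learned_rewards = (
    mce_irl_from_trajs.reward_net(
        th.as_tensor(env_single.observation_matrix), None, None, None
    )
    .detach()
    .numpy()
)
true_rewards = env_single.reward_matrix
print(
    "Correlation between learned and true reward:",
    np.corrcoef(learned_rewards, true_rewards)[0, 1],
)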
Learning a Reward Function using Kernel Density#
This demo shows how to train a Pendulum agent (exciting!) with our simple density-based imitation learning baselines. DensityTrainer has a few interesting parameters, but the key ones are:
density_type: this governs whether density is measured on \((s,s')\) pairs (db.STATE_STATE_DENSITY), \((s,a)\) pairs (db.STATE_ACTION_DENSITY), or single states (db.STATE_DENSITY).
is_stationary: determines whether a separate density model is used for each time step \(t\) (False), or the same model is used for transitions at all times (True).
standardise_inputs: if True, each dimension of the agent state vectors will be normalised to have zero mean and unit variance over the training dataset. This can be useful when not all elements of the demonstration vector are on the same scale, or when some elements have too wide a variation to be captured by the fixed kernel width (1 for Gaussian kernel).
kernel: changes the kernel used for non-parametric density estimation. gaussian and exponential are the best bets; see the sklearn docs for the rest.
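For illustration only, a hypothetical alternative configuration (kept commented out, and using the names defined in the cells below) could model non-stationary \((s,s')\) densities with an exponential kernel:
# density_trainer = db.DensityAlgorithm(
#     venv=env,
#     rng=rng,
#     demonstrations=rollouts,
#     rl_algo=imitation_trainer,
#     density_type=db.DensityType.STATE_STATE_DENSITY,
#     is_stationary=False,
#     kernel="exponential",
#     standardise_inputs=True,
# )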
import pprint
from imitation.algorithms import density as db
from imitation.data import types
from imitation.util import util
# Set FAST = False for longer training. Use True for testing and CI.
FAST = True
if FAST:
    N_VEC = 1
    N_TRAJECTORIES = 1
    N_ITERATIONS = 1
    N_RL_TRAIN_STEPS = 100
else:
    N_VEC = 8
    N_TRAJECTORIES = 10
    N_ITERATIONS = 10
    N_RL_TRAIN_STEPS = 100_000
from imitation.policies.serialize import load_policy
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3 import PPO
from imitation.data import rollout
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from imitation.data.wrappers import RolloutInfoWrapper
import gymnasium as gym
import numpy as np
SEED = 42
rng = np.random.default_rng(seed=SEED)
env_name = "Pendulum-v1"
rollout_env = DummyVecEnv(
[lambda: RolloutInfoWrapper(gym.make(env_name)) for _ in range(N_VEC)]
)
expert = load_policy(
"ppo-huggingface",
organization="HumanCompatibleAI",
env_name=env_name,
venv=rollout_env,
)
rollouts = rollout.rollout(
expert,
rollout_env,
rollout.make_sample_until(min_timesteps=2000, min_episodes=57),
rng=rng,
)
env = util.make_vec_env(env_name, n_envs=N_VEC, rng=rng)
imitation_trainer = PPO(
ActorCriticPolicy, env, learning_rate=3e-4, gamma=0.95, ent_coef=1e-4, n_steps=2048
)
density_trainer = db.DensityAlgorithm(
venv=env,
rng=rng,
demonstrations=rollouts,
rl_algo=imitation_trainer,
density_type=db.DensityType.STATE_ACTION_DENSITY,
is_stationary=True,
kernel="gaussian",
kernel_bandwidth=0.4, # found using divination & some palm reading
standardise_inputs=True,
)
density_trainer.train()
# evaluate the expert
expert_rewards, _ = evaluate_policy(expert, env, 100, return_episode_rewards=True)
# evaluate the learner before training
learner_rewards_before_training, _ = evaluate_policy(
density_trainer.policy, env, 100, return_episode_rewards=True
)
def print_stats(density_trainer, n_trajectories, epoch=""):
    stats = density_trainer.test_policy(n_trajectories=n_trajectories)
    print("True reward function stats:")
    pprint.pprint(stats)
    stats_im = density_trainer.test_policy(
        true_reward=False,
        n_trajectories=n_trajectories,
    )
    print(f"Imitation reward function stats, epoch {epoch}:")
    pprint.pprint(stats_im)
novice_stats = density_trainer.test_policy(n_trajectories=N_TRAJECTORIES)
print("Stats before training:")
print_stats(density_trainer, 1)
print("Starting the training!")
for i in range(N_ITERATIONS):
    density_trainer.train_policy(N_RL_TRAIN_STEPS)
    print_stats(density_trainer, 1, epoch=str(i))
Stats before training:
True reward function stats:
{'len_max': 200,
'len_mean': 200.0,
'len_min': 200,
'len_std': 0.0,
'monitor_return_len': 1,
'monitor_return_max': -1493.001723,
'monitor_return_mean': -1493.001723,
'monitor_return_min': -1493.001723,
'monitor_return_std': 0.0,
'n_traj': 1,
'return_max': -1493.001723766327,
'return_mean': -1493.001723766327,
'return_min': -1493.001723766327,
'return_std': 0.0}
Imitation reward function stats, epoch :
{'len_max': 200,
'len_mean': 200.0,
'len_min': 200,
'len_std': 0.0,
'monitor_return_len': 1,
'monitor_return_max': -1749.369344,
'monitor_return_mean': -1749.369344,
'monitor_return_min': -1749.369344,
'monitor_return_std': 0.0,
'n_traj': 1,
'return_max': -2212.1580998897552,
'return_mean': -2212.1580998897552,
'return_min': -2212.1580998897552,
'return_std': 0.0}
Starting the training!
True reward function stats:
{'len_max': 200,
'len_mean': 200.0,
'len_min': 200,
'len_std': 0.0,
'monitor_return_len': 1,
'monitor_return_max': -908.535786,
'monitor_return_mean': -908.535786,
'monitor_return_min': -908.535786,
'monitor_return_std': 0.0,
'n_traj': 1,
'return_max': -908.5357865467668,
'return_mean': -908.5357865467668,
'return_min': -908.5357865467668,
'return_std': 0.0}
Imitation reward function stats, epoch 0:
{'len_max': 200,
'len_mean': 200.0,
'len_min': 200,
'len_std': 0.0,
'monitor_return_len': 1,
'monitor_return_max': -855.283381,
'monitor_return_mean': -855.283381,
'monitor_return_min': -855.283381,
'monitor_return_std': 0.0,
'n_traj': 1,
'return_max': -2239.7023117542267,
'return_mean': -2239.7023117542267,
'return_min': -2239.7023117542267,
'return_std': 0.0}
# evaluate the learner after training
learner_rewards_after_training, _ = evaluate_policy(
density_trainer.policy, env, 100, return_episode_rewards=True
)
Here are the final results. If you set FAST = False in one of the initial cells, you should see that performance after training approaches that of an expert.
print("Mean expert reward:", np.mean(expert_rewards))
print("Mean reward before training:", np.mean(learner_rewards_before_training))
print("Mean reward after training:", np.mean(learner_rewards_after_training))
Mean expert reward: -212.67203443999998
Mean reward before training: -1235.5171938299998
Mean reward after training: -1145.53928535
Train an Agent using Soft Q Imitation Learning#
Soft Q Imitation Learning (SQIL) is a simple algorithm that can be used to clone expert behavior. It’s fundamentally a modification of the DQN algorithm. At each training step, whenever we sample a batch of data from the replay buffer, we also sample a batch of expert data. Expert demonstrations are assigned a reward of 1, while the agent’s own transitions are assigned a reward of 0. This approach encourages the agent to imitate the expert’s behavior, but also to avoid unfamiliar states.
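To make the relabelling concrete, here is a purely illustrative sketch of the idea (a hypothetical helper, not the implementation inside imitation): each training batch concatenates expert and agent transitions, with the rewards replaced by the constants described above.
import numpy as np
def sqil_relabelled_batch(expert_batch, agent_batch):
    """Illustrative only: combine transitions and overwrite the rewards."""
    obs = np.concatenate([expert_batch["obs"], agent_batch["obs"]])
    acts = np.concatenate([expert_batch["acts"], agent_batch["acts"]])
    next_obs = np.concatenate([expert_batch["next_obs"], agent_batch["next_obs"]])
    dones = np.concatenate([expert_batch["dones"], agent_batch["dones"]])
    rewards = np.concatenate([
        np.ones(len(expert_batch["obs"])),   # expert transitions get reward 1
        np.zeros(len(agent_batch["obs"])),   # agent transitions get reward 0
    ])
    return obs, acts, next_obs, dones, rewards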
In this tutorial we will use the imitation library to train an agent using SQIL.
First, we need some expert trajectories in our environment (CartPole-v1).
Note that you can use other environments, but the action space must be discrete for this algorithm.
import datasets
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.data import huggingface_utils
# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-CartPole-v1")
# Convert the dataset to a format usable by the imitation library.
expert_trajectories = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])
Let’s quickly check if the expert is any good. We should usually be able to reach a reward of 500, which is the maximum achievable value.
from imitation.data import rollout
trajectory_stats = rollout.rollout_stats(expert_trajectories)
print(
f"We have {trajectory_stats['n_traj']} trajectories. "
f"The average length of each trajectory is {trajectory_stats['len_mean']}. "
f"The average return of each trajectory is {trajectory_stats['return_mean']}."
)
We have 100 trajectories. The average length of each trajectory is 500.0. The average return of each trajectory is 500.0.
After we collected our expert trajectories, it’s time to set up our imitation algorithm.
from imitation.algorithms import sqil
import gymnasium as gym
venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
sqil_trainer = sqil.SQIL(
venv=venv,
demonstrations=expert_trajectories,
policy="MlpPolicy",
)
As you can see, the untrained policy only gets poor rewards:
from stable_baselines3.common.evaluation import evaluate_policy
reward_before_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)
print(f"Reward before training: {reward_before_training}")
Reward before training: 8.8
After sufficient training (see the note about total_timesteps in the code below), we can match the rewards of the expert (500):
sqil_trainer.train(
total_timesteps=1_000,
) # Note: set to 1_000_000 to obtain good results
reward_after_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)
print(f"Reward after training: {reward_after_training}")
Reward after training: 9.2
Train an Agent using Soft Q Imitation Learning with SAC#
In the previous tutorial, we used Soft Q Imitation Learning (SQIL) on top of the DQN base algorithm. In fact, SQIL can be combined with any off-policy algorithm from stable_baselines3. Here, we train a Pendulum agent using SQIL + SAC.
First, we need some expert trajectories in our environment (Pendulum-v1).
Note that you can use other environments, but the action space must be continuous.
import datasets
from imitation.data import huggingface_utils
# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-Pendulum-v1")
# Convert the dataset to a format usable by the imitation library.
expert_trajectories = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])
Let’s quickly check if the expert is any good.
from imitation.data import rollout
trajectory_stats = rollout.rollout_stats(expert_trajectories)
print(
f"We have {trajectory_stats['n_traj']} trajectories. "
f"The average length of each trajectory is {trajectory_stats['len_mean']}. "
f"The average return of each trajectory is {trajectory_stats['return_mean']}."
)
We have 200 trajectories. The average length of each trajectory is 200.0. The average return of each trajectory is -205.22814517737746.
After we collected our expert trajectories, it’s time to set up our imitation algorithm.
from imitation.algorithms import sqil
from imitation.util.util import make_vec_env
import numpy as np
from stable_baselines3 import sac
SEED = 42
venv = make_vec_env(
"Pendulum-v1",
rng=np.random.default_rng(seed=SEED),
)
sqil_trainer = sqil.SQIL(
venv=venv,
demonstrations=expert_trajectories,
policy="MlpPolicy",
rl_algo_class=sac.SAC,
rl_kwargs=dict(seed=SEED),
)
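Since any off-policy SB3 algorithm can be plugged in this way, a hypothetical TD3 variant (not evaluated in this tutorial) would only change the rl_algo_class argument:
# from stable_baselines3 import td3
# sqil_td3_trainer = sqil.SQIL(
#     venv=venv,
#     demonstrations=expert_trajectories,
#     policy="MlpPolicy",
#     rl_algo_class=td3.TD3,
#     rl_kwargs=dict(seed=SEED),
# )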
As you can see, the untrained policy only gets poor rewards (< 0):
from stable_baselines3.common.evaluation import evaluate_policy
reward_before_training, _ = evaluate_policy(sqil_trainer.policy, venv, 100)
print(f"Reward before training: {reward_before_training}")
Reward before training: -1386.1941136000003
After training, we can observe that the agent has improved considerably (by more than 1000 reward with a full-length run; see the note in the code below), although it does not reach the expert’s performance in this case.
sqil_trainer.train(
total_timesteps=1000,
) # Note: set to 300_000 to obtain good results
reward_after_training, _ = evaluate_policy(sqil_trainer.policy, venv, 100)
print(f"Reward after training: {reward_after_training}")
Reward after training: -1217.9038355900002
Reliably compare algorithm performance#
Did we actually match the expert performance or was it just luck? Did this hyperparameter change actually improve the performance of our algorithm? These are questions that we need to answer when we want to compare the performance of different algorithms or hyperparameters.
imitation
provides some tools to help you answer these questions. For demonstration purposes, we will use Behavior Cloning on the CartPole-v1 environment. We will compare different variants of the trained algorithm, and also compare it with a more sophisticated algorithm, DAgger.
We will start by training a good (but not perfect) expert.
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
env = gym.make("CartPole-v1")
expert = PPO(
policy=MlpPolicy,
env=env,
seed=0,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0003,
n_epochs=10,
n_steps=64,
)
expert.learn(10_000) # set to 100_000 for better performance
<stable_baselines3.ppo.ppo.PPO at 0x7f0419047940>
For comparison, let’s also train a not-quite-expert.
not_expert = PPO(
policy=MlpPolicy,
env=env,
seed=0,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0003,
n_epochs=10,
n_steps=64,
)
not_expert.learn(1_000) # set to 10_000 for slightly better performance
<stable_baselines3.ppo.ppo.PPO at 0x7f041022ef10>
So are they any good? Let’s quickly get a point estimate of their performance.
from stable_baselines3.common.evaluation import evaluate_policy
env.reset(seed=0)
expert_reward, _ = evaluate_policy(expert, env, 1)
not_expert_reward, _ = evaluate_policy(not_expert, env, 1)
print(f"Expert reward: {expert_reward:.2f}")
print(f"Not expert reward: {not_expert_reward:.2f}")
Expert reward: 147.00
Not expert reward: 71.00
But wait! We only ran the evaluation once. What if we got lucky? Let’s run the evaluation a few more times and see what happens.
expert_reward, _ = evaluate_policy(expert, env, 10)
not_expert_reward, _ = evaluate_policy(not_expert, env, 10)
print(f"Expert reward: {expert_reward:.2f}")
print(f"Not expert reward: {not_expert_reward:.2f}")
Expert reward: 143.90
Not expert reward: 83.40
Seems a bit more robust now, but how certain are we? Fortunately, imitation
provides us with tools to answer this.
We will perform a permutation test using the is_significant_reward_improvement
function. We want to be very certain – let’s set the bar high and require a p-value of 0.001.
from imitation.testing.reward_improvement import is_significant_reward_improvement
expert_rewards, _ = evaluate_policy(expert, env, 10, return_episode_rewards=True)
not_expert_rewards, _ = evaluate_policy(
not_expert, env, 10, return_episode_rewards=True
)
significant = is_significant_reward_improvement(
not_expert_rewards, expert_rewards, 0.001
)
print(
f"The expert is {'NOT ' if not significant else ''}significantly better than the not-expert."
)
The expert is NOT significantly better than the not-expert.
Huh, turns out we set the bar too high. We could lower our standards, but that’s for cowards. Instead, we can collect more data and try again.
from imitation.testing.reward_improvement import is_significant_reward_improvement
expert_rewards, _ = evaluate_policy(expert, env, 100, return_episode_rewards=True)
not_expert_rewards, _ = evaluate_policy(
not_expert, env, 100, return_episode_rewards=True
)
significant = is_significant_reward_improvement(
not_expert_rewards, expert_rewards, 0.001
)
print(
f"The expert is {'NOT ' if not significant else ''}significantly better than the not-expert."
)
The expert is significantly better than the not-expert.
Here we go! We can now be 99.9% confident that the expert is better than the not-expert – in this specific case, with these specific trained models. It might still be an extraordinary stroke of luck, or a conspiracy to make us choose the wrong algorithm, but outside of that, we can be pretty sure our data’s correct.
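For intuition, a permutation test roughly works like the following sketch (a conceptual illustration, not the implementation behind is_significant_reward_improvement): it pools both samples, repeatedly shuffles them, and checks how often a random split produces a mean difference at least as large as the one we observed.
import numpy as np
def one_sided_permutation_p_value(worse, better, n_permutations=10_000, seed=0):
    rng = np.random.default_rng(seed)
    worse, better = np.asarray(worse), np.asarray(better)
    observed_diff = better.mean() - worse.mean()
    pooled = np.concatenate([worse, better])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = pooled[len(worse):].mean() - pooled[: len(worse)].mean()
        if diff >= observed_diff:
            count += 1
    return count / n_permutations  # one-sided p-value
# e.g. one_sided_permutation_p_value(not_expert_rewards, expert_rewards) < 0.001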
We can use the same principle to compare imitation learning algorithms. Let’s train a behavior cloning algorithm and see how it compares to the expert. This time, we can lower the bar to the standard “scientific” threshold of 0.05.
Like in the first tutorial, we will start by collecting some expert data. But to spice it up, let’s also get some data from the not-quite-expert.
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv
import numpy as np
rng = np.random.default_rng()
expert_rollouts = rollout.rollout(
expert,
DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
rollout.make_sample_until(min_timesteps=None, min_episodes=50),
rng=rng,
)
expert_transitions = rollout.flatten_trajectories(expert_rollouts)
not_expert_rollouts = rollout.rollout(
not_expert,
DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
rollout.make_sample_until(min_timesteps=None, min_episodes=50),
rng=rng,
)
not_expert_transitions = rollout.flatten_trajectories(not_expert_rollouts)
Let’s try cloning an expert and a non-expert, and see how they compare.
from imitation.algorithms import bc
expert_bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
demonstrations=expert_transitions,
rng=rng,
)
not_expert_bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
demonstrations=not_expert_transitions,
rng=rng,
)
expert_bc_trainer.train(n_epochs=2)
not_expert_bc_trainer.train(n_epochs=2)
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.693 |
| neglogp | 0.694 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.693 |
| neglogp | 0.693 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
---------------------------------
bc_expert_rewards, _ = evaluate_policy(
expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)
bc_not_expert_rewards, _ = evaluate_policy(
not_expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)
significant = is_significant_reward_improvement(
bc_not_expert_rewards, bc_expert_rewards, 0.05
)
print(f"Cloned expert rewards: {bc_expert_rewards}")
print(f"Cloned not-expert rewards: {bc_not_expert_rewards}")
print(
f"Cloned expert is {'NOT ' if not significant else ''}significantly better than the cloned not-expert."
)
Cloned expert rewards: [121.0, 140.0, 155.0, 140.0, 124.0, 113.0, 139.0, 116.0, 108.0, 134.0]
Cloned not-expert rewards: [47.0, 102.0, 76.0, 56.0, 77.0, 103.0, 69.0, 80.0, 98.0, 65.0]
Cloned expert is significantly better than the cloned not-expert.
How about comparing the expert clone to the expert itself?
bc_clone_rewards, _ = evaluate_policy(
expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)
expert_rewards, _ = evaluate_policy(expert, env, 10, return_episode_rewards=True)
significant = is_significant_reward_improvement(bc_clone_rewards, expert_rewards, 0.05)
print(f"Cloned expert rewards: {bc_clone_rewards}")
print(f"Expert rewards: {expert_rewards}")
print(
f"Expert is {'NOT ' if not significant else ''}significantly better than the cloned expert."
)
Cloned expert rewards: [108.0, 133.0, 158.0, 144.0, 136.0, 116.0, 115.0, 129.0, 117.0, 115.0]
Expert rewards: [140.0, 132.0, 154.0, 126.0, 121.0, 138.0, 175.0, 132.0, 132.0, 139.0]
Expert is NOT significantly better than the cloned expert.
Turns out the expert is NOT significantly better than the clone – again, in this case. Note, however, that this is not proof that the clone is as good as the expert – there’s a subtle difference between the two claims in the context of hypothesis testing.
Note: if you changed the duration of the training at the beginning of this tutorial, you might get different results. While this might break the narrative in this tutorial, it’s a good learning opportunity.
When comparing the performance of two agents, algorithms, or hyperparameter sets, always remember the scope of what you’re testing. In this tutorial, we have one instance of an expert – but RL training is famously unstable, so another training run with another random seed would likely produce a slightly different result. So ideally, we would like to repeat this procedure several times, training the same agent with different random seeds, and then compare the average performance of the two agents.
Even then, this is just on one environment, with one algorithm. So be wary of generalizing your results too much.
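In code, that multi-seed procedure would look roughly like the commented sketch below (too slow to actually run in this tutorial; the per-seed mean returns would then feed into the same kind of hypothesis test):
# mean_returns = []
# for seed in range(5):
#     agent = PPO(policy=MlpPolicy, env=env, seed=seed, batch_size=64, n_steps=64)
#     agent.learn(10_000)
#     seed_rewards, _ = evaluate_policy(agent, env, 100, return_episode_rewards=True)
#     mean_returns.append(np.mean(seed_rewards))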
We can also use the same method to compare different algorithms. While CartPole is pretty easy, we can make it more difficult by decreasing the number of episodes in our dataset, and generating them with a suboptimal policy:
rollouts = rollout.rollout(
expert,
DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
rollout.make_sample_until(min_timesteps=None, min_episodes=1),
rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)
Let’s try training a behavior cloning algorithm on this dataset.
Note that for DAgger, we have to cheat a little bit – it’s allowed to use the expert policy to generate additional data. For the purposes of this tutorial, we’ll stick with this to avoid spending hours training an expert for a more complex environment.
So while this little experiment isn’t definitive proof that DAgger is better than BC, you can use the same method to compare any two algorithms.
from imitation.algorithms.dagger import SimpleDAggerTrainer
import tempfile
bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
demonstrations=transitions,
rng=rng,
)
bc_trainer.train(n_epochs=1)
with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
    print(tmpdir)
    dagger_bc_trainer = bc.BC(
        observation_space=env.observation_space,
        action_space=env.action_space,
        rng=np.random.default_rng(),
    )
    dagger_trainer = SimpleDAggerTrainer(
        venv=DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
        scratch_dir=tmpdir,
        expert_policy=expert,
        bc_trainer=dagger_bc_trainer,
        rng=np.random.default_rng(),
    )
    dagger_trainer.train(5000)
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.693 |
| neglogp | 0.694 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
---------------------------------
/tmp/dagger_example_o3r5tw84
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000693 |
| entropy | 0.693 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 72.5 |
| loss | 0.692 |
| neglogp | 0.693 |
| prob_true_act | 0.5 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 43 |
| return_mean | 25.8 |
| return_min | 17 |
| return_std | 10.2 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000665 |
| entropy | 0.665 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 75.5 |
| loss | 0.544 |
| neglogp | 0.545 |
| prob_true_act | 0.586 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 123 |
| return_mean | 71 |
| return_min | 29 |
| return_std | 35 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.000251 |
| entropy | 0.251 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 88 |
| loss | 0.138 |
| neglogp | 0.138 |
| prob_true_act | 0.892 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 199 |
| return_mean | 163 |
| return_min | 115 |
| return_std | 28 |
---------------------------------
--------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.00021 |
| entropy | 0.21 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 99.7 |
| loss | 0.131 |
| neglogp | 0.131 |
| prob_true_act | 0.897 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 160 |
| return_mean | 143 |
| return_min | 123 |
| return_std | 11.9 |
--------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -6.09e-05 |
| entropy | 0.0609 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 110 |
| loss | 0.0179 |
| neglogp | 0.0179 |
| prob_true_act | 0.983 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 134 |
| return_mean | 125 |
| return_min | 115 |
| return_std | 6.43 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -4.73e-05 |
| entropy | 0.0473 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 120 |
| loss | 0.0142 |
| neglogp | 0.0142 |
| prob_true_act | 0.986 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 137 |
| return_mean | 129 |
| return_min | 122 |
| return_std | 5.97 |
---------------------------------
--------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -3.7e-05 |
| entropy | 0.037 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 129 |
| loss | 0.0105 |
| neglogp | 0.0105 |
| prob_true_act | 0.99 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 132 |
| return_mean | 127 |
| return_min | 120 |
| return_std | 4.2 |
--------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -1.07e-05 |
| entropy | 0.0107 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 137 |
| loss | 0.00296 |
| neglogp | 0.00297 |
| prob_true_act | 0.997 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 149 |
| return_mean | 137 |
| return_min | 125 |
| return_std | 10.3 |
---------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 500 |
| ent_loss | -5.54e-05 |
| entropy | 0.0554 |
| epoch | 3 |
| l2_loss | 0 |
| l2_norm | 145 |
| loss | 0.0339 |
| neglogp | 0.034 |
| prob_true_act | 0.973 |
| samples_so_far | 16032 |
| rollout/ | |
| return_max | 129 |
| return_mean | 122 |
| return_min | 112 |
| return_std | 6.56 |
---------------------------------
--------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -1.5e-05 |
| entropy | 0.015 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 145 |
| loss | 0.00419 |
| neglogp | 0.0042 |
| prob_true_act | 0.996 |
| samples_so_far | 32 |
| rollout/ | |
| return_max | 216 |
| return_mean | 158 |
| return_min | 125 |
| return_std | 38.2 |
--------------------------------
---------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 500 |
| ent_loss | -1.71e-05 |
| entropy | 0.0171 |
| epoch | 3 |
| l2_loss | 0 |
| l2_norm | 152 |
| loss | 0.00437 |
| neglogp | 0.00439 |
| prob_true_act | 0.996 |
| samples_so_far | 16032 |
| rollout/ | |
| return_max | 165 |
| return_mean | 140 |
| return_min | 120 |
| return_std | 16.1 |
---------------------------------
After training both BC and DAgger, let’s compare their performances again! We expect DAgger to be better – after all, it’s a more advanced algorithm. But is it significantly better?
bc_rewards, _ = evaluate_policy(bc_trainer.policy, env, 10, return_episode_rewards=True)
dagger_rewards, _ = evaluate_policy(
dagger_trainer.policy, env, 10, return_episode_rewards=True
)
significant = is_significant_reward_improvement(bc_rewards, dagger_rewards, 0.05)
print(f"BC rewards: {bc_rewards}")
print(f"DAgger rewards: {dagger_rewards}")
print(
f"Our DAgger agent is {'NOT ' if not significant else ''}significantly better than BC."
)
BC rewards: [82.0, 69.0, 68.0, 115.0, 98.0, 80.0, 97.0, 118.0, 62.0, 78.0]
DAgger rewards: [126.0, 135.0, 118.0, 141.0, 149.0, 129.0, 177.0, 126.0, 121.0, 130.0]
Our DAgger agent is significantly better than BC.
If you increased the number of training iterations for the expert (in the first cell of the tutorial), you should see that DAgger indeed performs better than BC. If you didn’t, you may see the opposite result. Yet another reason to be careful when interpreting results!
Finally, let’s take a moment to remember the limitations of this experiment. We’re comparing two algorithms on one environment, with one dataset. We’re also using a suboptimal expert policy, which might not be the best choice for BC. If you want to convince yourself that DAgger is better than BC, you should pick a more complex environment, run this experiment several times with different random seeds, and perform some hyperparameter optimization to make sure you’re not just using unlucky hyperparameters. At the end, you would also need to run the same hypothesis test across the average returns of several independent runs.
But now you have all the pieces of the puzzle to do that!
Train Behavior Cloning in a Custom Environment#
You can use imitation to train a policy (and, for many imitation learning algorithms, learn rewards) in a custom environment.
Step 1: Define the environment#
We will use a simple ObservationMatching environment as an example. The premise is simple – the agent receives a vector of observations, and must output a vector of actions that matches the observations as closely as possible.
If you have your own environment that you’d like to use, you can replace the code below with your own environment. Make sure it complies with the standard Gym API, and that the observation and action spaces are specified correctly.
from typing import Any, Dict, Optional
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box

class ObservationMatchingEnv(gym.Env):
    def __init__(self, num_options: int = 2):
        self.state = None
        self.num_options = num_options
        self.observation_space = Box(0, 1, shape=(num_options,))
        self.action_space = Box(0, 1, shape=(num_options,))

    def reset(self, seed: Optional[int] = None, options: Optional[Dict[str, Any]] = None):
        super().reset(seed=seed, options=options)
        self.state = self.observation_space.sample()
        return self.state, {}

    def step(self, action):
        reward = -np.abs(self.state - action).mean()
        self.state = self.observation_space.sample()
        return self.state, reward, False, False, {}
Step 2: create the environment#
From here, we have two options:
Add the environment to the gym registry, and use it with existing utilities (e.g. make)
Use the environment directly
You only need to execute the cells in step 2a or step 2b to proceed.
At the end of these steps, we want to have:
env: a single environment that we can use for training an expert with SB3
venv: a vectorized environment where each individual environment is wrapped in RolloutInfoWrapper, that we can use for collecting rollouts with imitation
Step 2a (recommended): add the environment to the gym registry#
The standard approach is adding the environment to the gym registry.
gym.register(
id="custom/ObservationMatching-v0",
entry_point=ObservationMatchingEnv, # This can also be the path to the class, e.g. `observation_matching:ObservationMatchingEnv`
max_episode_steps=500,
)
After registering, you can create an environment with gym.make(env_id), which automatically handles the TimeLimit wrapper.
To create a vectorized env, you can use the make_vec_env helper function (Option A), or create it directly (Options B1 and B2).
from gymnasium.wrappers import TimeLimit
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.util.util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
# Create a single environment for training an expert with SB3
env = gym.make("custom/ObservationMatching-v0")
# Create a vectorized environment for training with `imitation`
# Option A: use the `make_vec_env` helper function - make sure to pass `post_wrappers=[lambda env, _: RolloutInfoWrapper(env)]`
venv = make_vec_env(
"custom/ObservationMatching-v0",
rng=np.random.default_rng(),
n_envs=4,
post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],
)
# Option B1: use a custom env creator, and create VecEnv directly
# def _make_env():
# """Helper function to create a single environment. Put any logic here, but make sure to return a RolloutInfoWrapper."""
# _env = gym.make("custom/ObservationMatching-v0")
# _env = RolloutInfoWrapper(_env)
# return _env
#
# venv = DummyVecEnv([_make_env for _ in range(4)])
#
# # Option B2: we can also use a parallel VecEnv implementation
# venv = SubprocVecEnv([_make_env for _ in range(4)])
Step 2b: directly use the environment#
Alternatively, we can directly initialize the environment by instantiating the class we created earlier, and handle all the additional logic ourselves.
from gymnasium.wrappers import TimeLimit
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv
import numpy as np
# Create a single environment for training with SB3
env = ObservationMatchingEnv()
env = TimeLimit(env, max_episode_steps=500)
# Create a vectorized environment for training with `imitation`
# Option A: use a helper function to create multiple environments
def _make_env():
    """Helper function to create a single environment. Put any logic here, but make sure to return a RolloutInfoWrapper."""
    _env = ObservationMatchingEnv()
    _env = TimeLimit(_env, max_episode_steps=500)
    _env = RolloutInfoWrapper(_env)
    return _env
venv = DummyVecEnv([_make_env for _ in range(4)])
# Option B: use a single environment
# env = TimeLimit(ObservationMatchingEnv(), max_episode_steps=500)
# venv = DummyVecEnv([lambda: RolloutInfoWrapper(env)]) # Wrap a single environment -- only useful for simple testing like this
# Option C: use multiple environments
# venv = DummyVecEnv([lambda: RolloutInfoWrapper(ObservationMatchingEnv()) for _ in range(4)]) # Wrap multiple environments
Step 3: Training#
And now we’re just about done! Whether you used step 2a or 2b, your environment should now be ready to use with SB3 and imitation.
For the sake of completeness, we’ll train a BC model, the same way as in the first tutorial, but with our custom environment.
Keep in mind that while we’re using BC in this tutorial, you can just as easily use any of the other algorithms with the environment prepared in this way.
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
from gymnasium.wrappers import TimeLimit
expert = PPO(
policy=MlpPolicy,
env=env,
seed=0,
batch_size=64,
ent_coef=0.0,
learning_rate=0.0003,
n_epochs=10,
n_steps=64,
)
reward, _ = evaluate_policy(expert, env, 10)
print(f"Reward before training: {reward}")
# Note: if you followed step 2a, i.e. registered the environment, you can use the environment name directly
# expert = PPO(
# policy=MlpPolicy,
# env="custom/ObservationMatching-v0",
# seed=0,
# batch_size=64,
# ent_coef=0.0,
# learning_rate=0.0003,
# n_epochs=10,
# n_steps=64,
# )
expert.learn(10_000) # Note: set to 100000 to train a proficient expert
reward, _ = evaluate_policy(expert, expert.get_env(), 10)
print(f"Expert reward: {reward}")
Reward before training: -247.31714964704588
Expert reward: -100.7207043
rng = np.random.default_rng()
rollouts = rollout.rollout(
expert,
venv,
rollout.make_sample_until(min_timesteps=None, min_episodes=50),
rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)
from imitation.algorithms import bc
bc_trainer = bc.BC(
observation_space=env.observation_space,
action_space=env.action_space,
demonstrations=transitions,
rng=rng,
)
As before, the untrained policy only gets poor rewards:
reward_before_training, _ = evaluate_policy(bc_trainer.policy, env, 10)
print(f"Reward before training: {reward_before_training}")
Reward before training: -250.60812856666743
After training, we can get much closer to the expert’s performance:
bc_trainer.train(n_epochs=1)
reward_after_training, _ = evaluate_policy(bc_trainer.policy, env, 10)
print(f"Reward after training: {reward_after_training}")
--------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 0 |
| ent_loss | -0.00284 |
| entropy | 2.84 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 68.5 |
| loss | 2.34 |
| neglogp | 2.34 |
| prob_true_act | 0.101 |
| samples_so_far | 32 |
--------------------------------
--------------------------------
| batch_size | 32 |
| bc/ | |
| batch | 500 |
| ent_loss | -0.00181 |
| entropy | 1.81 |
| epoch | 0 |
| l2_loss | 0 |
| l2_norm | 75.9 |
| loss | 1.06 |
| neglogp | 1.06 |
| prob_true_act | 0.357 |
| samples_so_far | 16032 |
--------------------------------
Reward after training: -41.17174798576161
API Reference#
imitation: implementations of imitation and reward learning algorithms.
Developer Guide#
This guide explains the library structure of imitation. The code is organized such that logically similar files are grouped into a subpackage. We maintain the following subpackages in src/imitation:
algorithms: the core implementation of imitation and reward learning algorithms.
data: modules to collect, store and manipulate transitions and trajectories from RL environments.
envs: provides test environments.
policies: provides modules that define policies and methods to manipulate them (e.g., serialization).
regularization: implements a variety of regularization techniques for NN weights.
rewards: modules to build, serialize and preprocess neural network based reward functions.
scripts: command-line scripts for running experiments through Sacred.
util: provides utility functions like logging, configurations, etc.
Algorithms#
The imitation.algorithms.base module defines the following two classes:
BaseImitationAlgorithm: Base class for all imitation algorithms.
DemonstrationAlgorithm: Base class for all demonstration-based algorithms like BC, IRL, etc. This class subclasses BaseImitationAlgorithm. Demonstration algorithms offer the following methods and properties: a policy property that returns a policy imitating the demonstration data, and a set_demonstrations method that sets the demonstration data for learning.
All of the algorithms provide the train method for training an agent and/or a reward network.
All the available algorithms are present in algorithms/, with each algorithm in a distinct file. Adversarial algorithms like AIRL and GAIL are present in algorithms/adversarial.
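As a concrete illustration of this interface, a new demonstration-based algorithm might be sketched as follows. This is a minimal, hypothetical example that assumes only the abstract methods and properties listed above; it is not an algorithm shipped with the library.
from imitation.algorithms import base

class MyImitationAlgorithm(base.DemonstrationAlgorithm):
    """Hypothetical skeleton of a new demonstration-based algorithm."""

    def __init__(self, *, demonstrations, policy, custom_logger=None):
        self._policy = policy
        super().__init__(demonstrations=demonstrations, custom_logger=custom_logger)

    def set_demonstrations(self, demonstrations) -> None:
        # Convert/store the demonstrations in whatever form the algorithm needs.
        self._demonstrations = demonstrations

    @property
    def policy(self):
        return self._policy

    def train(self, n_epochs: int = 1) -> None:
        # Update self._policy from self._demonstrations here.
        raise NotImplementedError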
Data#
Modules handling environment data.
For example: types for transitions/trajectories; methods to compute rollouts; buffers to store transitions; helpers for these modules.
data.wrapper.BufferingWrapper: Wraps a vectorized environment VecEnv to save the trajectories from all the environments in a buffer.
data.wrapper.RolloutInfoWrapper: Wraps a gym.Env environment to log the original observations and rewards received from the environment. The observations and rewards of the entire episode are logged in the info dictionary with the key "rollout", in the final time step of the episode. This wrapper is useful for saving rollout trajectories, especially in cases where you want to bypass the reward and/or observation overrides from other wrappers. See data.rollout.unwrap_traj for details and scripts/train_rl.py for an example use case.
data.rollout.rollout: Generates rollouts by taking in any policy as input along with the environment.
Policies#
The imitation.policies subpackage contains the following modules:
policies.base: defines commonly used policies across the library like FeedForward32Policy, SAC1024Policy, NormalizeFeaturesExtractor, etc.
policies.exploration_wrapper: defines the ExplorationWrapper class that wraps a policy to create a partially randomized policy useful for exploration.
policies.replay_buffer_wrapper: defines the ReplayBufferRewardWrapper to wrap a replay buffer that returns transitions with rewards specified by a reward function.
policies.serialize: defines various functions to save and load serialized policies from the disk or the Hugging Face hub.
Regularization#
The imitation.regularization subpackage provides an API for creating neural network regularizers. It provides classes such as regularizers.LpRegularizer and regularizers.WeightDecayRegularizer to regularize the loss function and the weights of a network, respectively. The updaters.IntervalParamScaler class scales the lambda hyperparameter of a regularizer up when the ratio of validation to training loss rises above an upper bound, and scales it down when the ratio drops below a lower bound.
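Conceptually, that interval-based scaling behaves like this small illustrative function (a sketch of the idea only, not the actual updaters API; the bounds and factor here are made-up defaults):
def scale_lambda(lam, val_loss, train_loss, lower_bound=0.9, upper_bound=1.1, scaling_factor=2.0):
    """Illustrative only: adjust regularization strength from the val/train loss ratio."""
    ratio = val_loss / train_loss
    if ratio > upper_bound:
        return lam * scaling_factor  # likely overfitting: regularize more
    if ratio < lower_bound:
        return lam / scaling_factor  # likely underfitting: regularize less
    return lam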
Rewards#
The imitation.rewards
subpackage contains code related to building, serializing, and loading reward networks.
Some of the classes include:
rewards.reward_nets.RewardNet: the base reward network class. Reward networks can take state, action, and the next state as input to predict the reward. The forward method is used while training the network, whereas the predict method is used during evaluation.
rewards.reward_nets.BasicRewardNet: builds an MLP reward network.
rewards.reward_nets.CnnRewardNet: builds a CNN-based reward network.
rewards.reward_nets.RewardEnsemble: builds an ensemble of reward networks.
rewards.reward_wrapper.RewardVecEnvWrapper: wraps a VecEnv with a custom RewardFn. The default reward function of the environment is overridden with the passed reward function, and the original rewards are stored in the info_dict with the original_env_rew key. This class is used to override the original reward function of an environment with a learned reward function from reward learning algorithms like preference comparisons.
The imitation.rewards.serialize
module contains functions to load serialized reward functions.
For more see the Reward Networks Tutorial.
Scripts#
We use Sacred to provide a command-line interface to run the experiments. The scripts to run the end-to-end experiments are available in scripts/. You can take a look at the following doc links to understand how to use Sacred:
Experiment Overview: Explains how to create and run experiments. Each script, defined in scripts/, has a corresponding experiment object, defined in scripts/config, with the experiment object and Python source files named after the algorithm(s) supported. For example, the train_rl_ex object is defined in scripts.config.train_rl and its main function is in scripts.train_rl.
Ingredients: Explains how to use ingredients to avoid code duplication across experiments. The ingredients used in our experiments are defined in scripts/ingredients/:
This ingredient provides a number of logging utilities.
This ingredient provides (expert) demonstrations to learn from.
This ingredient provides a vectorized gym environment.
This ingredient provides an expert policy.
This ingredient provides a reward network.
This ingredient provides a reinforcement learning algorithm from stable-baselines3.
This ingredient provides a newly constructed stable-baselines3 policy.
This ingredient provides Weights & Biases logging.
Configurations: Explains how to use configurations to parametrize runs. The configurations for different algorithms are defined in their respective files in scripts/. Some of the commonly used configs and ingredients used across algorithms are defined in scripts/ingredients/.
Command-Line Interface: Explains how to run the experiments through the command-line interface. Also, note the section on how to print configs to verify the configurations used for the run.
Controlling Randomness: Explains how to control randomness by seeding experiments through Sacred.
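As an illustration, Sacred scripts are run as Python modules; the exact configuration names vary per script, so inspect the printed config first (the seed update below is just an example of Sacred's generic "with" syntax):
python -m imitation.scripts.train_rl print_config
python -m imitation.scripts.train_rl with seed=0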
Util#
imitation.util.logger.HierarchicalLogger: A logger that supports contexts for accumulating the mean of values of all the logged keys. The logger internally maintains one separate stable_baselines3.common.logger.Logger object for logging the mean values, and one Logger object for the raw values for each context. The accumulate_means context cannot be called inside an already open accumulate_means context. The imitation.util.logger.configure function can be used to easily construct a HierarchicalLogger object.
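A small usage sketch, assuming only the interface described above (the output folder and format strings here are placeholders):
from imitation.util import logger as imit_logger
# Build a HierarchicalLogger writing to stdout and CSV files.
custom_logger = imit_logger.configure("output/example_log", ["stdout", "csv"])
with custom_logger.accumulate_means("reward"):
    custom_logger.record("loss", 0.5)  # recorded under the "reward" context
    custom_logger.dump(step=0)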
imitation.util.networks
: This module provides some additional neural network layers that can be used for imitation like RunningNorm
and EMANorm
that normalize their inputs.
The module also provides functions like build_mlp
and build_cnn
to quickly build neural networks.
imitation.util.util
: This module provides miscellaneous util functions like make_vec_env
to easily construct vectorized environments and safe_to_tensor
that converts a NumPy array to a PyTorch tensor.
imitation.util.video_wrapper.VideoWrapper
: A wrapper to record rendered videos from an environment.
Contributing#
Code of Conduct#
To ensure that the imitation community remains open and inclusive, we have a few ground rules that we ask contributors to adhere to. This isn’t an exhaustive list of things that you can’t do. Rather, take it in the spirit in which it’s intended — a guide to make it easier to enrich all of us and the technical communities in which we participate.
Be friendly and patient.
Be welcoming. We strive to be a community that welcomes and supports people of all backgrounds and identities. This includes, but is not limited to members of any race, ethnicity, culture, national origin, colour, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.
Be considerate. Your work will be used by other people, and you in turn will depend on the work of others. Any decision you take will affect users and colleagues, and you should take those consequences into account when making decisions. Remember that we’re a world-wide community, so you might not be communicating in someone else’s primary language.
Be respectful. Not all of us will agree all the time, but disagreement is no excuse for poor behavior and poor manners. We might all experience some frustration now and then, but we cannot allow that frustration to turn into a personal attack. Members of the imitation community should be respectful when dealing with other members as well as with people outside the imitation community.
Be careful in the words that you choose. We are a community of professionals, and we conduct ourselves professionally. Be kind to others. Do not insult or put down other participants. Harassment and other exclusionary behavior aren’t acceptable. This includes, but is not limited to:
Violent threats or language directed against another person.
Discriminatory jokes and language.
Posting sexually explicit or violent material.
Posting (or threatening to post) other people’s personally identifying information without their consent (“doxing”).
Personal insults, especially those using racist or sexist terms.
Unwelcome sexual attention.
Advocating for, or encouraging, any of the above behavior.
Repeated harassment of others. In general, if someone asks you to stop, then stop.
When we disagree, try to understand why. It is important that we resolve disagreements and differing views constructively. Focus on helping to resolve issues and learning from mistakes.
Adapted from the original text courtesy of the Django project, licensed under a Creative Commons Attribution 3.0 License.
Ways to contribute#
There are four main ways you can contribute to imitation: reporting bugs, suggesting new features, contributing to the documentation, and contributing to the codebase. Each is described in the sections below.
Please note that by contributing to the project, you are agreeing to license your work under imitation’s MIT license, as per GitHub’s terms of service.
Reporting bugs#
This section guides you through submitting a new bug report for imitation. Following the guidelines below helps maintainers and the community understand your report and reproduce the issue.
You can submit a new bug report by creating an issue on GitHub and labeling it as a bug. Before you do so, please make sure that:
You are using the latest stable version of imitation — to check your version, run pip show imitation,
You have read the relevant section of the documentation that relates to your issue,
You have checked existing bug reports to make sure that your issue has not already been reported, and
You have a minimal, reproducible example of the issue.
When submitting a bug report, please include the following information:
A clear, concise description of the bug,
A minimal, reproducible example of the bug, with installation instructions, code, and error message (a sketch of such an example follows this list),
Information on your OS name and version, Python version, and other relevant information (e.g. hardware configuration if using the GPU), and
Whether the problem arose when upgrading to a certain version of imitation, and if so, what version.
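For reference, a minimal, reproducible example is a short, self-contained script that a maintainer can run directly to reproduce the failure. The sketch below is illustrative only: the reported failure is hypothetical, and the exact BC constructor arguments may differ slightly between imitation versions.

```python
"""Minimal reproducible example (illustrative sketch, not a real bug report)."""
# Installed with (adjust to match your environment):
#   pip install imitation gymnasium
import importlib.metadata

import gymnasium as gym
import numpy as np

from imitation.algorithms import bc

# Report the installed version alongside the script.
print("imitation version:", importlib.metadata.version("imitation"))

env = gym.make("CartPole-v1")
rng = np.random.default_rng(0)

# Constructing BC and then training is the (hypothetical) step that
# triggers the error being reported.
trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    rng=rng,
)
trainer.train(n_epochs=1)  # paste the full traceback produced by this call
```

Including the exact traceback and the output of pip show imitation alongside a script like this usually lets maintainers reproduce the issue quickly.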
Suggesting new features#
This section explains how you can submit a new feature request, including completely new features and minor improvements to existing functionality. Following these guidelines helps maintainers and the community understand your request and intended use cases and find related suggestions.
You can submit a new feature request by creating an issue on GitHub and labeling it as an enhancement. Before you do so, please make sure that:
You have checked the documentation that relates to your request, as the feature may already be available,
You have checked existing feature requests to make sure that there is no similar request already under discussion, and
You have a minimal use case that describes the relevance of the feature.
When you submit the feature request:
Use a clear and descriptive title for the GitHub issue to easily identify the suggestion.
Describe the current behavior, and explain what behavior you expected to see instead and why.
If you want to request an API change, provide examples of how the feature would be used.
If you want to request a new algorithm implementation, please provide a link to the relevant paper or publication.
Contributing to the documentation#
One of the simplest ways to start contributing to imitation is through improving the documentation. Currently, our documentation has some gaps, and we would love to have you help us fill them. You can help by adding missing sections of the API docs, editing existing content to make it more readable, clear and accessible, or contributing new content, such as tutorials and FAQs.
If you have struggled to understand something about our codebase and managed to figure it out in the end, please consider improving the relevant documentation section, or adding a tutorial or a FAQ entry, so that other users can learn from your experience.
Before submitting a pull request, please create an issue with the documentation label so that we can track the gap. You can then reference the issue in your pull request by including the issue number.
Contributing to the codebase#
You can contribute to the codebase by proposing solutions to issues or feature suggestions you’ve raised yourself, or by selecting an existing issue to work on. Please make sure to create an issue on GitHub before you start working on a pull request, as explained in Reporting bugs and Suggesting new features.
Once you’re ready to start working on your pull request, please make sure to follow our coding style guidelines:
PEP8, with line width 88.
Use the black autoformatter.
Follow the Google Python Style Guide unless it conflicts with the above. Examples of Google-style docstrings can be found here; a short sketch is also shown after this list.
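As an illustration, here is a minimal function written in the expected style: type-annotated, black-compatible within the 88-character line width, and documented with a Google-style docstring. The function itself is invented for this example and is not part of the imitation API.

```python
from typing import Sequence

import numpy as np


def mean_episode_return(episode_returns: Sequence[float]) -> float:
    """Computes the mean return over a batch of episodes.

    Args:
        episode_returns: Undiscounted return of each episode.

    Returns:
        The arithmetic mean of `episode_returns`.

    Raises:
        ValueError: If `episode_returns` is empty.
    """
    if len(episode_returns) == 0:
        raise ValueError("episode_returns must be non-empty.")
    return float(np.mean(episode_returns))
```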
Before you submit, please make sure that:
Your PR includes unit tests for any new features (see the test sketch after this list).
Your PR includes type annotations, except when it would make the code significantly more complex.
You have run the unit tests and there are no errors. We use pytest for unit testing: run pytest tests/ to run the test suite.
You should run pre-commit run for linting and static type checks. We use pytype for static type analysis. You may wish to configure this as a Git commit hook: pre-commit install
These checks are run on CircleCI and are required to pass before merging.
Additionally, we track test coverage with CodeCov and require that code coverage does not decrease. This can be overridden by maintainers in exceptional cases. Files in imitation/{examples,scripts}/ have no coverage requirements.
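For new features, a unit test can be as small as the sketch below. The Trajectory dataclass from imitation.data.types is part of the library, but this particular test is illustrative rather than taken from the existing test suite; field names and validation details may differ between versions.

```python
"""Illustrative unit tests (run with `pytest tests/`)."""
import numpy as np
import pytest

from imitation.data import types


def test_trajectory_length_counts_actions() -> None:
    """A trajectory with N actions should report length N."""
    obs = np.zeros((3, 4), dtype=np.float32)  # N + 1 observations
    acts = np.zeros((2, 1), dtype=np.float32)  # N actions
    traj = types.Trajectory(obs=obs, acts=acts, infos=None, terminal=True)
    assert len(traj) == 2


def test_trajectory_rejects_mismatched_lengths() -> None:
    """Mismatched observation/action counts should be rejected."""
    obs = np.zeros((2, 4), dtype=np.float32)  # missing the final observation
    acts = np.zeros((2, 1), dtype=np.float32)
    with pytest.raises(ValueError):
        types.Trajectory(obs=obs, acts=acts, infos=None, terminal=True)
```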
Thank you for your interest in imitation!
As an open-source project, we welcome contributions from all users, and are always open to any feedback or suggestions. This section of the documentation is intended to help you understand the process of contributing to the project.
To keep the community open and inclusive, we have developed a Code of Conduct. If you are not familiar with our Code of Conduct, take a minute to read it before starting your first contribution.
Release Notes#
v1.0.0 – first stable release#
Released on 2023-10-31 - GitHub - PyPI
We're pleased to announce the first stable release of imitation. Key improvements include:
- Gymnasium compatibility; Gymnasium has superseded Gym
- Tuned hyperparameters and benchmark results for common algorithm-environment pairs (see release artifact attached).
- New algorithm (beta): SQIL
For more information, see the changelog below.
What's Changed
- Updated Installation Instructions by @ernestum in #760
- Download experts from hf inside tutorials and docs by @jas-ho in #766
- Implementation of the SQIL algorithm by @RedTachyon in #744
- Additional examples of CLI usage by @EdoardoPona in #761
- Dependency fixes by @ernestum in #775
- Tune hyperparameters for kernel density estimation tutorial by @michalzajac-ml in #774
- Tune hyperparameters in tutorials for GAIL and AIRL by @michalzajac-ml in #772
- Introduce interactive policies to gather data from a user by @michalzajac-ml in #776
- Add an option to run SQIL with various off-policy algorithms by @michalzajac-ml in #778
- Complete PR #771 (Tune preference comparison example hyperparameters) by @lukasberglund in #782
- Add CLI for SQIL by @lukasberglund in #784
- Gymnasium Compatibility by @ernestum in #735
- Ensure MyST-NB raises an error when rendering a notebook fails. by @ernestum in #803
- Add a test timeout by @ernestum in #779
- Fix MacOS Pipeline: Include tests not in subdirectories by @AdamGleave in #797
- Remove MuJoCo dependency from SQIL notebook by @AdamGleave in #800
- Add partial support for dictionary observation spaces (bc, density) by @NixGD in #785
- Update gymnasium dependency and render_mode in gym.make by @taufeeque9 in #806
- Upgrade pytype by @ZiyueWang25 in #801
- Reduce training time and improve expert loading code in the tutorials by @ernestum in #810
- Add scripts and configs for hyperparameter tuning by @taufeeque9 in #675
- SQIL and PC performance check fixes by @ernestum in #811
- Running benchmarks by @ernestum in #812
New Contributors
- @jas-ho made their first contribution in #766
- @EdoardoPona made their first contribution in #761
- @michalzajac-ml made their first contribution in #774
- @lukasberglund made their first contribution in #782
- @NixGD made their first contribution in #785
- @ZiyueWang25 made their first contribution in #801
Full Changelog: v0.4.0...v1.0.0
v0.4.0#
Released on 2023-07-17 - GitHub - PyPI
What's Changed
- Continuous Integration: Add support for Mac OS; remove dependency on MuJoCo
- Preference comparison: improved logging, support for active learning based on variance of ensemble.
- HuggingFace integration for model and dataset loading.
- Benchmarking: add results and example configs.
- Documentation: add notebook tutorials; other general improvements.
- General changes: migrate to pathlib; add more type hints to enable mypy as well as pytype.
Full Changelog: v0.3.1...v0.4.0
v0.3.1#
Released on 2022-07-29 - GitHub - PyPI
What's Changed
Main changes:
- Added reward ensembles and conservative reward functions by @levmckinney in #460
- Dropping support for python 3.7 by @levmckinney in #505
Minor changes:
- Docstring and other fixes after #472 by @Rocamonde in #497
- Improve Windows CI by @AdamGleave in #495
Full Changelog: v0.3.0...v0.3.1
v0.3.0: Major improvements#
Released on 2022-07-26 - GitHub - PyPI
New features:
- New algorithm: Deep RL from Human Preferences (thanks to @ejnnr @norabelrose et al)
- Notebooks with examples (thanks to @ernestum)
- Serialized trajectories using NumPy arrays rather than pickles, ensuring stability across versions and saving space on disk (thanks to @norabelrose)
- Weights and Biases logging support (thanks to @yawen-d)
Improvements:
- Port MCE IRL from JAX to Torch, eliminating the JAX dependency. (thanks to @qxcv)
- Refactor RewardNet code to be independent from AIRL, and shared across algorithms. (thanks to @ejnnr)
- Add Windows support including continuous integration. (thanks to @taufeeque9)
v0.2.0: First PyTorch release#
v0.1.1: Final TF1 release#
v0.1.0: Initial release#
Released on 2020-05-09 - GitHub - PyPI
Prototype versions of AIRL, GAIL, BC, DAgger.
License#
This license is also available on the project repository.
MIT License
Copyright (c) 2019-2022 Center for Human-Compatible AI and Google LLC
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.