
Reliably compare algorithm performance

Did we actually match the expert performance or was it just luck? Did this hyperparameter change actually improve the performance of our algorithm? These are questions that we need to answer when we want to compare the performance of different algorithms or hyperparameters.

imitation provides some tools to help you answer these questions. For demonstration purposes, we will use Behavior Cloning on the CartPole-v1 environment. We will compare different variants of the trained algorithm, and also compare it with a more sophisticated algorithm, DAgger.

We will start by training a good (but not perfect) expert.

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

env = gym.make("CartPole-v1")
expert = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
expert.learn(10_000)  # set to 100_000 for better performance
<stable_baselines3.ppo.ppo.PPO at 0x7f0419047940>

For comparison, let’s also train a not-quite-expert.

not_expert = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)

not_expert.learn(1_000)  # set to 10_000 for slightly better performance
<stable_baselines3.ppo.ppo.PPO at 0x7f041022ef10>

So are they any good? Let’s quickly get a point estimate of their performance.

from stable_baselines3.common.evaluation import evaluate_policy

env.reset(seed=0)

expert_reward, _ = evaluate_policy(expert, env, 1)
not_expert_reward, _ = evaluate_policy(not_expert, env, 1)

print(f"Expert reward: {expert_reward:.2f}")
print(f"Not expert reward: {not_expert_reward:.2f}")
Expert reward: 147.00
Not expert reward: 71.00

But wait! We only ran the evaluation once. What if we got lucky? Let’s run the evaluation a few more times and see what happens.

expert_reward, _ = evaluate_policy(expert, env, 10)
not_expert_reward, _ = evaluate_policy(not_expert, env, 10)

print(f"Expert reward: {expert_reward:.2f}")
print(f"Not expert reward: {not_expert_reward:.2f}")
Expert reward: 143.90
Not expert reward: 83.40

Seems a bit more robust now, but how certain are we? Fortunately, imitation provides us with tools to answer this.

We will perform a permutation test using the is_significant_reward_improvement function. We want to be very certain – let’s set the bar high and require a p-value of 0.001.

from imitation.testing.reward_improvement import is_significant_reward_improvement

expert_rewards, _ = evaluate_policy(expert, env, 10, return_episode_rewards=True)
not_expert_rewards, _ = evaluate_policy(
    not_expert, env, 10, return_episode_rewards=True
)

significant = is_significant_reward_improvement(
    not_expert_rewards, expert_rewards, 0.001
)

print(
    f"The expert is {'NOT ' if not significant else ''}significantly better than the not-expert."
)
The expert is NOT significantly better than the not-expert.

Huh, turns out we set the bar too high. We could lower our standards, but that’s for cowards. Instead, we can collect more data and try again.

from imitation.testing.reward_improvement import is_significant_reward_improvement

expert_rewards, _ = evaluate_policy(expert, env, 100, return_episode_rewards=True)
not_expert_rewards, _ = evaluate_policy(
    not_expert, env, 100, return_episode_rewards=True
)

significant = is_significant_reward_improvement(
    not_expert_rewards, expert_rewards, 0.001
)

print(
    f"The expert is {'NOT ' if not significant else ''}significantly better than the not-expert."
)
The expert is significantly better than the not-expert.

Here we go! We can now be 99.9% confident that the expert is better than the not-expert – in this specific case, with these specific trained models. It might still be an extraordinary stroke of luck, or a conspiracy to make us choose the wrong algorithm, but outside of that, we can be pretty sure our data’s correct.
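
Under the hood, is_significant_reward_improvement runs a permutation test on the two reward samples. To build intuition for what that means, here is a minimal conceptual sketch, assuming a difference-of-means test statistic (a simplified illustration, not the library's exact implementation):

import numpy as np

def permutation_test_sketch(old_rewards, new_rewards, n_permutations=10_000, seed=0):
    # Observed improvement of the "new" policy over the "old" one.
    old_rewards = np.asarray(old_rewards, dtype=float)
    new_rewards = np.asarray(new_rewards, dtype=float)
    observed = new_rewards.mean() - old_rewards.mean()

    # Under the null hypothesis both samples come from the same distribution,
    # so any relabeling of the pooled rewards is equally plausible.
    pooled = np.concatenate([old_rewards, new_rewards])
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[len(old_rewards):].mean() - shuffled[: len(old_rewards)].mean()
        count += diff >= observed
    # Fraction of relabelings that look at least as extreme as the observed gap.
    return count / n_permutations

approx_p = permutation_test_sketch(not_expert_rewards, expert_rewards)
print(f"Approximate p-value: {approx_p:.4f}")

A small p-value means the observed gap would be very unlikely if both policies actually produced rewards from the same distribution.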

We can use the same principle with imitation learning algorithms. Let’s train a behavior cloning algorithm and see how it compares to the expert. This time, we can lower the bar to the standard “scientific” threshold of 0.05.

Like in the first tutorial, we will start by collecting some expert data. But to spice it up, let’s also get some data from the not-quite-expert.

from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv
import numpy as np

rng = np.random.default_rng()
expert_rollouts = rollout.rollout(
    expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
    rollout.make_sample_until(min_timesteps=None, min_episodes=50),
    rng=rng,
)
expert_transitions = rollout.flatten_trajectories(expert_rollouts)


not_expert_rollouts = rollout.rollout(
    not_expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
    rollout.make_sample_until(min_timesteps=None, min_episodes=50),
    rng=rng,
)
not_expert_transitions = rollout.flatten_trajectories(not_expert_rollouts)

Let’s try cloning an expert and a non-expert, and see how they compare.

from imitation.algorithms import bc

expert_bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=expert_transitions,
    rng=rng,
)

not_expert_bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=not_expert_transitions,
    rng=rng,
)
expert_bc_trainer.train(n_epochs=2)
not_expert_bc_trainer.train(n_epochs=2)
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.693     |
|    neglogp        | 0.694     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.693     |
|    neglogp        | 0.693     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
---------------------------------
bc_expert_rewards, _ = evaluate_policy(
    expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)
bc_not_expert_rewards, _ = evaluate_policy(
    not_expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)
significant = is_significant_reward_improvement(
    bc_not_expert_rewards, bc_expert_rewards, 0.05
)
print(f"Cloned expert rewards: {bc_expert_rewards}")
print(f"Cloned not-expert rewards: {bc_not_expert_rewards}")

print(
    f"Cloned expert is {'NOT ' if not significant else ''}significantly better than the cloned not-expert."
)
Cloned expert rewards: [121.0, 140.0, 155.0, 140.0, 124.0, 113.0, 139.0, 116.0, 108.0, 134.0]
Cloned not-expert rewards: [47.0, 102.0, 76.0, 56.0, 77.0, 103.0, 69.0, 80.0, 98.0, 65.0]
Cloned expert is significantly better than the cloned not-expert.

How about comparing the expert clone to the expert itself?

bc_clone_rewards, _ = evaluate_policy(
    expert_bc_trainer.policy, env, 10, return_episode_rewards=True
)

expert_rewards, _ = evaluate_policy(expert, env, 10, return_episode_rewards=True)

significant = is_significant_reward_improvement(bc_clone_rewards, expert_rewards, 0.05)

print(f"Cloned expert rewards: {bc_clone_rewards}")
print(f"Expert rewards: {expert_rewards}")

print(
    f"Expert is {'NOT ' if not significant else ''}significantly better than the cloned expert."
)
Cloned expert rewards: [108.0, 133.0, 158.0, 144.0, 136.0, 116.0, 115.0, 129.0, 117.0, 115.0]
Expert rewards: [140.0, 132.0, 154.0, 126.0, 121.0, 138.0, 175.0, 132.0, 132.0, 139.0]
Expert is NOT significantly better than the cloned expert.

Turns out the expert is not significantly better than the clone – again, in this specific case. Note, however, that this is not proof that the clone is as good as the expert – there’s a subtle difference between the two claims in the context of hypothesis testing: failing to find a significant improvement is not the same as demonstrating that the two policies perform equally well.
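
For intuition, you can also run the test in the other direction and ask whether the clone is a significant improvement over the expert. Whatever the outcome, a lack of significance in either direction still doesn’t establish equivalence:

# Reverse direction: is the clone a significant improvement over the expert?
reverse_significant = is_significant_reward_improvement(
    expert_rewards, bc_clone_rewards, 0.05
)
print(
    f"The clone is {'NOT ' if not reverse_significant else ''}significantly better than the expert."
)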

Note: if you changed the duration of the training at the beginning of this tutorial, you might get different results. While this might break the narrative in this tutorial, it’s a good learning opportunity.

When comparing the performance of two agents, algorithms, or hyperparameter sets, always remember the scope of what you’re testing. In this tutorial, we have one instance of an expert – but RL training is famously unstable, and another training run with another random seed would likely produce a slightly different result. Ideally, we would repeat this procedure several times, training the same agent with different random seeds, and then compare the average performance across those runs.
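
As a rough sketch of what that could look like, the snippet below trains one expert per seed and collects one mean return per seed (train_expert_with_seed is a hypothetical helper that just reuses the PPO configuration from the top of this tutorial); the per-seed means would then be fed into the same hypothesis test:

def train_expert_with_seed(seed, total_timesteps=10_000):
    # Hypothetical helper: same PPO configuration as above, different seed.
    model = PPO(
        policy=MlpPolicy,
        env=gym.make("CartPole-v1"),
        seed=seed,
        batch_size=64,
        ent_coef=0.0,
        learning_rate=0.0003,
        n_epochs=10,
        n_steps=64,
    )
    model.learn(total_timesteps)
    return model

# One mean return per seed; the hypothesis test then compares these per-seed
# means instead of per-episode returns from a single trained model.
seeds = [0, 1, 2, 3, 4]
mean_returns_per_seed = [
    evaluate_policy(train_expert_with_seed(s), env, 10)[0] for s in seeds
]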

Even then, this is just on one environment, with one algorithm. So be wary of generalizing your results too much.

We can also use the same method to compare different algorithms. While CartPole is pretty easy, we can make it more difficult by decreasing the number of episodes in our dataset, and generating them with a suboptimal policy:

rollouts = rollout.rollout(
    expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
    rollout.make_sample_until(min_timesteps=None, min_episodes=1),
    rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)

Let’s try training a behavior cloning algorithm on this dataset.

Note that DAgger gets to cheat a little bit – it’s allowed to query the expert policy for additional data during training. For the purposes of this tutorial, we’ll stick with this setup to avoid spending hours training an expert for a more complex environment.

So while this little experiment isn’t definitive proof that DAgger is better than BC, you can use the same method to compare any two algorithms.

from imitation.algorithms.dagger import SimpleDAggerTrainer
import tempfile

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
)

bc_trainer.train(n_epochs=1)


with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
    print(tmpdir)
    dagger_bc_trainer = bc.BC(
        observation_space=env.observation_space,
        action_space=env.action_space,
        rng=np.random.default_rng(),
    )
    dagger_trainer = SimpleDAggerTrainer(
        venv=DummyVecEnv([lambda: RolloutInfoWrapper(env)]),
        scratch_dir=tmpdir,
        expert_policy=expert,
        bc_trainer=dagger_bc_trainer,
        rng=np.random.default_rng(),
    )

    dagger_trainer.train(5000)
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.693     |
|    neglogp        | 0.694     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
---------------------------------
/tmp/dagger_example_o3r5tw84
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.692     |
|    neglogp        | 0.693     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 43        |
|    return_mean    | 25.8      |
|    return_min     | 17        |
|    return_std     | 10.2      |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000665 |
|    entropy        | 0.665     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 75.5      |
|    loss           | 0.544     |
|    neglogp        | 0.545     |
|    prob_true_act  | 0.586     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 123       |
|    return_mean    | 71        |
|    return_min     | 29        |
|    return_std     | 35        |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000251 |
|    entropy        | 0.251     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 88        |
|    loss           | 0.138     |
|    neglogp        | 0.138     |
|    prob_true_act  | 0.892     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 199       |
|    return_mean    | 163       |
|    return_min     | 115       |
|    return_std     | 28        |
---------------------------------
--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -0.00021 |
|    entropy        | 0.21     |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 99.7     |
|    loss           | 0.131    |
|    neglogp        | 0.131    |
|    prob_true_act  | 0.897    |
|    samples_so_far | 32       |
| rollout/          |          |
|    return_max     | 160      |
|    return_mean    | 143      |
|    return_min     | 123      |
|    return_std     | 11.9     |
--------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -6.09e-05 |
|    entropy        | 0.0609    |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 110       |
|    loss           | 0.0179    |
|    neglogp        | 0.0179    |
|    prob_true_act  | 0.983     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 134       |
|    return_mean    | 125       |
|    return_min     | 115       |
|    return_std     | 6.43      |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -4.73e-05 |
|    entropy        | 0.0473    |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 120       |
|    loss           | 0.0142    |
|    neglogp        | 0.0142    |
|    prob_true_act  | 0.986     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 137       |
|    return_mean    | 129       |
|    return_min     | 122       |
|    return_std     | 5.97      |
---------------------------------
--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -3.7e-05 |
|    entropy        | 0.037    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 129      |
|    loss           | 0.0105   |
|    neglogp        | 0.0105   |
|    prob_true_act  | 0.99     |
|    samples_so_far | 32       |
| rollout/          |          |
|    return_max     | 132      |
|    return_mean    | 127      |
|    return_min     | 120      |
|    return_std     | 4.2      |
--------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -1.07e-05 |
|    entropy        | 0.0107    |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 137       |
|    loss           | 0.00296   |
|    neglogp        | 0.00297   |
|    prob_true_act  | 0.997     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 149       |
|    return_mean    | 137       |
|    return_min     | 125       |
|    return_std     | 10.3      |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 500       |
|    ent_loss       | -5.54e-05 |
|    entropy        | 0.0554    |
|    epoch          | 3         |
|    l2_loss        | 0         |
|    l2_norm        | 145       |
|    loss           | 0.0339    |
|    neglogp        | 0.034     |
|    prob_true_act  | 0.973     |
|    samples_so_far | 16032     |
| rollout/          |           |
|    return_max     | 129       |
|    return_mean    | 122       |
|    return_min     | 112       |
|    return_std     | 6.56      |
---------------------------------
--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -1.5e-05 |
|    entropy        | 0.015    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 145      |
|    loss           | 0.00419  |
|    neglogp        | 0.0042   |
|    prob_true_act  | 0.996    |
|    samples_so_far | 32       |
| rollout/          |          |
|    return_max     | 216      |
|    return_mean    | 158      |
|    return_min     | 125      |
|    return_std     | 38.2     |
--------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 500       |
|    ent_loss       | -1.71e-05 |
|    entropy        | 0.0171    |
|    epoch          | 3         |
|    l2_loss        | 0         |
|    l2_norm        | 152       |
|    loss           | 0.00437   |
|    neglogp        | 0.00439   |
|    prob_true_act  | 0.996     |
|    samples_so_far | 16032     |
| rollout/          |           |
|    return_max     | 165       |
|    return_mean    | 140       |
|    return_min     | 120       |
|    return_std     | 16.1      |
---------------------------------

After training both BC and DAgger, let’s compare their performances again! We expect DAgger to be better – after all, it’s a more advanced algorithm. But is it significantly better?

bc_rewards, _ = evaluate_policy(bc_trainer.policy, env, 10, return_episode_rewards=True)
dagger_rewards, _ = evaluate_policy(
    dagger_trainer.policy, env, 10, return_episode_rewards=True
)

significant = is_significant_reward_improvement(bc_rewards, dagger_rewards, 0.05)
print(f"BC rewards: {bc_rewards}")
print(f"DAgger rewards: {dagger_rewards}")

print(
    f"Our DAgger agent is {'NOT ' if not significant else ''}significantly better than BC."
)
BC rewards: [82.0, 69.0, 68.0, 115.0, 98.0, 80.0, 97.0, 118.0, 62.0, 78.0]
DAgger rewards: [126.0, 135.0, 118.0, 141.0, 149.0, 129.0, 177.0, 126.0, 121.0, 130.0]
Our DAgger agent is significantly better than BC.

If you increased the number of training iterations for the expert (in the first cell of the tutorial), you should see that DAgger indeed performs better than BC. If you didn’t, you might well see a different result. Yet another reason to be careful when interpreting results!

Finally, let’s take a moment to remember the limitations of this experiment. We’re comparing two algorithms on one environment, with one dataset. We’re also using a suboptimal expert policy, which might not be the best choice for BC. If you want to convince yourself that DAgger is better than BC, you should pick a more complex environment, run the experiment several times with different random seeds, and perform some hyperparameter optimization to make sure you’re not just using unlucky hyperparameters. Finally, you would also need to run the same hypothesis test across the average returns of those independent runs.

But now you have all the pieces of the puzzle to do that!