
Train an Agent using the DAgger Algorithm#

The DAgger algorithm is an extension of behavior cloning. In behavior cloning, the training trajectories are recorded directly from an expert. In DAgger, the learner generates the trajectories, and an expert relabels each visited state with the optimal action. This ensures that the state distribution of the training data matches the distribution induced by the learner’s current policy, which avoids the compounding errors plain behavior cloning suffers from when the learner drifts into states the expert never visited.

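Conceptually, each DAgger round rolls out the current learner policy, queries the expert for the correct action in every visited state, adds those labeled states to an aggregated dataset, and retrains the learner on everything collected so far. Below is a minimal, framework-free sketch of that loop; the `initial_policy`, `rollout_states`, `expert_action`, and `train_policy_on` parameters are hypothetical placeholders, not part of the imitation API.

def dagger_sketch(env, initial_policy, rollout_states, expert_action, train_policy_on, n_rounds=10):
    """Hypothetical sketch of the DAgger loop, not the imitation library's implementation."""
    dataset = []
    policy = initial_policy()
    for _ in range(n_rounds):
        # Roll out the *learner's* current policy to collect visited states.
        states = rollout_states(policy, env)
        # Ask the expert to label each visited state with the correct action.
        dataset += [(state, expert_action(state)) for state in states]
        # Retrain the policy on the aggregated dataset via behavior cloning.
        policy = train_policy_on(dataset)
    return policy
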
First we need an expert to learn from. For convenience we download one from the HuggingFace model hub.

import numpy as np
import gymnasium as gym
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env

# Create a vectorized CartPole environment; the seals variant has a fixed episode length.
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=np.random.default_rng(),
    n_envs=1,
)

# Download a pre-trained PPO expert for this environment from the HuggingFace model hub.
expert = load_policy(
    "ppo-huggingface",
    organization="HumanCompatibleAI",
    env_name="seals/CartPole-v0",
    venv=env,
)
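
If you want to confirm that the downloaded expert is actually a good demonstrator, you could evaluate it directly with stable-baselines3’s `evaluate_policy` (the same helper used at the end of this notebook). This optional sanity check is not part of the original example.

from stable_baselines3.common.evaluation import evaluate_policy

# Optional sanity check: the expert should achieve close to the maximum return of 500.
expert_reward, _ = evaluate_policy(expert, env, 10)
print(expert_reward)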

Then we can construct a DAgger trainer and use it to train the policy on the CartPole environment.

import tempfile

from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer

# The BC trainer performs the supervised learning step on the aggregated dataset.
bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    rng=np.random.default_rng(),
)

# DAgger needs a scratch directory to store the collected demonstrations.
with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
    print(tmpdir)
    dagger_trainer = SimpleDAggerTrainer(
        venv=env,
        scratch_dir=tmpdir,
        expert_policy=expert,
        bc_trainer=bc_trainer,
        rng=np.random.default_rng(),
    )

    # Train for 2000 environment timesteps.
    dagger_trainer.train(2000)
/tmp/dagger_example_gqkwws0y
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.692     |
|    neglogp        | 0.692     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 23        |
|    return_mean    | 16.8      |
|    return_min     | 9         |
|    return_std     | 4.75      |
---------------------------------
---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000356 |
|    entropy        | 0.356     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 86.6      |
|    loss           | 0.269     |
|    neglogp        | 0.269     |
|    prob_true_act  | 0.797     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 96        |
|    return_mean    | 72.6      |
|    return_min     | 47        |
|    return_std     | 17.3      |
---------------------------------

Finally, the evaluation shows that we actually trained a policy that solves the environment (500 is the maximum achievable reward).

from stable_baselines3.common.evaluation import evaluate_policy

# Evaluate the trained policy over 20 episodes and print the mean return.
reward, _ = evaluate_policy(dagger_trainer.policy, env, 20)
print(reward)
500.0