
Train an Agent using Soft Q Imitation Learning

Soft Q Imitation Learning (SQIL) is a simple algorithm that can be used to clone expert behavior. It’s essentially a modification of the DQN algorithm: at each training step, whenever we sample a batch of transitions from the replay buffer, we also sample a batch of expert transitions. Expert demonstrations are assigned a reward of 1, while the agent’s own transitions are assigned a reward of 0. This encourages the agent to imitate the expert’s behavior and to steer back toward demonstrated states when it encounters unfamiliar ones.
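
To make the reward relabeling concrete, here is a minimal, library-agnostic sketch of how a SQIL-style training batch could be assembled. The names (make_transitions, expert_batch, agent_batch, mixed_batch) are purely illustrative and do not correspond to imitation’s internal API.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative transitions: in CartPole, observations are 4-dimensional
# and actions are in {0, 1}.
def make_transitions(n):
    return {
        "obs": rng.normal(size=(n, 4)).astype(np.float32),
        "acts": rng.integers(0, 2, size=n),
        "next_obs": rng.normal(size=(n, 4)).astype(np.float32),
    }

expert_batch = make_transitions(32)  # sampled from the expert demonstrations
agent_batch = make_transitions(32)  # sampled from the agent's own replay buffer

# SQIL's key trick: relabel rewards instead of learning a reward model.
expert_batch["rews"] = np.ones(32, dtype=np.float32)  # expert transitions -> reward 1
agent_batch["rews"] = np.zeros(32, dtype=np.float32)  # agent transitions -> reward 0

# The DQN update is then performed on the combined batch.
mixed_batch = {
    key: np.concatenate([expert_batch[key], agent_batch[key]])
    for key in expert_batch
}
print(mixed_batch["rews"])  # first half ones, second half zeros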

In this tutorial we will use the imitation library to train an agent using SQIL.

First, we need some expert trajectories in our environment (CartPole-v1). Note that you can use other environments, but the action space must be discrete for this algorithm, since the underlying DQN learner only supports discrete actions.

import datasets
from stable_baselines3.common.vec_env import DummyVecEnv

from imitation.data import huggingface_utils

# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-CartPole-v1")

# Convert the dataset to a format usable by the imitation library.
expert_trajectories = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])
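
Each element of the resulting sequence is a trajectory object. As a quick sanity check, you can inspect the first one (the obs, acts, and rews attribute names below are assumed from imitation’s trajectory types; verify them against your installed version):

first_trajectory = expert_trajectories[0]
print(f"Number of observations: {len(first_trajectory.obs)}")
print(f"Number of actions: {len(first_trajectory.acts)}")
print(f"Total reward: {first_trajectory.rews.sum()}")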

Let’s quickly check whether the expert is any good. It should usually reach a return of 500, the maximum achievable value in this environment.

from imitation.data import rollout

trajectory_stats = rollout.rollout_stats(expert_trajectories)

print(
    f"We have {trajectory_stats['n_traj']} trajectories. "
    f"The average length of each trajectory is {trajectory_stats['len_mean']}. "
    f"The average return of each trajectory is {trajectory_stats['return_mean']}."
)
We have 100 trajectories. The average length of each trajectory is 500.0. The average return of each trajectory is 500.0.

After collecting our expert trajectories, it’s time to set up our imitation algorithm.

from imitation.algorithms import sqil
import gymnasium as gym

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
sqil_trainer = sqil.SQIL(
    venv=venv,
    demonstrations=expert_trajectories,
    policy="MlpPolicy",
)
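
Under the hood, SQIL wraps a standard off-policy learner (DQN by default). Depending on your version of imitation, the constructor may also accept arguments for tuning that learner; the rl_algo_class and rl_kwargs names below are assumptions, so double-check them against the SQIL API reference for your installed version:

from stable_baselines3 import DQN

# Assumed keyword arguments (rl_algo_class, rl_kwargs); verify against your
# installed version of imitation before relying on them.
tuned_sqil_trainer = sqil.SQIL(
    venv=venv,
    demonstrations=expert_trajectories,
    policy="MlpPolicy",
    rl_algo_class=DQN,  # the off-policy learner to wrap (DQN is the default)
    rl_kwargs=dict(learning_rate=1e-3),  # hyperparameters forwarded to DQN
)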

As you can see, the untrained policy only achieves poor rewards:

from stable_baselines3.common.evaluation import evaluate_policy

reward_before_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)
print(f"Reward before training: {reward_before_training}")
Reward before training: 8.8

After sufficient training, the agent should match the expert’s reward of 500:

sqil_trainer.train(
    total_timesteps=1_000,
)  # Note: set to 1_000_000 to obtain good results
reward_after_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)
print(f"Reward after training: {reward_after_training}")
Reward after training: 9.2
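
Once you are happy with the results, you can keep the trained policy. sqil_trainer.policy is a standard Stable Baselines3 policy, so the usual save and predict methods should be available (the file name below is arbitrary):

# Persist the trained Q-network policy; "sqil_policy.zip" is an arbitrary name.
sqil_trainer.policy.save("sqil_policy.zip")

# Query the policy directly, e.g. for further evaluation or deployment.
obs = venv.reset()
actions, _ = sqil_trainer.policy.predict(obs, deterministic=True)
print(f"Action chosen for the first observation: {actions[0]}")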