Train an Agent using Soft Q Imitation Learning with SAC
In the previous tutorial, we used Soft Q Imitation Learning (SQIL) on top of the DQN base algorithm. In fact, SQIL can be combined with any off-policy algorithm from stable_baselines3. Here, we train a Pendulum agent using SQIL + SAC.
First, we need some expert trajectories in our environment (Pendulum-v1).
Note that you can use other environments, but the action space must be continuous.
import datasets
from imitation.data import huggingface_utils
# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-Pendulum-v1")
# Convert the dataset to a format usable by the imitation library.
expert_trajectories = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])
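Each element of this sequence stores the observations, actions, and rewards of one expert episode, with one more observation than actions (the final observation closes the episode). The pure-Python sketch below (the `ToyTrajectory` class is hypothetical, for illustration only, not the library's actual trajectory type) shows this invariant and how an episode return is computed:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for one recorded episode."""

    obs: List[float]
    acts: List[float]
    rews: List[float]

    def __post_init__(self) -> None:
        # Invariant: one more observation than actions, and exactly
        # one reward per action.
        assert len(self.obs) == len(self.acts) + 1
        assert len(self.rews) == len(self.acts)

    @property
    def episode_return(self) -> float:
        # The (undiscounted) return is the sum of per-step rewards.
        return sum(self.rews)


traj = ToyTrajectory(obs=[0.0, 0.1, 0.3], acts=[1.0, -1.0], rews=[-0.5, -0.25])
print(traj.episode_return)  # -0.75
```

The `return_mean` statistic printed below is simply this per-episode return averaged over all trajectories.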
Let’s quickly check if the expert is any good.
from imitation.data import rollout
trajectory_stats = rollout.rollout_stats(expert_trajectories)
print(
    f"We have {trajectory_stats['n_traj']} trajectories. "
    f"The average length of each trajectory is {trajectory_stats['len_mean']}. "
    f"The average return of each trajectory is {trajectory_stats['return_mean']}."
)
We have 200 trajectories. The average length of each trajectory is 200.0. The average return of each trajectory is -205.22814517737746.
Having collected our expert trajectories, it’s time to set up our imitation algorithm.
from imitation.algorithms import sqil
from imitation.util.util import make_vec_env
import numpy as np
from stable_baselines3 import sac
SEED = 42
venv = make_vec_env(
    "Pendulum-v1",
    rng=np.random.default_rng(seed=SEED),
)
sqil_trainer = sqil.SQIL(
    venv=venv,
    demonstrations=expert_trajectories,
    policy="MlpPolicy",
    rl_algo_class=sac.SAC,
    rl_kwargs=dict(seed=SEED),
)
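Under the hood, SQIL is just the chosen off-policy algorithm (here SAC) trained on a relabeled replay buffer: expert transitions are stored with a constant reward of 1, while transitions the agent collects itself are stored with reward 0. The sketch below illustrates that labeling rule in pure Python (the `Transition` tuples and `sqil_relabel` helper are hypothetical, not the library's actual buffer implementation):

```python
from typing import List, Tuple

# A transition is (obs, act, next_obs); SQIL ignores the environment reward.
Transition = Tuple[float, float, float]


def sqil_relabel(
    expert: List[Transition], agent: List[Transition]
) -> List[Tuple[Transition, float]]:
    """Hypothetical sketch of SQIL's reward relabeling.

    Expert demonstrations get a constant reward of 1.0 and the agent's own
    transitions get 0.0; a standard off-policy algorithm (DQN, SAC, ...)
    then trains on the combined buffer as if these were real rewards.
    """
    buffer = [(t, 1.0) for t in expert]  # demonstrations: reward 1
    buffer += [(t, 0.0) for t in agent]  # agent rollouts: reward 0
    return buffer


expert_batch = [(0.0, 1.0, 0.1), (0.1, -1.0, 0.0)]
agent_batch = [(0.5, 0.2, 0.6)]
buffer = sqil_relabel(expert_batch, agent_batch)
print([r for _, r in buffer])  # [1.0, 1.0, 0.0]
```

This is why SQIL works with any off-policy learner: the imitation signal lives entirely in the relabeled rewards, not in the algorithm itself.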
As you can see, the untrained policy earns only a poor reward, far below the expert's average return of about -205:
from stable_baselines3.common.evaluation import evaluate_policy
reward_before_training, _ = evaluate_policy(sqil_trainer.policy, venv, 100)
print(f"Reward before training: {reward_before_training}")
Reward before training: -1386.1941136000003
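`evaluate_policy` runs the policy for the requested number of evaluation episodes and returns the mean (and, by default, the standard deviation) of the episode returns. Conceptually it boils down to the loop below, a simplified pure-Python sketch in which `run_episode` is a hypothetical stand-in that plays one full episode and returns its per-step rewards (the real function also handles vectorized environments, determinism flags, and more):

```python
import statistics
from typing import Callable, List, Tuple


def evaluate_policy_sketch(
    run_episode: Callable[[], List[float]], n_eval_episodes: int
) -> Tuple[float, float]:
    """Simplified sketch: play n episodes, return mean and std of returns."""
    returns = [sum(run_episode()) for _ in range(n_eval_episodes)]
    return statistics.mean(returns), statistics.stdev(returns)


# Stub: alternate between two fixed episodes so the example is deterministic.
episodes = iter([[-1.0, -2.0], [-3.0, -1.0]] * 50)
mean_ret, std_ret = evaluate_policy_sketch(lambda: next(episodes), 100)
print(mean_ret)  # -3.5
```

Evaluating over 100 episodes, as the tutorial does, keeps the reported mean from being dominated by a single lucky or unlucky rollout.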
After training, we can observe that the agent has improved, although with so few timesteps it does not reach expert performance in this case.
sqil_trainer.train(
    total_timesteps=1000,
)  # Note: set to 300_000 to obtain good results
reward_after_training, _ = evaluate_policy(sqil_trainer.policy, venv, 100)
print(f"Reward after training: {reward_after_training}")
Reward after training: -1217.9038355900002