Behavioral Cloning (BC)#

Behavioral cloning directly learns a policy via supervised learning on observation-action pairs from expert demonstrations. It is a simple approach to learning a policy, but because that policy is only trained on states the expert visited, small mistakes can push it into unfamiliar states where its errors compound: the learned policy often generalizes poorly and does not recover well from errors.
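
Concretely, training reduces to maximizing the log-likelihood of the expert's actions under the learned policy. The snippet below is a minimal sketch of that objective for a Stable Baselines3 ActorCriticPolicy; it illustrates the idea rather than the library's exact implementation, with the entropy and L2 terms mirroring the ent_weight and l2_weight parameters documented under API below.

import torch

def bc_objective(policy, obs, expert_acts, ent_weight=1e-3, l2_weight=0.0):
    # Sketch of the BC loss: negative log-likelihood of the expert actions,
    # minus an entropy bonus, plus L2 regularization on the policy weights.
    _, log_prob, entropy = policy.evaluate_actions(obs, expert_acts)
    neg_log_likelihood = -log_prob.mean()
    entropy_bonus = entropy.mean() if entropy is not None else torch.tensor(0.0)
    l2_norm = sum(p.pow(2).sum() for p in policy.parameters())
    return neg_log_likelihood - ent_weight * entropy_bonus + l2_weight * l2_norm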

Alternatives to behavioral cloning include DAgger (similar, but it gathers on-policy demonstrations by querying the expert on states the learner visits) and GAIL/AIRL (more robust approaches to learning from demonstrations).

Example#

Detailed example notebook: Train an Agent using Behavior Cloning

import numpy as np
import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy

from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    n_envs=1,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],  # for computing rollouts
)
# Download a pretrained PPO expert policy from the HuggingFace model hub.
expert = load_policy(
    "ppo-huggingface",
    organization="HumanCompatibleAI",
    env_name="seals-CartPole-v0",
    venv=env,
)
# Roll out the expert to collect 50 episodes of demonstrations.
rollouts = rollout.rollout(
    expert,
    env,
    rollout.make_sample_until(min_timesteps=None, min_episodes=50),
    rng=rng,
)
# Flatten the trajectories into individual observation-action transitions.
transitions = rollout.flatten_trajectories(rollouts)

# Train a policy to imitate the expert transitions via supervised learning.
bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
)
bc_trainer.train(n_epochs=1)
# Evaluate the imitating policy over 10 episodes.
reward, _ = evaluate_policy(bc_trainer.policy, env, 10)
print("Reward:", reward)

API#

class imitation.algorithms.bc.BC(*, observation_space, action_space, rng, policy=None, demonstrations=None, batch_size=32, minibatch_size=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, ent_weight=0.001, l2_weight=0.0, device='auto', custom_logger=None)[source]

Bases: DemonstrationAlgorithm

Behavioral cloning (BC).

Recovers a policy via supervised learning from observation-action pairs.

__init__(*, observation_space, action_space, rng, policy=None, demonstrations=None, batch_size=32, minibatch_size=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, ent_weight=0.001, l2_weight=0.0, device='auto', custom_logger=None)[source]

Builds BC.

Parameters
  • observation_space (Space) – the observation space of the environment.

  • action_space (Space) – the action space of the environment.

  • rng (Generator) – the random number generator to use.

  • policy (Optional[ActorCriticPolicy]) – a Stable Baselines3 policy; if unspecified, defaults to FeedForward32Policy.

  • demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Transitions expressed directly as a types.TransitionsMinimal object, a sequence of trajectories, or an iterable of transition batches (mappings from keywords to arrays containing observations, etc).

  • batch_size (int) – The number of samples in each batch of expert data.

  • minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.

  • optimizer_cls (Type[Optimizer]) – optimiser to use for supervised training.

  • optimizer_kwargs (Optional[Mapping[str, Any]]) – keyword arguments, excluding learning rate and weight decay, for optimiser construction.

  • ent_weight (float) – scaling applied to the policy’s entropy regularization.

  • l2_weight (float) – scaling applied to the policy’s L2 regularization.

  • device (Union[str, device]) – name/identity of device to place policy on.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

Raises

ValueError – If weight_decay is specified in optimizer_kwargs (use the parameter l2_weight instead), or if the batch size is not a multiple of the minibatch size.
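
As an illustration, the constructor call below (a sketch reusing env, transitions, and rng from the example above; the hyperparameter values are arbitrary) accumulates gradients over two minibatches per optimization step and applies L2 regularization through l2_weight rather than the optimizer's weight_decay:

import torch

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
    batch_size=64,                    # expert samples per optimization step
    minibatch_size=32,                # must be a factor of batch_size; gradients are accumulated
    optimizer_cls=torch.optim.Adam,
    optimizer_kwargs=dict(eps=1e-5),  # do not pass weight_decay here; use l2_weight instead
    ent_weight=1e-3,
    l2_weight=1e-5,
)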

allow_variable_horizon: bool

If True, allow variable-horizon trajectories; otherwise, raise an error if one is detected.

property policy: ActorCriticPolicy

Returns a policy imitating the demonstration data.

Return type

ActorCriticPolicy

set_demonstrations(demonstrations)[source]

Sets the demonstration data.

Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.

Parameters

demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterable that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.

Return type

None
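
For example, an interactive training loop might collect a fresh batch of demonstrations and swap it in before continuing training (a sketch reusing expert, env, rng, and bc_trainer from the example above):

# Collect 10 more expert episodes and replace the trainer's demonstration data.
new_rollouts = rollout.rollout(
    expert,
    env,
    rollout.make_sample_until(min_timesteps=None, min_episodes=10),
    rng=rng,
)
bc_trainer.set_demonstrations(rollout.flatten_trajectories(new_rollouts))
bc_trainer.train(n_epochs=1)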

train(*, n_epochs=None, n_batches=None, on_epoch_end=None, on_batch_end=None, log_interval=500, log_rollouts_venv=None, log_rollouts_n_episodes=5, progress_bar=True, reset_tensorboard=False)[source]

Train with supervised learning for some number of epochs.

Here an ‘epoch’ is just a complete pass through the expert demonstration data, as set via set_demonstrations(). Note that if n_batches is smaller than the number of batches in an epoch, the on_epoch_end callback will never be called.

Parameters
  • n_epochs (Optional[int]) – Number of complete passes made through expert data before ending training. Provide exactly one of n_epochs and n_batches.

  • n_batches (Optional[int]) – Number of batches loaded from dataset before ending training. Provide exactly one of n_epochs and n_batches.

  • on_epoch_end (Optional[Callable[[], None]]) – Optional callback with no parameters to run at the end of each epoch.

  • on_batch_end (Optional[Callable[[], None]]) – Optional callback with no parameters to run at the end of each batch.

  • log_interval (int) – Log stats after every log_interval batches.

  • log_rollouts_venv (Optional[VecEnv]) – If not None, then this VecEnv (whose observation and actions spaces must match self.observation_space and self.action_space) is used to generate rollout stats, including average return and average episode length. If None, then no rollouts are generated.

  • log_rollouts_n_episodes (int) – Number of rollouts to generate when calculating rollout stats. Non-positive number disables rollouts.

  • progress_bar (bool) – If True, then show a progress bar during training.

  • reset_tensorboard (bool) – If True, then start plotting to Tensorboard from x=0 even if .train() logged to Tensorboard previously. Has no practical effect if .train() is being called for the first time.
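
For example, a run that trains for a fixed number of batches and periodically logs rollout statistics against the vectorized environment from the example above might look like this (a sketch; the numbers are arbitrary):

bc_trainer.train(
    n_batches=2000,              # provide exactly one of n_epochs / n_batches
    log_interval=200,            # log stats every 200 batches
    log_rollouts_venv=env,       # generate rollout stats in the example's VecEnv
    log_rollouts_n_episodes=5,
    on_batch_end=lambda: None,   # no-op hook; useful for e.g. checkpointing
    progress_bar=True,
)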