imitation.algorithms.density#

Density-based baselines for imitation learning.

Each of these algorithms fits a density model to some aspect of the demonstrations, then rewards the agent according to how likely its behaviour is under that model.

Classes

DensityAlgorithm(*, demonstrations, venv, rng)

Learns a reward function based on density modeling.

DensityType(value)

Input type the density model should use.

class imitation.algorithms.density.DensityAlgorithm(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]#

Bases: DemonstrationAlgorithm

Learns a reward function based on density modeling.

Specifically, it constructs a non-parametric estimate of p(s), p(s,a), or p(s,s') (depending on density_type), and then uses the log of the estimated density as the reward.
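
For intuition, the core computation can be sketched with scikit-learn's KernelDensity (the same estimator configured by the kernel and kernel_bandwidth parameters below). This is a conceptual sketch with made-up arrays, not the library's internal code:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Stand-in demonstration data: 100 concatenated (state, action) vectors,
# e.g. 4-dim states and 1-dim actions (shapes are illustrative only).
expert_sa = rng.normal(size=(100, 5))

# Fit a kernel density estimate to the expert (s, a) pairs...
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(expert_sa)

# ...then use the log-density of queried (s, a) pairs as their rewards.
query_sa = rng.normal(size=(10, 5))
rewards = kde.score_samples(query_sa)  # one log p(s, a) value per query
```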

__init__(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]#

Builds DensityAlgorithm.

Parameters
  • demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – expert demonstration trajectories.

  • density_type (DensityType) – type of density to train on: single state, state-action pairs, or state-state pairs.

  • kernel (str) – kernel to use for density estimation with sklearn.KernelDensity.

  • kernel_bandwidth (float) – bandwidth of the kernel. If standardise_inputs is True and you are using a Gaussian kernel, then it probably makes sense to set this somewhere between 0.1 and 1.

  • venv (VecEnv) – The environment to learn a reward model in. We don’t actually need any environment interaction to fit the reward model, but we use this to extract the observation and action space, and to train the RL algorithm rl_algo (if specified).

  • rng (Generator) – random state for sampling from demonstrations.

  • rl_algo (Optional[BaseAlgorithm]) – An RL algorithm to train on the resulting reward model (optional).

  • is_stationary (bool) – if True, share the same density model for all timesteps; if False, use a different density model for each timestep. A non-stationary model is particularly likely to be useful with STATE_DENSITY, to encourage the agent to imitate entire trajectories rather than just a few states that occur frequently in the demonstration dataset. If non-stationary, demonstrations must be trajectories, not transitions (which do not contain timesteps).

  • standardise_inputs (bool) – if True, then the inputs to the reward model will be standardised to have zero mean and unit variance over the demonstration trajectories. Otherwise, inputs will be passed to the reward model with their ordinary scale.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
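
A minimal construction sketch is shown below. Here expert_trajectories is a placeholder for a sequence of expert Trajectory objects gathered beforehand, and an SB3 PPO learner is used as rl_algo; the environment name and hyperparameters are illustrative assumptions:

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from imitation.algorithms import density

rng = np.random.default_rng(42)
venv = make_vec_env("CartPole-v1", n_envs=4)

# `expert_trajectories`: placeholder for demonstrations collected elsewhere,
# e.g. rollouts of a trained expert policy.
density_trainer = density.DensityAlgorithm(
    demonstrations=expert_trajectories,
    venv=venv,
    rng=rng,
    density_type=density.DensityType.STATE_ACTION_DENSITY,
    kernel="gaussian",
    kernel_bandwidth=0.4,  # inputs are standardised, so a value in [0.1, 1]
    rl_algo=PPO("MlpPolicy", venv),
    standardise_inputs=True,
)
```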

buffering_wrapper: BufferingWrapper#
density_type: DensityType#
is_stationary: bool#
kernel: str#
kernel_bandwidth: float#
property policy: BasePolicy#

Returns a policy imitating the demonstration data.

Return type

BasePolicy

rl_algo: Optional[BaseAlgorithm]#
set_demonstrations(demonstrations)[source]#

Sets the demonstration data.

Return type

None

standardise: bool#
test_policy(*, n_trajectories=10, true_reward=True)[source]#

Test the current imitation policy on the environment and return rollout statistics.

Parameters
  • n_trajectories (int) – number of rolled-out trajectories.

  • true_reward (bool) – whether to use the ground-truth reward from the underlying environment (True) or the learned imitation reward (False).

Returns

rollout statistics collected by imitation.utils.rollout.rollout_stats().

Return type

dict

train()[source]#

Fits the density model to the demonstration data in self.transitions.

Return type

None

train_policy(n_timesteps=1000000, **kwargs)[source]#

Train the imitation policy for a given number of timesteps.

Parameters
  • n_timesteps (int) – number of timesteps to train the policy for.

  • kwargs (dict) – extra arguments that will be passed to the learn() method of the RL algorithm rl_algo. Refer to the Stable Baselines documentation for details.

Return type

None
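
Continuing the construction sketch above, a typical fit/train/evaluate sequence might look as follows (the "return_mean" key is an assumption about the rollout_stats output):

```python
# Fit the density model to the demonstration data.
density_trainer.train()

# Train the inner RL algorithm (rl_algo) against the learned density reward.
density_trainer.train_policy(n_timesteps=100_000)

# Roll out the resulting policy and report ground-truth environment returns.
stats = density_trainer.test_policy(n_trajectories=10, true_reward=True)
print(stats["return_mean"])  # key name assumed from rollout_stats
```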

transitions: Dict[Optional[int], ndarray]#
venv: VecEnv#
venv_wrapped: RewardVecEnvWrapper#
wrapper_callback: WrappedRewardCallback#
class imitation.algorithms.density.DensityType(value)[source]#

Bases: Enum

Input type the density model should use.

STATE_ACTION_DENSITY = 2#

Density on (s,a) pairs.

STATE_DENSITY = 1#

Density on state s.

STATE_STATE_DENSITY = 3#

Density on (s,s’) pairs.
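
For example, the non-stationary state-only configuration discussed under is_stationary could be selected as sketched below (expert_trajectories, venv, and rng are placeholders, as in the earlier sketches):

```python
from imitation.algorithms.density import DensityAlgorithm, DensityType

# Per-timestep density over states only. Demonstrations must be full
# trajectories (not flattened transitions) so timestep indices are available.
trainer = DensityAlgorithm(
    demonstrations=expert_trajectories,  # placeholder, as above
    venv=venv,
    rng=rng,
    density_type=DensityType.STATE_DENSITY,
    is_stationary=False,
)
```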