imitation.algorithms.density#
Density-based baselines for imitation learning.
Each of these algorithms learns a density estimate on some aspect of the demonstrations, then rewards the agent for visiting regions that are likely under that estimate.
Classes

- DensityAlgorithm: Learns a reward function based on density modeling.
- DensityType: Input type the density model should use.
- class imitation.algorithms.density.DensityAlgorithm(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]#
Bases:
DemonstrationAlgorithm
Learns a reward function based on density modeling.
Specifically, it constructs a non-parametric estimate of p(s), p(s,a), or p(s,s'), and then computes a reward from the log of these probabilities.
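As an illustrative sketch of the idea (not the library's actual implementation): fit a kernel density estimate on standardised state-action pairs from the demonstrations, then score new pairs by their log-density. All data and names below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical demonstration data: 200 state-action pairs clustered near (1, 1).
demo_obs_acts = rng.normal(loc=1.0, scale=0.2, size=(200, 2))

# Standardise inputs over the demonstrations (cf. standardise_inputs=True).
mean, std = demo_obs_acts.mean(axis=0), demo_obs_acts.std(axis=0)
standardised = (demo_obs_acts - mean) / std

# Non-parametric estimate of p(s, a), analogous to kernel='gaussian'
# with kernel_bandwidth=0.5 in DensityAlgorithm.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(standardised)

def reward(obs_act: np.ndarray) -> float:
    """Reward = log p(s, a) of a single (state, action) pair."""
    x = (obs_act - mean) / std
    return kde.score_samples(x.reshape(1, -1))[0]

in_dist = reward(np.array([1.0, 1.0]))    # close to the demonstrations
out_dist = reward(np.array([5.0, -3.0]))  # far from the demonstrations
```

Behaviour resembling the demonstrations therefore receives higher reward than out-of-distribution behaviour.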
- __init__(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)[source]#
Builds DensityAlgorithm.
- Parameters
  - demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – expert demonstration trajectories.
  - density_type (DensityType) – type of density to train on: single states, state-action pairs, or state-state pairs.
  - kernel (str) – kernel to use for density estimation with sklearn.KernelDensity.
  - kernel_bandwidth (float) – bandwidth of the kernel. If standardise_inputs is true and you are using a Gaussian kernel, then it probably makes sense to set this somewhere between 0.1 and 1.
  - venv (VecEnv) – the environment to learn a reward model in. No environment interaction is needed to fit the reward model, but venv is used to extract the observation and action spaces, and to train the RL algorithm rl_algo (if specified).
  - rng (Generator) – random state for sampling from demonstrations.
  - rl_algo (Optional[BaseAlgorithm]) – an RL algorithm to train on the resulting reward model (optional).
  - is_stationary (bool) – if True, share the same density models across all timesteps; if False, use a different density model for each timestep. A non-stationary model is particularly likely to be useful with STATE_DENSITY, to encourage the agent to imitate entire trajectories rather than just a few states that have high frequency in the demonstration dataset. If non-stationary, demonstrations must be trajectories, not transitions (which do not contain timesteps).
  - standardise_inputs (bool) – if True, inputs to the reward model are standardised to have zero mean and unit variance over the demonstration trajectories; otherwise, inputs are passed to the reward model at their original scale.
  - custom_logger (Optional[HierarchicalLogger]) – where to log to; if None (default), creates a new logger.
  - allow_variable_horizon (bool) – if False (default), the algorithm raises an exception if it detects trajectories of different lengths during training; if True, this safety check is overridden. WARNING: variable-horizon episodes leak information about the reward via the termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
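The density_type option determines which features the density model sees. A sketch of how the inputs might be assembled from a trajectory (illustrative only, not the library's code):

```python
import numpy as np

# Hypothetical trajectory: T+1 observations and T actions, with T=3.
obs = np.arange(8.0).reshape(4, 2)   # shape (T+1, obs_dim)
acts = np.ones((3, 1))               # shape (T, act_dim)

# STATE_DENSITY: model p(s) from states alone.
state_inputs = obs[:-1]

# STATE_ACTION_DENSITY (the default): model p(s, a) from concatenated pairs.
state_action_inputs = np.concatenate([obs[:-1], acts], axis=1)

# STATE_STATE_DENSITY: model p(s, s') from consecutive state pairs.
state_state_inputs = np.concatenate([obs[:-1], obs[1:]], axis=1)
```

Each row of the resulting array is one sample for the kernel density estimator.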
- buffering_wrapper: BufferingWrapper#
- density_type: DensityType#
- is_stationary: bool#
- kernel: str#
- kernel_bandwidth: float#
- property policy: BasePolicy#
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- rl_algo: Optional[BaseAlgorithm]#
- standardise: bool#
- test_policy(*, n_trajectories=10, true_reward=True)[source]#
Test the current imitation policy on the environment and return rollout statistics.
- Parameters
  - n_trajectories (int) – number of trajectories to roll out.
  - true_reward (bool) – whether to use the ground-truth reward from the underlying environment (True) or the imitation reward (False).
- Returns
  rollout statistics collected by imitation.utils.rollout.rollout_stats().
- Return type
  dict
- train_policy(n_timesteps=1000000, **kwargs)[source]#
Train the imitation policy for a given number of timesteps.
- Parameters
  - n_timesteps (int) – number of timesteps to train the policy for.
  - kwargs (dict) – extra arguments that will be passed to the learn() method of the imitation RL model. Refer to the Stable Baselines docs for details.
- Return type
None
- transitions: Dict[Optional[int], ndarray]#
- venv: VecEnv#
- venv_wrapped: RewardVecEnvWrapper#
- wrapper_callback: WrappedRewardCallback#
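When is_stationary=False, a separate density model is kept for each timestep (the transitions mapping above is likewise keyed by timestep, with None used in the stationary case). A minimal illustrative sketch of the per-timestep scheme, not the library's code:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
horizon, n_demos, obs_dim = 5, 50, 2

# Hypothetical fixed-horizon demonstrations whose states drift over time:
# at timestep t, states cluster around the value t.
demos = np.stack([
    t + rng.normal(scale=0.1, size=(n_demos, obs_dim)) for t in range(horizon)
])  # shape (horizon, n_demos, obs_dim)

# One density model per timestep, as in the non-stationary case.
models = {
    t: KernelDensity(kernel="gaussian", bandwidth=0.5).fit(demos[t])
    for t in range(horizon)
}

def reward(t: int, state: np.ndarray) -> float:
    """Reward at timestep t = log-density under that timestep's model."""
    return models[t].score_samples(state.reshape(1, -1))[0]

# A state typical of timestep 0 scores higher under model 0 than model 4,
# so the agent is pushed to follow the whole trajectory, not just popular states.
s0 = np.zeros(obs_dim)
r_early = reward(0, s0)
r_late = reward(4, s0)
```

This is why non-stationary demonstrations must be trajectories: transitions alone carry no timestep with which to pick the right model.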