Density-Based Reward Modeling
Density-based reward modeling is an inverse reinforcement learning (IRL) technique that assigns higher rewards to states or state-action pairs that occur more frequently in an expert’s demonstrations. This variant uses kernel density estimation to model the distribution of the expert demonstrations, and sets the reward of a state or state-action pair to its estimated log-likelihood under that distribution.
The key intuition behind this method is to incentivize the agent to take actions that resemble the expert’s actions in similar states.
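The idea can be illustrated with a minimal sketch that uses only the Python standard library. It is not the library's implementation (the API notes on this page say the density is estimated with sklearn's KernelDensity); the helper name `log_density_reward`, the default bandwidth, and the toy data are all hypothetical. It evaluates the log of a Gaussian kernel density estimate at a query point, so points close to the expert data receive a higher reward than points far from it.

```python
import math


def log_density_reward(query, expert_points, bandwidth=0.5):
    """Log of a Gaussian kernel density estimate at `query`.

    Hypothetical helper for illustration only. `expert_points` is a list of
    equal-length feature vectors, e.g. concatenated state-action pairs.
    """
    dim = len(query)
    # Log of the Gaussian kernel's normalising constant in `dim` dimensions.
    log_norm = -0.5 * dim * math.log(2.0 * math.pi) - dim * math.log(bandwidth)
    # Log kernel value between the query and every expert point.
    log_kernels = [
        log_norm
        - 0.5 * sum((q - x) ** 2 for q, x in zip(query, point)) / bandwidth**2
        for point in expert_points
    ]
    # Average the kernels in log space (log-sum-exp) for numerical stability.
    peak = max(log_kernels)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in log_kernels))
    return log_sum - math.log(len(expert_points))


# Toy "expert" state-action pairs clustered around (1.0, 0.0).
expert = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.05]]
near = log_density_reward([1.0, 0.0], expert)
far = log_density_reward([4.0, 3.0], expert)
print(f"reward near expert data: {near:.3f}")
print(f"reward far from expert data: {far:.3f}")
```

Working in log space keeps the reward finite even when the query lies many bandwidths away from every demonstration, where the raw density would underflow to zero.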
While this approach is relatively simple, it has several drawbacks:
- It assumes that the demonstrations are representative of the expert’s behavior, which may not always be true.
- It does not provide an interpretable reward function.
- Kernel density estimation is not well suited to high-dimensional state-action spaces.
Example
Detailed example notebook: Learning a Reward Function using Kernel Density
import pprint

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

from imitation.algorithms import density as db
from imitation.data import serialize
from imitation.util import util

rng = np.random.default_rng(0)

# Vectorized environment and pre-recorded expert rollouts.
env = util.make_vec_env("Pendulum-v1", rng=rng, n_envs=2)
rollouts = serialize.load("../tests/testdata/expert_models/pendulum_0/rollouts/final.npz")

# RL algorithm that will be trained on the learned (density-based) reward.
imitation_trainer = PPO(
    ActorCriticPolicy, env, learning_rate=3e-4, gamma=0.95, ent_coef=1e-4, n_steps=2048
)
density_trainer = db.DensityAlgorithm(
    venv=env,
    rng=rng,
    demonstrations=rollouts,
    rl_algo=imitation_trainer,
    density_type=db.DensityType.STATE_ACTION_DENSITY,
    is_stationary=True,
    kernel="gaussian",
    kernel_bandwidth=0.4,
    standardise_inputs=True,
)
# Fit the density model to the demonstrations.
density_trainer.train()


def print_stats(density_trainer, n_trajectories):
    # Rollout statistics under the environment's ground-truth reward.
    stats = density_trainer.test_policy(n_trajectories=n_trajectories)
    print("True reward function stats:")
    pprint.pprint(stats)
    # Rollout statistics under the learned imitation reward.
    stats_im = density_trainer.test_policy(true_reward=False, n_trajectories=n_trajectories)
    print("Imitation reward function stats:")
    pprint.pprint(stats_im)


print("Stats before training:")
print_stats(density_trainer, 1)

density_trainer.train_policy(100)  # Train for 1_000_000 steps to approach expert performance.

print("Stats after training:")
print_stats(density_trainer, 1)
API
- class imitation.algorithms.density.DensityAlgorithm(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)
Bases: DemonstrationAlgorithm
Learns a reward function based on density modeling.
Specifically, it constructs a non-parametric estimate of p(s), p(s,a), or p(s,s’), depending on the chosen density type, and then computes a reward using the log of that probability.
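As a rough illustration of the three densities, the sketch below builds the feature vector whose density would be estimated in each case. The `DensityKind` enum and the `density_input` helper are hypothetical stand-ins written for this example; they are not the library's DensityType or its preprocessing code, and flattening by simple concatenation is an assumption.

```python
from enum import Enum, auto


class DensityKind(Enum):
    """Hypothetical stand-in for the library's density type enum."""

    STATE = auto()  # models p(s)
    STATE_ACTION = auto()  # models p(s, a)
    STATE_STATE = auto()  # models p(s, s')


def density_input(kind, obs, act, next_obs):
    """Build the flat feature vector whose density is estimated."""
    if kind is DensityKind.STATE:
        return list(obs)
    if kind is DensityKind.STATE_ACTION:
        return list(obs) + list(act)
    if kind is DensityKind.STATE_STATE:
        return list(obs) + list(next_obs)
    raise ValueError(f"unknown density kind: {kind}")


# One toy transition: 3-dimensional observation, 1-dimensional action.
obs, act, next_obs = [0.1, 0.2, 0.3], [1.5], [0.2, 0.3, 0.4]
for kind in DensityKind:
    print(kind.name, density_input(kind, obs, act, next_obs))
```

The dimensionality of the density model's input grows from p(s) to p(s,a) or p(s,s’), which matters because kernel density estimation degrades in high dimensions.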
- __init__(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)
Builds DensityAlgorithm.
- Parameters
demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – expert demonstration trajectories.
density_type (DensityType) – type of density to train on: single state, state-action pairs, or state-state pairs.
kernel (str) – kernel to use for density estimation with sklearn.KernelDensity.
kernel_bandwidth (float) – bandwidth of kernel. If standardise_inputs is true and you are using a Gaussian kernel, then it probably makes sense to set this somewhere between 0.1 and 1.
venv (VecEnv) – The environment to learn a reward model in. We don’t actually need any environment interaction to fit the reward model, but we use this to extract the observation and action space, and to train the RL algorithm rl_algo (if specified).
rng (Generator) – random state for sampling from demonstrations.
rl_algo (Optional[BaseAlgorithm]) – An RL algorithm to train on the resulting reward model (optional).
is_stationary (bool) – if True, share same density models for all timesteps; if False, use a different density model for each timestep. A non-stationary model is particularly likely to be useful when using STATE_DENSITY, to encourage agent to imitate entire trajectories, not just a few states that have high frequency in the demonstration dataset. If non-stationary, demonstrations must be trajectories, not transitions (which do not contain timesteps).
standardise_inputs (bool) – if True, then the inputs to the reward model will be standardised to have zero mean and unit variance over the demonstration trajectories. Otherwise, inputs will be passed to the reward model with their ordinary scale.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
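The interaction between standardise_inputs and kernel_bandwidth can be sketched as follows. The helpers `fit_standardiser` and `standardise` are hypothetical and use only the standard library; the fallback to a standard deviation of 1.0 for constant features is an assumption made here, not documented library behaviour. Once every feature has unit spread, a bandwidth between 0.1 and 1 corresponds to a fraction of one standard deviation regardless of the features' original units.

```python
import statistics


def fit_standardiser(points):
    """Return per-dimension (means, stds) computed over demonstration points."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant features
    # (an assumption made here to avoid division by zero).
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardise(point, means, stds):
    """Rescale one feature vector to zero mean and unit variance."""
    return [(x - m) / s for x, m, s in zip(point, means, stds)]


# Toy demonstration features on very different scales.
demos = [[0.01, 200.0], [0.02, 180.0], [0.03, 220.0], [0.02, 200.0]]
means, stds = fit_standardiser(demos)
scaled = [standardise(p, means, stds) for p in demos]
for row in scaled:
    print([round(value, 3) for value in row])
```

Without this rescaling, a single bandwidth would be far too wide for the first feature and far too narrow for the second.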
- allow_variable_horizon: bool
If True, allow variable horizon trajectories; otherwise error if detected.
- buffering_wrapper: BufferingWrapper
- density_type: DensityType
- is_stationary: bool
- kernel: str
- kernel_bandwidth: float
- property logger: HierarchicalLogger
- Return type
HierarchicalLogger
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- rl_algo: Optional[BaseAlgorithm]
- set_demonstrations(demonstrations)
Sets the demonstration data.
- Return type
None
- standardise: bool
- test_policy(*, n_trajectories=10, true_reward=True)
Test current imitation policy on environment & give some rollout stats.
- Parameters
n_trajectories (int) – number of rolled-out trajectories.
true_reward (bool) – should this use ground truth reward from underlying environment (True), or imitation reward (False)?
- Returns
rollout statistics collected by imitation.data.rollout.rollout_stats().
- Return type
dict
- train()
Fits the density model to demonstration data self.transitions.
- Return type
None
- train_policy(n_timesteps=1000000, **kwargs)
Train the imitation policy for a given number of timesteps.
- Parameters
n_timesteps (int) – number of timesteps to train the policy for.
kwargs (dict) – extra arguments that will be passed to the learn() method of the imitation RL model. Refer to Stable Baselines docs for details.
- Return type
None
- transitions: Dict[Optional[int], ndarray]
- venv: VecEnv
- venv_wrapped: RewardVecEnvWrapper
- wrapper_callback: WrappedRewardCallback
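The key type of the transitions attribute, Optional[int], hints at how stationary and non-stationary models could organise the demonstration data. The sketch below is a guess at that layout, not the library's code: the helper `group_by_timestep` is hypothetical, and using None as the key for the single shared model is an assumption. With is_stationary=False, each timestep gets its own pool of demonstration states, which is why trajectories (and not bare transitions) are needed.

```python
from collections import defaultdict


def group_by_timestep(trajectories, is_stationary=True):
    """Map a timestep key to the demonstration states seen at that step.

    The key None denotes the single shared (stationary) pool; this keying is
    an assumption suggested by the Optional[int] key type.
    """
    grouped = defaultdict(list)
    for trajectory in trajectories:
        for t, state in enumerate(trajectory):
            key = None if is_stationary else t
            grouped[key].append(state)
    return dict(grouped)


# Two toy trajectories of three 1-dimensional states each.
trajs = [[[0.0], [0.5], [1.0]], [[0.1], [0.6], [1.1]]]
shared = group_by_timestep(trajs, is_stationary=True)
per_step = group_by_timestep(trajs, is_stationary=False)
print("stationary keys:", list(shared))
print("non-stationary keys:", sorted(per_step))
```

A separate density model would then be fitted to each pool, so a state that is common late in the demonstrations earns little reward if visited early.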