Density-Based Reward Modeling

Density-based reward modeling is an inverse reinforcement learning (IRL) technique that assigns higher rewards to states or state-action pairs that occur more frequently in an expert’s demonstrations. The implementation described here uses kernel density estimation to model the distribution of expert demonstrations, and sets the reward of each state or state-action pair to its estimated log-likelihood under that distribution.

The key intuition behind this method is to incentivize the agent to take actions that resemble the expert’s actions in similar states.
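To make this concrete, the following is a minimal sketch of a density-based reward written in plain NumPy, not the library's own code. The function name gaussian_kde_log_density and the synthetic expert_pairs data are hypothetical and exist only for illustration; the library itself delegates density estimation to scikit-learn.

```python
import numpy as np


def gaussian_kde_log_density(queries, samples, bandwidth):
    """Log of a Gaussian kernel density estimate, evaluated at each query point."""
    queries = np.atleast_2d(queries)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # Squared distance between every query and every sample: shape (n_queries, n).
    diffs = queries[:, None, :] - samples[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)
    log_kernels = -0.5 * sq_dists / bandwidth**2
    # Normalising constant of an average of n isotropic Gaussians in d dimensions.
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Log-sum-exp over samples, shifted by the maximum for numerical stability.
    max_log = log_kernels.max(axis=1, keepdims=True)
    lse = max_log[:, 0] + np.log(np.exp(log_kernels - max_log).sum(axis=1))
    return lse - log_norm


rng = np.random.default_rng(0)
# Hypothetical expert data: 500 (state, action) pairs clustered around (1.0, -1.0).
expert_pairs = rng.normal(loc=[1.0, -1.0], scale=0.1, size=(500, 2))

near_expert = np.array([[1.0, -1.0]])
far_from_expert = np.array([[-1.0, 1.0]])

# The reward is the estimated log-density of the queried (state, action) pair.
reward_near = gaussian_kde_log_density(near_expert, expert_pairs, bandwidth=0.4)[0]
reward_far = gaussian_kde_log_density(far_from_expert, expert_pairs, bandwidth=0.4)[0]
print(f"reward near expert data: {reward_near:.2f}")
print(f"reward far from expert data: {reward_far:.2f}")
```

A state-action pair that resembles the demonstrations receives a much higher reward than one far from them, which is what pushes the agent toward expert-like behavior.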

While this approach is relatively simple, it does have several drawbacks:

It assumes that the expert demonstrations are representative of the expert’s behavior, which may not always be true.
It does not provide an interpretable reward function.
Kernel density estimation is not well suited to high-dimensional state-action spaces.
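The last drawback can be illustrated with a small experiment. The sketch below uses synthetic standard-normal data (an assumption made purely for illustration, with a hypothetical helper mean_nearest_distance) and measures how far a fresh point typically lies from its nearest demonstration as the dimension grows. With a fixed number of demonstrations, that distance quickly exceeds any reasonable kernel bandwidth, so the density estimate becomes uninformative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo, n_query = 200, 100


def mean_nearest_distance(dim):
    """Average distance from a fresh point to its nearest demonstration."""
    demos = rng.standard_normal((n_demo, dim))
    queries = rng.standard_normal((n_query, dim))
    # Pairwise Euclidean distances: shape (n_query, n_demo).
    dists = np.linalg.norm(queries[:, None, :] - demos[None, :, :], axis=-1)
    return dists.min(axis=1).mean()


nearest = {dim: mean_nearest_distance(dim) for dim in (2, 10, 50)}
for dim, dist in nearest.items():
    print(f"dim={dim:3d}  mean distance to nearest demonstration: {dist:.2f}")
```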

Example

Detailed example notebook: Learning a Reward Function using Kernel Density

import pprint
import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

from imitation.algorithms import density as db
from imitation.data import serialize
from imitation.util import util

rng = np.random.default_rng(0)

env = util.make_vec_env("Pendulum-v1", rng=rng, n_envs=2)
rollouts = serialize.load("../tests/testdata/expert_models/pendulum_0/rollouts/final.npz")

imitation_trainer = PPO(
    ActorCriticPolicy, env, learning_rate=3e-4, gamma=0.95, ent_coef=1e-4, n_steps=2048
)
density_trainer = db.DensityAlgorithm(
    venv=env,
    rng=rng,
    demonstrations=rollouts,
    rl_algo=imitation_trainer,
    density_type=db.DensityType.STATE_ACTION_DENSITY,
    is_stationary=True,
    kernel="gaussian",
    kernel_bandwidth=0.4,
    standardise_inputs=True,
)
# Fit the density model to the demonstrations; this step needs no environment interaction.
density_trainer.train()

def print_stats(density_trainer, n_trajectories):
    stats = density_trainer.test_policy(n_trajectories=n_trajectories)
    print("True reward function stats:")
    pprint.pprint(stats)
    stats_im = density_trainer.test_policy(true_reward=False, n_trajectories=n_trajectories)
    print("Imitation reward function stats:")
    pprint.pprint(stats_im)

print("Stats before training:")
print_stats(density_trainer, 1)

density_trainer.train_policy(100)  # Short run for illustration; train for 1_000_000 steps to approach expert performance.

print("Stats after training:")
print_stats(density_trainer, 1)

API

class imitation.algorithms.density.DensityAlgorithm(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)

Bases: DemonstrationAlgorithm

Learns a reward function based on density modeling.

Specifically, it constructs a non-parametric estimate of p(s), p(s,a), or p(s,s’), depending on density_type, and then computes the reward as the log of that estimated probability.
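For a Gaussian kernel, the estimate and the resulting reward can be written as follows. This is the textbook kernel density estimator, given as a sketch; the symbols x, x_i, n, d and h are notation introduced here for illustration, and the library's exact preprocessing and normalisation may differ.

```latex
\hat{p}(x) = \frac{1}{n \, (2\pi h^2)^{d/2}} \sum_{i=1}^{n} \exp\!\left( -\frac{\lVert x - x_i \rVert^2}{2 h^2} \right),
\qquad
r(x) = \log \hat{p}(x)
```

Here x stands for s, (s,a), or (s,s’) according to density_type, the x_i are the n corresponding samples from the demonstrations, d is the dimension of x, and h plays the role of kernel_bandwidth.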

__init__(*, demonstrations, venv, rng, density_type=DensityType.STATE_ACTION_DENSITY, kernel='gaussian', kernel_bandwidth=0.5, rl_algo=None, is_stationary=True, standardise_inputs=True, custom_logger=None, allow_variable_horizon=False)

Builds DensityAlgorithm.

Parameters

demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – expert demonstration trajectories.
density_type (DensityType) – type of density to train on: single state, state-action pairs, or state-state pairs.
kernel (str) – kernel to use for density estimation with sklearn.neighbors.KernelDensity.
kernel_bandwidth (float) – bandwidth of kernel. If standardise_inputs is true and you are using a Gaussian kernel, then it probably makes sense to set this somewhere between 0.1 and 1.
venv (VecEnv) – The environment to learn a reward model in. We don’t actually need any environment interaction to fit the reward model, but we use this to extract the observation and action space, and to train the RL algorithm rl_algo (if specified).
rng (Generator) – random state for sampling from demonstrations.
rl_algo (Optional[BaseAlgorithm]) – An RL algorithm to train on the resulting reward model (optional).
is_stationary (bool) – if True, share the same density model across all timesteps; if False, use a different density model for each timestep. A non-stationary model is particularly likely to be useful when using STATE_DENSITY, to encourage the agent to imitate entire trajectories, not just a few states that have high frequency in the demonstration dataset. If non-stationary, demonstrations must be trajectories, not transitions (which do not contain timesteps).
standardise_inputs (bool) – if True, then the inputs to the reward model will be standardised to have zero mean and unit variance over the demonstration trajectories. Otherwise, inputs will be passed to the reward model with their ordinary scale.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
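The interaction between standardise_inputs and kernel_bandwidth is easier to see with numbers. The sketch below uses plain NumPy and hypothetical demonstration data whose two features live on very different scales; it is not the library's implementation, only an illustration of what standardising to zero mean and unit variance does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical demonstrations: one feature with spread ~0.1, another with spread ~300.
demos = np.column_stack(
    [
        rng.normal(0.0, 0.1, size=1000),
        rng.normal(0.0, 300.0, size=1000),
    ]
)

# Statistics are computed over the demonstration data only.
mean = demos.mean(axis=0)
std = demos.std(axis=0)


def standardise(x):
    """Rescale inputs using the demonstration mean and standard deviation."""
    return (x - mean) / std


standardised = standardise(demos)
print("raw spread per feature:         ", demos.std(axis=0))
print("standardised spread per feature:", standardised.std(axis=0))
```

After standardisation every feature has unit spread, so a single bandwidth between 0.1 and 1 corresponds to a fraction of one standard deviation in each dimension. On the raw data, no single bandwidth would suit both features at once.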

allow_variable_horizon: bool: If True, allow variable horizon trajectories; otherwise error if detected.

buffering_wrapper: BufferingWrapper

density_type: DensityType

is_stationary: bool

kernel: str

kernel_bandwidth: float

property logger: HierarchicalLogger

Return type: HierarchicalLogger

property policy: BasePolicy

Returns a policy imitating the demonstration data.

Return type: BasePolicy

rl_algo: Optional[BaseAlgorithm]

set_demonstrations(demonstrations)

Sets the demonstration data.

Return type: None

standardise: bool

test_policy(*, n_trajectories=10, true_reward=True)

Test current imitation policy on environment & give some rollout stats.

Parameters

n_trajectories (int) – number of rolled-out trajectories.
true_reward (bool) – whether to use the ground-truth reward from the underlying environment (True) or the imitation reward (False).

Returns

rollout statistics collected by imitation.data.rollout.rollout_stats().

Return type

dict

train()

Fits the density model to demonstration data self.transitions.

Return type: None
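For the non-stationary case described under is_stationary, fitting amounts to keeping one density model per timestep instead of a single pooled one. The sketch below illustrates that idea in plain NumPy with hypothetical one-dimensional trajectories; the helper log_density and the dictionary layout are assumptions made for illustration and are not taken from the library's internals.

```python
import numpy as np


def log_density(query, samples, bandwidth):
    """Log Gaussian KDE of a single query point, given samples of shape (n, d)."""
    n, d = samples.shape
    log_k = -0.5 * np.sum((samples - query) ** 2, axis=1) / bandwidth**2
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    return np.logaddexp.reduce(log_k) - log_norm


# Hypothetical demonstrations: 50 trajectories of length 3 with 1-D states.
# The expert is near state 0 at t=0, state 1 at t=1, and state 2 at t=2.
rng = np.random.default_rng(0)
trajectories = np.arange(3.0)[None, :, None] + rng.normal(0, 0.05, size=(50, 3, 1))

# Stationary: one pooled sample set. Non-stationary: one sample set per timestep.
pooled = trajectories.reshape(-1, 1)
per_timestep = {t: trajectories[:, t, :] for t in range(trajectories.shape[1])}

# Being in state 2 at t=0 is out of order for this expert.
query = np.array([2.0])
stationary_reward = log_density(query, pooled, bandwidth=0.2)
nonstationary_reward = log_density(query, per_timestep[0], bandwidth=0.2)
print(f"stationary reward:     {stationary_reward:.2f}")
print(f"non-stationary reward: {nonstationary_reward:.2f}")
```

The pooled model rewards state 2 at any time because the expert visits it eventually, while the per-timestep model penalises visiting it at the wrong moment. This is the sense in which a non-stationary model encourages imitating whole trajectories.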

train_policy(n_timesteps=1000000, **kwargs)

Train the imitation policy for a given number of timesteps.

Parameters

n_timesteps (int) – number of timesteps to train the policy for.
kwargs (dict) – extra arguments that will be passed to the learn() method of the imitation RL model. Refer to Stable Baselines docs for details.

Return type

None

transitions: Dict[Optional[int], ndarray]

venv: VecEnv

venv_wrapped: RewardVecEnvWrapper

wrapper_callback: WrappedRewardCallback