Preference Comparisons#

The preference comparison algorithm learns a reward function from preferences between pairs of trajectories. The comparisons are modeled as being generated by a Bradley-Terry (or Boltzmann rational) model, in which the odds of preferring trajectory A over B are the exponential of the difference between the returns of A and B. In other words, the difference in returns forms the logit of a binary classification problem, and accordingly the reward function is trained using a cross-entropy loss to predict the preference comparisons.
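In symbols, restating the description above with R(τ) denoting the predicted return of a trajectory fragment τ and p the recorded preference label for A over B:

\[
P(\tau_A \succ \tau_B) = \frac{\exp R(\tau_A)}{\exp R(\tau_A) + \exp R(\tau_B)} = \sigma\bigl(R(\tau_A) - R(\tau_B)\bigr),
\]

where σ is the logistic function, and the reward network is trained by minimizing the binary cross-entropy

\[
\mathcal{L} = -\,p \log P(\tau_A \succ \tau_B) - (1 - p) \log P(\tau_B \succ \tau_A).
\]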

Note

  • Our implementation is based on the Deep Reinforcement Learning from Human Preferences algorithm.

  • An ensemble of reward networks can also be trained instead of a single network. Disagreement between the ensemble members' preference predictions can then be used to actively select preference queries (see the sketch below).
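For intuition, here is a minimal, library-agnostic sketch of that active-selection idea in plain PyTorch. The helper names are hypothetical and the actual ensemble and fragmenter classes in imitation work differently; this only illustrates the disagreement criterion.

import torch

def preference_logits(ensemble_returns_a, ensemble_returns_b):
    """Per-member logits for "fragment A is preferred over fragment B".

    Both arguments are tensors of shape (n_members, n_pairs) holding each
    ensemble member's predicted return for fragment A / B of every candidate pair.
    """
    return ensemble_returns_a - ensemble_returns_b

def select_queries(ensemble_returns_a, ensemble_returns_b, n_queries):
    """Pick the candidate pairs the ensemble members disagree on most."""
    # Each member's predicted probability that A is preferred (Bradley-Terry model).
    probs = torch.sigmoid(preference_logits(ensemble_returns_a, ensemble_returns_b))
    # Disagreement = variance of that probability across ensemble members.
    disagreement = probs.var(dim=0)
    # Query preferences for the most uncertain pairs.
    return torch.topk(disagreement, n_queries).indices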

Example#

You can copy this example to train PPO on Pendulum using a reward model trained on 200 synthetic preference comparisons. For a more detailed example, refer to Learning a Reward Function using Preference Comparisons.

import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy

from imitation.algorithms import preference_comparisons
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)

venv = make_vec_env("Pendulum-v1", rng=rng)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm,
)

fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=10,
    rng=rng,
)

agent = PPO(
    policy=FeedForward32Policy,
    policy_kwargs=dict(
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=venv,
    n_steps=2048 // venv.num_envs,
    clip_range=0.1,
    ent_coef=0.01,
    gae_lambda=0.95,
    n_epochs=10,
    gamma=0.97,
    learning_rate=2e-3,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.05,
    rng=rng,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=5,  # Set to 60 for better performance
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    initial_epoch_multiplier=4,
    initial_comparison_frac=0.1,
    query_schedule="hyperbolic",
)
pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)

n_eval_episodes = 10
reward_mean, reward_std = evaluate_policy(agent.policy, venv, n_eval_episodes)
reward_stderr = reward_std / np.sqrt(n_eval_episodes)
print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")

API#

class imitation.algorithms.preference_comparisons.PreferenceComparisons(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]

Bases: BaseImitationAlgorithm

Main interface for reward learning using preference comparisons.

__init__(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]

Initialize the preference comparison trainer.

The loggers of all subcomponents are overridden with the logger used by this class.

Parameters
  • trajectory_generator (TrajectoryGenerator) – generates trajectories while optionally training an RL agent on the learned reward function (it can also simply be a sampler over a static dataset of trajectories).

  • reward_model (RewardNet) – a RewardNet instance to be used for learning the reward.

  • num_iterations (int) – number of times to train the agent against the reward model and then train the reward model against newly gathered preferences.

  • fragmenter (Optional[Fragmenter]) – takes in a set of trajectories and returns pairs of fragments for which preferences will be gathered. These fragments could be random, or they could be selected more deliberately (active learning). Default is a random fragmenter.

  • preference_gatherer (Optional[PreferenceGatherer]) – how to get preferences between trajectory fragments. Default (and currently the only option) is to use synthetic preferences based on ground-truth rewards. Human preferences could be implemented here in the future.

  • reward_trainer (Optional[RewardTrainer]) – trains the reward model based on pairs of fragments and associated preferences. Default is to use the preference model and loss function from DRLHP.

  • comparison_queue_size (Optional[int]) – the maximum number of comparisons to keep in the queue for training the reward model. If None, the queue will grow without bound as new comparisons are added.

  • fragment_length (int) – number of timesteps per fragment that is used to elicit preferences.

  • transition_oversampling (float) – factor by which to oversample transitions before creating fragments. Since fragments are sampled with replacement, this is usually chosen > 1 to avoid having the same transition in too many fragments.

  • initial_comparison_frac (float) – fraction of the total_comparisons argument to train() that will be sampled before the rest of training begins (using a randomly initialized agent). This can be used to pretrain the reward model before the agent is trained on the learned reward, to help avoid irreversibly learning a bad policy from an untrained reward. Note that there will often be some additional pretraining comparisons since comparisons_per_iteration won’t exactly divide the total number of comparisons. How many such comparisons there are depends discontinuously on total_comparisons and comparisons_per_iteration.

  • initial_epoch_multiplier (float) – before agent training begins, train the reward model for this many more epochs than usual (on fragments sampled from a random agent).

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.

  • rng (Optional[Generator]) – random number generator to use for initializing subcomponents such as fragmenter. Only used when default components are used; if you instantiate your own fragmenter, preference gatherer, etc., you are responsible for seeding them!

  • query_schedule (Union[str, Callable[[float], float]]) – one of (“constant”, “hyperbolic”, “inverse_quadratic”), or a function that takes in a float between 0 and 1 inclusive, representing a fraction of the total number of timesteps elapsed up to some time T, and returns a potentially unnormalized probability indicating the fraction of total_comparisons that should be queried at that iteration. This function will be called num_iterations times in __init__() with values from np.linspace(0, 1, num_iterations) as input. The outputs will be normalized to sum to 1 and then used to apportion the comparisons among the num_iterations iterations.
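For illustration, a custom query_schedule can be a plain Python callable like the following sketch (the schedule itself is hypothetical; only the call signature matches the description above):

def front_loaded_schedule(t: float) -> float:
    """Unnormalized query weight for a fraction t in [0, 1] of elapsed timesteps.

    Front-loads comparisons early in training, similar in spirit to the
    built-in "hyperbolic" schedule; the outputs are renormalized internally.
    """
    return 1.0 / (1.0 + 4.0 * t)

# Passed as e.g. PreferenceComparisons(..., query_schedule=front_loaded_schedule).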

Raises

ValueError – if query_schedule is not a valid string or callable.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property logger: HierarchicalLogger
Return type

HierarchicalLogger

train(total_timesteps, total_comparisons, callback=None)[source]

Train the reward model and the policy if applicable.

Parameters
  • total_timesteps (int) – number of environment interaction steps

  • total_comparisons (int) – number of preferences to gather in total

  • callback (Optional[Callable[[int], None]]) – callback function called at the end of each iteration (see the sketch below)

Return type

Mapping[str, Any]

Returns

A dictionary with final metrics such as loss and accuracy of the reward model.
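For illustration, the callback argument can be used for lightweight progress reporting. A minimal sketch reusing the pref_comparisons object from the example above (we assume the integer passed to the callback is the index of the finished iteration):

def log_iteration(iteration_num: int) -> None:
    # Called once at the end of each preference-comparison iteration.
    print(f"finished iteration {iteration_num}")

result = pref_comparisons.train(
    total_timesteps=50_000,
    total_comparisons=200,
    callback=log_iteration,
)
print(result)  # Final metrics such as reward model loss and accuracy.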

class imitation.algorithms.base.BaseImitationAlgorithm(*, custom_logger=None, allow_variable_horizon=False)[source]

Bases: ABC

Base class for all imitation learning algorithms.

__init__(*, custom_logger=None, allow_variable_horizon=False)[source]

Creates an imitation learning algorithm.

Parameters
  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html before overriding this.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property logger: HierarchicalLogger
Return type

HierarchicalLogger