Preference Comparisons#

The preference comparison algorithm learns a reward function from preferences between pairs of trajectories. The comparisons are modeled as being generated by a Bradley-Terry (or Boltzmann rational) model, in which the odds of preferring trajectory A over B are the exponential of the difference between the returns of A and B. In other words, the difference in returns forms the logit of a binary classification problem, and accordingly the reward function is trained using a cross-entropy loss to predict the preference comparisons.
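In symbols, restating the description above with R(τ) denoting the predicted return of a trajectory fragment τ and p the recorded preference label for A over B:

\[
P(\tau_A \succ \tau_B) = \frac{\exp R(\tau_A)}{\exp R(\tau_A) + \exp R(\tau_B)} = \sigma\bigl(R(\tau_A) - R(\tau_B)\bigr),
\]

where σ is the logistic function, and the reward network is trained by minimizing the binary cross-entropy

\[
\mathcal{L} = -\,p \log P(\tau_A \succ \tau_B) - (1 - p) \log P(\tau_B \succ \tau_A).
\]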

Note

  • Our implementation is based on the Deep Reinforcement Learning from Human Preferences algorithm.

  • An ensemble of reward networks can also be trained instead of a single network. Disagreement between the ensemble members' preference predictions can then be used to actively select preference queries (see the sketch below).
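For intuition, here is a minimal, library-agnostic sketch of that active-selection idea in plain PyTorch. The helper names are hypothetical and the actual ensemble and fragmenter classes in imitation work differently; this only illustrates the disagreement criterion.

import torch

def preference_logits(ensemble_returns_a, ensemble_returns_b):
    """Per-member logits for "fragment A is preferred over fragment B".

    Both arguments are tensors of shape (n_members, n_pairs) holding each
    ensemble member's predicted return for fragment A / B of every candidate pair.
    """
    return ensemble_returns_a - ensemble_returns_b

def select_queries(ensemble_returns_a, ensemble_returns_b, n_queries):
    """Pick the candidate pairs the ensemble members disagree on most."""
    # Each member's predicted probability that A is preferred (Bradley-Terry model).
    probs = torch.sigmoid(preference_logits(ensemble_returns_a, ensemble_returns_b))
    # Disagreement = variance of that probability across ensemble members.
    disagreement = probs.var(dim=0)
    # Query preferences for the most uncertain pairs.
    return torch.topk(disagreement, n_queries).indices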

Example#

You can copy this example to train PPO on Pendulum using a reward model trained on 200 synthetic preference comparisons. For a more detailed example, refer to Learning a Reward Function using Preference Comparisons.

import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy

from imitation.algorithms import preference_comparisons
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)

venv = make_vec_env("Pendulum-v1", rng=rng)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm,
)

fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=10,
    rng=rng,
)

agent = PPO(
    policy=FeedForward32Policy,
    policy_kwargs=dict(
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=venv,
    n_steps=2048 // venv.num_envs,
    clip_range=0.1,
    ent_coef=0.01,
    gae_lambda=0.95,
    n_epochs=10,
    gamma=0.97,
    learning_rate=2e-3,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.05,
    rng=rng,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=5,  # Set to 60 for better performance
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    initial_epoch_multiplier=4,
    initial_comparison_frac=0.1,
    query_schedule="hyperbolic",
)
pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)

n_eval_episodes = 10
reward_mean, reward_std = evaluate_policy(agent.policy, venv, n_eval_episodes)
reward_stderr = reward_std / np.sqrt(n_eval_episodes)
print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")

API#

class imitation.algorithms.preference_comparisons.PreferenceComparisons(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]

Bases: BaseImitationAlgorithm

Main interface for reward learning using preference comparisons.

__init__(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]

Initialize the preference comparison trainer.

The loggers of all subcomponents are overridden with the logger used by this class.

Parameters
  • trajectory_generator (TrajectoryGenerator) – generates trajectories while optionally training an RL agent on the learned reward function (it can also simply be a sampler over a static dataset of trajectories).

  • reward_model (RewardNet) – a RewardNet instance to be used for learning the reward.

  • num_iterations (int) – number of times to train the agent against the reward model and then train the reward model against newly gathered preferences.

  • fragmenter (Optional[Fragmenter]) – takes in a set of trajectories and returns pairs of fragments for which preferences will be gathered. These fragments could be random, or they could be selected more deliberately (active learning). Default is a random fragmenter.

  • preference_gatherer (Optional[PreferenceGatherer]) – how to get preferences between trajectory fragments. Default (and currently the only option) is to use synthetic preferences based on ground-truth rewards. Human preferences could be implemented here in the future.

  • reward_trainer (Optional[RewardTrainer]) – trains the reward model based on pairs of fragments and associated preferences. Default is to use the preference model and loss function from DRLHP.

  • comparison_queue_size (Optional[int]) – the maximum number of comparisons to keep in the queue for training the reward model. If None, the queue will grow without bound as new comparisons are added.

  • fragment_length (int) – number of timesteps per fragment that is used to elicit preferences.

  • transition_oversampling (float) – factor by which to oversample transitions before creating fragments. Since fragments are sampled with replacement, this is usually chosen > 1 to avoid having the same transition in too many fragments.

  • initial_comparison_frac (float) – fraction of the total_comparisons argument to train() that will be sampled before the rest of training begins (using a randomly initialized agent). This can be used to pretrain the reward model before the agent is trained on the learned reward, to help avoid irreversibly learning a bad policy from an untrained reward. Note that there will often be some additional pretraining comparisons since comparisons_per_iteration won’t exactly divide the total number of comparisons. How many such comparisons there are depends discontinuously on total_comparisons and comparisons_per_iteration.

  • initial_epoch_multiplier (float) – before agent training begins, train the reward model for this many more epochs than usual (on fragments sampled from a random agent).

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.

  • rng (Optional[Generator]) – random number generator to use for initializing subcomponents such as fragmenter. Only used when default components are used; if you instantiate your own fragmenter, preference gatherer, etc., you are responsible for seeding them!

  • query_schedule (Union[str, Callable[[float], float]]) – one of (“constant”, “hyperbolic”, “inverse_quadratic”), or a function that takes in a float between 0 and 1 inclusive, representing a fraction of the total number of timesteps elapsed up to some time T, and returns a potentially unnormalized probability indicating the fraction of total_comparisons that should be queried at that iteration. This function will be called num_iterations times in __init__() with values from np.linspace(0, 1, num_iterations) as input. The outputs will be normalized to sum to 1 and then used to apportion the comparisons among the num_iterations iterations.
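For illustration, a custom query_schedule can be a plain Python callable like the following sketch (the schedule itself is hypothetical; only the call signature matches the description above):

def front_loaded_schedule(t: float) -> float:
    """Unnormalized query weight for a fraction t in [0, 1] of elapsed timesteps.

    Front-loads comparisons early in training, similar in spirit to the
    built-in "hyperbolic" schedule; the outputs are renormalized internally.
    """
    return 1.0 / (1.0 + 4.0 * t)

# Passed as e.g. PreferenceComparisons(..., query_schedule=front_loaded_schedule).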

Raises

ValueError – if query_schedule is not a valid string or callable.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property logger: HierarchicalLogger
Return type

HierarchicalLogger

train(total_timesteps, total_comparisons, callback=None)[source]

Train the reward model and the policy if applicable.

Parameters
  • total_timesteps (int) – number of environment interaction steps

  • total_comparisons (int) – number of preferences to gather in total

  • callback (Optional[Callable[[int], None]]) – callback function called at the end of each iteration (see the sketch below)

Return type

Mapping[str, Any]

Returns

A dictionary with final metrics such as loss and accuracy of the reward model.
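For illustration, the callback argument can be used for lightweight progress reporting. A minimal sketch reusing the pref_comparisons object from the example above (we assume the integer passed to the callback is the index of the finished iteration):

def log_iteration(iteration_num: int) -> None:
    # Called once at the end of each preference-comparison iteration.
    print(f"finished iteration {iteration_num}")

result = pref_comparisons.train(
    total_timesteps=50_000,
    total_comparisons=200,
    callback=log_iteration,
)
print(result)  # Final metrics such as reward model loss and accuracy.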

class imitation.algorithms.base.BaseImitationAlgorithm(*, custom_logger=None, allow_variable_horizon=False)[source]

Bases: ABC

Base class for all imitation learning algorithms.

__init__(*, custom_logger=None, allow_variable_horizon=False)[source]

Creates an imitation learning algorithm.

Parameters
  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html before overriding this.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property logger: HierarchicalLogger
Return type

HierarchicalLogger