imitation.algorithms.preference_comparisons
Learning reward models using preference comparisons.
Trains a reward model and optionally a policy based on preferences between trajectory fragments.
Classes
| ActiveSelectionFragmenter | Sample fragments of trajectories based on active selection. |
| AgentTrainer | Wrapper for training an SB3 algorithm on an arbitrary reward function. |
| BasicRewardTrainer | Train a basic reward model. |
| CrossEntropyRewardLoss | Compute the cross entropy reward loss. |
| EnsembleTrainer | Train a reward ensemble. |
| Fragmenter | Class for creating pairs of trajectory fragments from a set of trajectories. |
| LossAndMetrics | Loss and auxiliary metrics for reward network training. |
| PreferenceComparisons | Main interface for reward learning using preference comparisons. |
| PreferenceDataset | A PyTorch Dataset for preference comparisons. |
| PreferenceGatherer | Base class for gathering preference comparisons between trajectory fragments. |
| PreferenceModel | Class to convert two fragments' rewards into preference probability. |
| RandomFragmenter | Sample fragments of trajectories uniformly at random with replacement. |
| RewardLoss | A loss function over preferences. |
| RewardTrainer | Abstract base class for training reward models using preference comparisons. |
| SyntheticGatherer | Computes synthetic preferences using ground-truth environment rewards. |
| TrajectoryDataset | A fixed dataset of trajectories. |
| TrajectoryGenerator | Generator of trajectories with optional training logic. |
- class imitation.algorithms.preference_comparisons.ActiveSelectionFragmenter(preference_model, base_fragmenter, fragment_sample_factor, uncertainty_on='logit', custom_logger=None)[source]#
Bases:
Fragmenter
Sample fragments of trajectories based on active selection.
Actively picks the fragment pairs with the highest uncertainty (variance) of rewards/probabilities/predictions from the ensemble model.
- __init__(preference_model, base_fragmenter, fragment_sample_factor, uncertainty_on='logit', custom_logger=None)[source]#
Initialize the active selection fragmenter.
- Parameters
preference_model (PreferenceModel) – an ensemble model that predicts the preference of the first fragment over the other.
base_fragmenter (Fragmenter) – fragmenter instance to get fragment pairs from trajectories.
fragment_sample_factor (float) – the factor of the number of fragment pairs to sample from the base_fragmenter.
uncertainty_on (str) – the variable to calculate the variance on. Can be logit|probability|label.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- Raises
ValueError – Preference model not wrapped over an ensemble of networks.
- property uncertainty_on: str#
- Return type
str
- variance_estimate(rews1, rews2)[source]#
Gets the variance estimate from the rewards of a fragment pair.
- Parameters
rews1 (Tensor) – rewards obtained by all the ensemble models for the first fragment. Shape - (fragment_length, num_ensemble_members)
rews2 (Tensor) – rewards obtained by all the ensemble models for the second fragment. Shape - (fragment_length, num_ensemble_members)
- Return type
float
- Returns
the variance estimate based on the uncertainty_on flag.
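The following is a minimal sketch of how a variance estimate could be derived for each uncertainty_on setting, using plain Python lists in place of tensors. The exact formulas are assumptions made for illustration: population variance of per-member return differences for "logit", variance of the resulting sigmoid probabilities for "probability", and the Bernoulli variance of the members' votes for "label". The library's actual computation may differ in details such as the variance estimator.

```python
import math
from statistics import pvariance


def variance_estimate(rews1, rews2, uncertainty_on="logit"):
    """Sketch of an ensemble disagreement score for one fragment pair.

    rews1 and rews2 are lists of rows with shape
    (fragment_length, num_ensemble_members).
    """
    n_members = len(rews1[0])
    # Return difference (first minus second fragment) for each ensemble member.
    diffs = [
        sum(row[m] for row in rews1) - sum(row[m] for row in rews2)
        for m in range(n_members)
    ]
    if uncertainty_on == "logit":
        return pvariance(diffs)
    # Probability that each member prefers the first fragment.
    probs = [1.0 / (1.0 + math.exp(-d)) for d in diffs]
    if uncertainty_on == "probability":
        return pvariance(probs)
    if uncertainty_on == "label":
        # Fraction of members voting for the first fragment.
        p = sum(prob > 0.5 for prob in probs) / n_members
        return p * (1.0 - p)
    raise ValueError(f"unknown uncertainty_on: {uncertainty_on!r}")
```

Pairs on which the members disagree most receive the highest score, so they are the ones an active-selection strategy would query first.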
- class imitation.algorithms.preference_comparisons.AgentTrainer(algorithm, reward_fn, venv, rng, exploration_frac=0.0, switch_prob=0.5, random_prob=0.5, custom_logger=None)[source]#
Bases:
TrajectoryGenerator
Wrapper for training an SB3 algorithm on an arbitrary reward function.
- __init__(algorithm, reward_fn, venv, rng, exploration_frac=0.0, switch_prob=0.5, random_prob=0.5, custom_logger=None)[source]#
Initialize the agent trainer.
- Parameters
algorithm (BaseAlgorithm) – the stable-baselines algorithm to use for training.
reward_fn (Union[RewardFn, RewardNet]) – either a RewardFn or a RewardNet instance that will supply the rewards used for training the agent.
venv (VecEnv) – vectorized environment to train in.
rng (Generator) – random number generator used for exploration and for sampling.
exploration_frac (float) – fraction of the trajectories that will be generated partially randomly rather than only by the agent when sampling.
switch_prob (float) – the probability of switching the current policy at each step for the exploratory samples.
random_prob (float) – the probability of picking the random policy when switching during exploration.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- property logger: HierarchicalLogger#
- Return type
HierarchicalLogger
- sample(steps)[source]#
Sample a batch of trajectories.
- Parameters
steps (int) – All trajectories taken together should have at least this many steps.
- Return type
Sequence[TrajectoryWithRew]
- Returns
A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).
- train(steps, **kwargs)[source]#
Train the agent using the reward function specified during instantiation.
- Parameters
steps (int) – number of environment timesteps to train for.
**kwargs – other keyword arguments to pass to BaseAlgorithm.learn().
- Raises
RuntimeError – Transitions left in self.buffering_wrapper; call self.sample first to clear them.
- Return type
None
- class imitation.algorithms.preference_comparisons.BasicRewardTrainer(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#
Bases:
RewardTrainer
Train a basic reward model.
- __init__(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#
Initialize the reward model trainer.
- Parameters
preference_model (PreferenceModel) – the preference model to train the reward network.
loss (RewardLoss) – the loss to use.
rng (Generator) – the random number generator to use for splitting the dataset into training and validation.
batch_size (int) – number of fragment pairs per batch.
minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.
epochs (int) – number of epochs in each training iteration (can be adjusted on the fly by specifying an epoch_multiplier in self.train() if longer training is desired in specific cases).
lr (float) – the learning rate.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
regularizer_factory (Optional[RegularizerFactory]) – if you would like to apply regularization during training, specify a regularizer factory here. The factory will be used to construct a regularizer. See imitation.regularization.RegularizerFactory for more details.
- Raises
ValueError – if the batch size is not a multiple of the minibatch size.
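The relationship between batch_size and minibatch_size can be illustrated with a small, library-independent sketch. The helper names below are hypothetical and the model is a toy one-parameter regression. The point is only that accumulating gradients over minibatches that evenly divide the batch yields the same gradient as one full-batch pass.

```python
def minibatch_slices(batch_size, minibatch_size=None):
    """Split a batch into equally sized minibatches (hypothetical helper)."""
    if minibatch_size is None:
        minibatch_size = batch_size
    if batch_size % minibatch_size != 0:
        raise ValueError("batch_size must be a multiple of minibatch_size")
    return [(s, s + minibatch_size) for s in range(0, batch_size, minibatch_size)]


def accumulated_gradient(w, xs, ys, minibatch_size=None):
    """Gradient of the mean squared error of y = w * x, summed over minibatches."""
    grad = 0.0
    for start, end in minibatch_slices(len(xs), minibatch_size):
        # Only this slice would need to be resident in memory at once.
        for x, y in zip(xs[start:end], ys[start:end]):
            grad += 2.0 * (w * x - y) * x / len(xs)
    return grad


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = accumulated_gradient(1.0, xs, ys)
chunked = accumulated_gradient(1.0, xs, ys, minibatch_size=2)
```

Here `full` and `chunked` agree; only peak memory use and the number of passes differ.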
- regularizer: Optional[Regularizer]#
- property requires_regularizer_update: bool#
Whether the regularizer requires updating.
- Return type
bool
- Returns
If true, this means that a validation dataset will be used.
- class imitation.algorithms.preference_comparisons.CrossEntropyRewardLoss[source]#
Bases:
RewardLoss
Compute the cross entropy reward loss.
- forward(fragment_pairs, preferences, preference_model)[source]#
Computes the loss.
- Parameters
fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – Batch consisting of pairs of trajectory fragments.
preferences (ndarray) – The probability that the first fragment is preferred over the second. Typically 0, 1 or 0.5 (tie).
preference_model (PreferenceModel) – model to predict the preferred fragment from a pair.
- Return type
LossAndMetrics
- Returns
The cross-entropy loss between the probability predicted by the reward model and the target probabilities in preferences. Metrics are accuracy and, if the ground-truth reward is available, gt_reward_loss.
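As a rough illustration of the quantities described here, the sketch below computes a binary cross-entropy between predicted preference probabilities and target preferences, together with an accuracy metric. The function names, the clamping constant eps, and the rule of thresholding both predictions and targets at 0.5 are assumptions made for this example, not the library's code.

```python
import math


def cross_entropy_loss(predicted, preferences, eps=1e-12):
    """Mean binary cross-entropy between predicted and target probabilities."""
    total = 0.0
    for p, target in zip(predicted, preferences):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(predicted)


def accuracy(predicted, preferences):
    """Fraction of pairs where prediction and target fall on the same side of 0.5."""
    hits = sum((p > 0.5) == (t > 0.5) for p, t in zip(predicted, preferences))
    return hits / len(predicted)
```

A prediction of 0.5 for a pair whose target is 1 costs ln 2, while a confident correct prediction costs close to zero.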
- training: bool#
- class imitation.algorithms.preference_comparisons.EnsembleTrainer(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#
Bases:
BasicRewardTrainer
Train a reward ensemble.
- __init__(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#
Initialize the reward model trainer.
- Parameters
preference_model (PreferenceModel) – the preference model to train the reward network.
loss (RewardLoss) – the loss to use.
rng (Generator) – random state for the internal RNG used in bagging.
batch_size (int) – number of fragment pairs per batch.
minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.
epochs (int) – number of epochs in each training iteration (can be adjusted on the fly by specifying an epoch_multiplier in self.train() if longer training is desired in specific cases).
lr (float) – the learning rate.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
regularizer_factory (Optional[RegularizerFactory]) – A factory for creating a regularizer. If None, no regularization is used.
- Raises
TypeError – if model is not a RewardEnsemble.
- property logger: HierarchicalLogger#
- Return type
HierarchicalLogger
- regularizer: Optional[Regularizer]#
- class imitation.algorithms.preference_comparisons.Fragmenter(custom_logger=None)[source]#
Bases:
ABC
Class for creating pairs of trajectory fragments from a set of trajectories.
- __init__(custom_logger=None)[source]#
Initialize the fragmenter.
- Parameters
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- class imitation.algorithms.preference_comparisons.LossAndMetrics(loss: Tensor, metrics: Mapping[str, Tensor])[source]#
Bases:
tuple
Loss and auxiliary metrics for reward network training.
- loss: Tensor#
- metrics: Mapping[str, Tensor]#
- class imitation.algorithms.preference_comparisons.PreferenceComparisons(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]#
Bases:
BaseImitationAlgorithm
Main interface for reward learning using preference comparisons.
- __init__(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]#
Initialize the preference comparison trainer.
The loggers of all subcomponents are overridden with the logger used by this class.
- Parameters
trajectory_generator (TrajectoryGenerator) – generates trajectories while optionally training an RL agent on the learned reward function (can also be a sampler from a static dataset of trajectories though).
reward_model (RewardNet) – a RewardNet instance to be used for learning the reward.
num_iterations (int) – number of times to train the agent against the reward model and then train the reward model against newly gathered preferences.
fragmenter (Optional[Fragmenter]) – takes in a set of trajectories and returns pairs of fragments for which preferences will be gathered. These fragments could be random, or they could be selected more deliberately (active learning). Default is a random fragmenter.
preference_gatherer (Optional[PreferenceGatherer]) – how to get preferences between trajectory fragments. Default (and currently the only option) is to use synthetic preferences based on ground-truth rewards. Human preferences could be implemented here in the future.
reward_trainer (Optional[RewardTrainer]) – trains the reward model based on pairs of fragments and associated preferences. Default is to use the preference model and loss function from DRLHP.
comparison_queue_size (Optional[int]) – the maximum number of comparisons to keep in the queue for training the reward model. If None, the queue will grow without bound as new comparisons are added.
fragment_length (int) – number of timesteps per fragment that is used to elicit preferences.
transition_oversampling (float) – factor by which to oversample transitions before creating fragments. Since fragments are sampled with replacement, this is usually chosen > 1 to avoid having the same transition in too many fragments.
initial_comparison_frac (float) – fraction of the total_comparisons argument to train() that will be sampled before the rest of training begins (using a randomly initialized agent). This can be used to pretrain the reward model before the agent is trained on the learned reward, to help avoid irreversibly learning a bad policy from an untrained reward. Note that there will often be some additional pretraining comparisons since comparisons_per_iteration won’t exactly divide the total number of comparisons. How many such comparisons there are depends discontinuously on total_comparisons and comparisons_per_iteration.
initial_epoch_multiplier (float) – before agent training begins, train the reward model for this many more epochs than usual (on fragments sampled from a random agent).
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.
rng (Optional[Generator]) – random number generator to use for initializing subcomponents such as fragmenter. Only used when default components are used; if you instantiate your own fragmenter, preference gatherer, etc., you are responsible for seeding them!
query_schedule (Union[str, Callable[[float], float]]) – one of (“constant”, “hyperbolic”, “inverse_quadratic”), or a function that takes in a float between 0 and 1 inclusive, representing a fraction of the total number of timesteps elapsed up to some time T, and returns a potentially unnormalized probability indicating the fraction of total_comparisons that should be queried at that iteration. This function will be called num_iterations times in __init__() with values from np.linspace(0, 1, num_iterations) as input. The outputs will be normalized to sum to 1 and then used to apportion the comparisons among the num_iterations iterations.
- Raises
ValueError – if query_schedule is not a valid string or callable.
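To make the query_schedule normalization concrete, here is a small sketch that evaluates a schedule at evenly spaced points in [0, 1], normalizes the outputs, and splits a comparison budget across iterations. The function name, the hyperbolic form 1 / (1 + t), and the largest-remainder rounding are assumptions for illustration; the library may define its named schedules and its rounding differently.

```python
def apportion_comparisons(schedule, num_iterations, total_comparisons):
    """Split total_comparisons across iterations in proportion to schedule(t)."""
    # Evenly spaced points in [0, 1], mirroring np.linspace(0, 1, num_iterations).
    if num_iterations == 1:
        points = [0.0]
    else:
        points = [i / (num_iterations - 1) for i in range(num_iterations)]
    weights = [schedule(t) for t in points]
    total_weight = sum(weights)
    shares = [w / total_weight * total_comparisons for w in weights]
    # Largest-remainder rounding so that the integer counts sum to the budget.
    counts = [int(s) for s in shares]
    leftover = total_comparisons - sum(counts)
    by_remainder = sorted(
        range(num_iterations), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


constant = apportion_comparisons(lambda t: 1.0, 4, 100)
hyperbolic = apportion_comparisons(lambda t: 1.0 / (1.0 + t), 4, 100)
```

A constant schedule spreads queries evenly, while a decaying schedule front-loads them so that more preferences are gathered early in training.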
- allow_variable_horizon: bool#
If True, allow variable horizon trajectories; otherwise error if detected.
- train(total_timesteps, total_comparisons, callback=None)[source]#
Train the reward model and the policy if applicable.
- Parameters
total_timesteps (int) – number of environment interaction steps.
total_comparisons (int) – number of preferences to gather in total.
callback (Optional[Callable[[int], None]]) – callback functions called at the end of each iteration.
- Return type
Mapping[str, Any]
- Returns
A dictionary with final metrics such as loss and accuracy of the reward model.
- class imitation.algorithms.preference_comparisons.PreferenceDataset(max_size=None)[source]#
Bases:
Dataset
A PyTorch Dataset for preference comparisons.
Each item is a tuple consisting of two trajectory fragments and a probability that fragment 1 is preferred over fragment 2.
This dataset is meant to be generated piece by piece during the training process, which is why data can be added via the .push() method.
- __init__(max_size=None)[source]#
Builds an empty PreferenceDataset.
- Parameters
max_size (Optional[int]) – Maximum number of preference comparisons to store in the dataset. If None (default), the dataset can grow indefinitely. Otherwise, the dataset acts as a FIFO queue, and the oldest comparisons are evicted when push() is called and the dataset is at max capacity.
- push(fragments, preferences)[source]#
Add more samples to the dataset.
- Parameters
fragments (Sequence[Tuple[TrajectoryWithRew, TrajectoryWithRew]]) – list of pairs of trajectory fragments to add.
preferences (ndarray) – corresponding preference probabilities (probability that fragment 1 is preferred over fragment 2).
- Raises
ValueError – preferences shape does not match fragments or has non-float32 dtype.
- Return type
None
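The FIFO eviction described for max_size can be sketched with a bounded deque. The class below is a hypothetical stand-in that stores arbitrary objects in place of trajectory fragments; it only mirrors the documented push() and capacity behavior and is not the library's implementation.

```python
from collections import deque


class FifoPreferenceStore:
    """Toy stand-in for a bounded preference dataset (hypothetical class)."""

    def __init__(self, max_size=None):
        # deque(maxlen=None) is unbounded; otherwise the oldest items are dropped.
        self.items = deque(maxlen=max_size)

    def push(self, fragment_pairs, preferences):
        if len(fragment_pairs) != len(preferences):
            raise ValueError("fragment_pairs and preferences must have equal length")
        self.items.extend(zip(fragment_pairs, preferences))

    def __len__(self):
        return len(self.items)


store = FifoPreferenceStore(max_size=3)
store.push([("a1", "a2"), ("b1", "b2")], [1.0, 0.0])
store.push([("c1", "c2"), ("d1", "d2")], [0.5, 1.0])
```

After the second push, the store holds three comparisons and the oldest one, ("a1", "a2"), has been evicted.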
- class imitation.algorithms.preference_comparisons.PreferenceGatherer(rng=None, custom_logger=None)[source]#
Bases:
ABC
Base class for gathering preference comparisons between trajectory fragments.
- __init__(rng=None, custom_logger=None)[source]#
Initializes the preference gatherer.
- Parameters
rng (Optional[Generator]) – random number generator, if applicable.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- class imitation.algorithms.preference_comparisons.PreferenceModel(model, noise_prob=0.0, discount_factor=1.0, threshold=50)[source]#
Bases:
Module
Class to convert two fragments’ rewards into preference probability.
- __init__(model, noise_prob=0.0, discount_factor=1.0, threshold=50)[source]#
Create Preference Prediction Model.
- Parameters
model (RewardNet) – base model to compute reward.
noise_prob (float) – assumed probability with which the preference is uniformly random (used for the model of preference generation that is used for the loss).
discount_factor (float) – the model of preference generation uses a softmax of returns as the probability that a fragment is preferred. This is the discount factor used to calculate those returns. Default is 1, i.e. undiscounted sums of rewards (which is what the DRLHP paper uses).
threshold (float) – the preference model used to compute the loss contains a softmax of returns. To avoid overflows, we clip differences in returns that are above this threshold. This threshold is therefore in logspace. The default value of 50 means that probabilities below 2e-22 are rounded up to 2e-22.
- Raises
ValueError – if RewardEnsemble is wrapped around a class other than AddSTDRewardWrapper.
- forward(fragment_pairs)[source]#
Computes the preference probability of the first fragment for all pairs.
- Note: This function passes the gradient through for non-ensemble models.
For an ensemble model, this function should not be used for loss calculation. It can be used where gradients are not required, such as during active selection or at inference time. Therefore, the EnsembleTrainer passes each member network through this function, selected via ensemble_member_index, instead of passing the EnsembleNetwork object.
- Parameters
fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – batch of pairs of fragments.
- Return type
Tuple[Tensor, Optional[Tensor]]
- Returns
A tuple with the first element as the preference probabilities for the first fragment for all fragment pairs given by the network(s). If the ground truth rewards are available, it also returns gt preference probabilities in the second element of the tuple (else None). Reward probability shape - (num_fragment_pairs, ) for non-ensemble reward network and (num_fragment_pairs, num_networks) for an ensemble of networks.
- probability(rews1, rews2)[source]#
Computes the Boltzmann-rational probability that the first trajectory is best.
- Parameters
rews1 (Tensor) – array/matrix of rewards for the first trajectory fragment. matrix for ensemble models and array for non-ensemble models.
rews2 (Tensor) – array/matrix of rewards for the second trajectory fragment. matrix for ensemble models and array for non-ensemble models.
- Return type
Tensor
- Returns
The softmax of the difference between the (discounted) return of the first and second trajectory. Shape - (num_ensemble_members, ) for ensemble model and () for non-ensemble model which is a torch scalar.
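For the single-model case, the probability described above can be sketched as a logistic function of the clipped difference in discounted returns. How noise_prob enters the formula (a mixture with a uniformly random choice) is an assumption based on the parameter description, and the function below is illustrative rather than the library's code.

```python
import math


def preference_probability(
    rews1, rews2, noise_prob=0.0, discount_factor=1.0, threshold=50.0
):
    """Probability that fragment 1 is preferred, for one reward model."""
    # Discounted return of fragment 2 minus that of fragment 1.
    diff = sum(
        discount_factor**t * (r2 - r1)
        for t, (r1, r2) in enumerate(zip(rews1, rews2))
    )
    # Clip in log-space so that exp() cannot overflow.
    diff = max(-threshold, min(threshold, diff))
    # A two-way softmax over returns reduces to a logistic function.
    model_prob = 1.0 / (1.0 + math.exp(diff))
    # Assumed mixture: with probability noise_prob the choice is a coin flip.
    return noise_prob * 0.5 + (1.0 - noise_prob) * model_prob
```

With the default threshold of 50, even an arbitrarily large return gap yields a probability no smaller than roughly 2e-22, which matches the rounding behavior described for threshold.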
- rewards(transitions)[source]#
Computes the reward for all transitions.
- Parameters
transitions (Transitions) – batch of obs-act-obs-done for a fragment of a trajectory.
- Return type
Tensor
- Returns
The reward given by the network(s) for all the transitions. Shape - (num_transitions, ) for a single reward network and (num_transitions, num_networks) for an ensemble of networks.
- training: bool#
- class imitation.algorithms.preference_comparisons.RandomFragmenter(rng, warning_threshold=10, custom_logger=None)[source]#
Bases:
Fragmenter
Sample fragments of trajectories uniformly at random with replacement.
Note that each fragment is part of a single episode and has a fixed length. This leads to a bias: transitions at the beginning and at the end of episodes are less likely to occur as part of fragments (this affects the first and last fragment_length transitions).
An additional bias is that trajectories shorter than the desired fragment length are never used.
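A minimal sketch of this sampling scheme follows, working only with trajectory lengths. The function name is hypothetical, and weighting the choice of trajectory by its length is an assumption; the sketch is meant to show why short trajectories are never used and why every fragment lies within a single episode.

```python
import random


def sample_fragment(trajectory_lengths, fragment_length, rng):
    """Pick (trajectory index, start, end) for one fixed-length fragment."""
    # Trajectories shorter than the fragment length can never be used.
    eligible = [
        (i, n) for i, n in enumerate(trajectory_lengths) if n >= fragment_length
    ]
    if not eligible:
        raise ValueError("no trajectory is long enough for the requested fragment")
    # Assumption: longer trajectories are proportionally more likely to be chosen.
    idx, n = rng.choices(eligible, weights=[n for _, n in eligible], k=1)[0]
    # Start positions are uniform, so the fragment never crosses an episode boundary.
    start = rng.randint(0, n - fragment_length)
    return idx, start, start + fragment_length


rng = random.Random(0)
samples = [sample_fragment([5, 50, 200], 100, rng) for _ in range(20)]
```

Because samples are drawn independently, the same fragment can appear more than once, which is what sampling with replacement means here.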
- __init__(rng, warning_threshold=10, custom_logger=None)[source]#
Initialize the fragmenter.
- Parameters
rng (Generator) – the random number generator.
warning_threshold (int) – give a warning if the number of available transitions is less than this many times the number of required samples. Set to 0 to disable this warning.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- class imitation.algorithms.preference_comparisons.RewardLoss(*args, **kwargs)[source]#
Bases:
Module, ABC
A loss function over preferences.
- abstract forward(fragment_pairs, preferences, preference_model)[source]#
Computes the loss.
- Parameters
fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – Batch consisting of pairs of trajectory fragments.
preferences (ndarray) – The probability that the first fragment is preferred over the second. Typically 0, 1 or 0.5 (tie).
preference_model (PreferenceModel) – model to predict the preferred fragment from a pair.
- Returns
loss: the loss.
metrics: a dictionary of metrics that can be logged.
- Return type
LossAndMetrics
- training: bool#
- class imitation.algorithms.preference_comparisons.RewardTrainer(preference_model, custom_logger=None)[source]#
Bases:
ABC
Abstract base class for training reward models using preference comparisons.
This class contains only the actual reward model training code; it is not responsible for gathering trajectories and preferences or for agent training (see PreferenceComparisons for that).
- __init__(preference_model, custom_logger=None)[source]#
Initialize the reward trainer.
- Parameters
preference_model (PreferenceModel) – the preference model to train the reward network.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- property logger: HierarchicalLogger#
- Return type
HierarchicalLogger
- train(dataset, epoch_multiplier=1.0)[source]#
Train the reward model on a batch of fragment pairs and preferences.
- Parameters
dataset (PreferenceDataset) – the dataset of preference comparisons to train on.
epoch_multiplier (float) – how much longer to train for than usual (measured relatively).
- Return type
None
- class imitation.algorithms.preference_comparisons.SyntheticGatherer(temperature=1, discount_factor=1, sample=True, rng=None, threshold=50, custom_logger=None)[source]#
Bases:
PreferenceGatherer
Computes synthetic preferences using ground-truth environment rewards.
- __init__(temperature=1, discount_factor=1, sample=True, rng=None, threshold=50, custom_logger=None)[source]#
Initialize the synthetic preference gatherer.
- Parameters
temperature (float) – the preferences are sampled from a softmax, this is the temperature used for sampling. temperature=0 leads to deterministic results (for equal rewards, 0.5 will be returned).
discount_factor (float) – discount factor that is used to compute how good a fragment is. Default is to use undiscounted sums of rewards (as in the DRLHP paper).
sample (bool) – if True (default), the preferences are 0 or 1, sampled from a Bernoulli distribution (or 0.5 in the case of ties with zero temperature). If False, then the underlying Bernoulli probabilities are returned instead.
rng (Optional[Generator]) – random number generator, only used if temperature > 0 and sample=True.
threshold (float) – preferences are sampled from a softmax of returns. To avoid overflows, we clip differences in returns that are above this threshold (after multiplying with temperature). This threshold is therefore in logspace. The default value of 50 means that probabilities below 2e-22 are rounded up to 2e-22.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- Raises
ValueError – if sample is true and no random state is provided.
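The interaction of temperature, sample, and rng can be sketched for a single pair of fragment returns as follows. The function is hypothetical: it takes already-summed returns, scales the return difference by dividing by the temperature (the usual softmax convention, assumed here), and raises at call time rather than at construction.

```python
import math
import random


def synthetic_preference(
    return1, return2, temperature=1.0, sample=True, rng=None, threshold=50.0
):
    """Preference for fragment 1 given ground-truth returns (illustrative)."""
    if temperature == 0:
        # Deterministic: 1 or 0 for a strict winner, 0.5 for a tie.
        if return1 == return2:
            return 0.5
        return float(return1 > return2)
    # Assumption: the return difference is scaled by 1 / temperature, then clipped.
    diff = (return2 - return1) / temperature
    diff = max(-threshold, min(threshold, diff))
    prob = 1.0 / (1.0 + math.exp(diff))
    if not sample:
        return prob  # the underlying Bernoulli probability
    if rng is None:
        raise ValueError("sampling requires a random number generator")
    return float(rng.random() < prob)


label = synthetic_preference(3.0, 1.0, rng=random.Random(0))
prob = synthetic_preference(3.0, 1.0, sample=False)
```

Setting sample=False yields soft labels, which can be useful when the noise of Bernoulli sampling is not wanted.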
- class imitation.algorithms.preference_comparisons.TrajectoryDataset(trajectories, rng, custom_logger=None)[source]#
Bases:
TrajectoryGenerator
A fixed dataset of trajectories.
- __init__(trajectories, rng, custom_logger=None)[source]#
Creates a dataset from the given trajectories.
- Parameters
trajectories (Sequence[TrajectoryWithRew]) – the dataset of rollouts.
rng (Generator) – RNG used for shuffling dataset.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- sample(steps)[source]#
Sample a batch of trajectories.
- Parameters
steps (int) – All trajectories taken together should have at least this many steps.
- Return type
Sequence[TrajectoryWithRew]
- Returns
A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).
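One way to satisfy the "at least this many steps" contract from a fixed dataset is to shuffle the trajectories and take them until the step budget is met. The sketch below assumes this strategy and represents each trajectory as a plain list whose length is its number of steps. The function name and the error raised when the dataset is too small are illustrative choices.

```python
import random


def sample_trajectories(trajectories, steps, rng):
    """Return shuffled trajectories whose total length is at least `steps`."""
    order = list(trajectories)
    rng.shuffle(order)  # shuffle a copy so the dataset itself is unchanged
    picked, total = [], 0
    for traj in order:
        if total >= steps:
            break
        picked.append(traj)
        total += len(traj)
    if total < steps:
        raise ValueError("dataset does not contain enough steps")
    return picked


dataset = [[0] * 10 for _ in range(5)]
batch = sample_trajectories(dataset, 25, random.Random(0))
```

With five 10-step trajectories and a budget of 25 steps, three trajectories are returned, for a total of 30 steps.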
- class imitation.algorithms.preference_comparisons.TrajectoryGenerator(custom_logger=None)[source]#
Bases:
ABC
Generator of trajectories with optional training logic.
- __init__(custom_logger=None)[source]#
Builds TrajectoryGenerator.
- Parameters
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- property logger: HierarchicalLogger#
- Return type
HierarchicalLogger
- abstract sample(steps)[source]#
Sample a batch of trajectories.
- Parameters
steps (int) – All trajectories taken together should have at least this many steps.
- Return type
Sequence[TrajectoryWithRew]
- Returns
A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).
- train(steps, **kwargs)[source]#
Train an agent if the trajectory generator uses one.
By default, this method does nothing and doesn’t need to be overridden in subclasses that don’t require training.
- Parameters
steps (int) – number of environment steps to train for.
**kwargs – additional keyword arguments to pass on to the training procedure.
- Return type
None
- imitation.algorithms.preference_comparisons.preference_collate_fn(batch)[source]#
- Return type
Tuple[Sequence[Tuple[TrajectoryWithRew, TrajectoryWithRew]], ndarray]