imitation.algorithms.preference_comparisons#

Learning reward models using preference comparisons.

Trains a reward model and optionally a policy based on preferences between trajectory fragments.

Functions

get_base_model(reward_model)

rtype

RewardNet

preference_collate_fn(batch)

rtype

Tuple[Sequence[Tuple[TrajectoryWithRew, TrajectoryWithRew]], ndarray]

Classes

ActiveSelectionFragmenter(preference_model, ...)

Sample fragments of trajectories based on active selection.

AgentTrainer(algorithm, reward_fn, venv, rng)

Wrapper for training an SB3 algorithm on an arbitrary reward function.

BasicRewardTrainer(preference_model, loss, rng)

Train a basic reward model.

CrossEntropyRewardLoss()

Compute the cross entropy reward loss.

EnsembleTrainer(preference_model, loss, rng)

Train a reward ensemble.

Fragmenter([custom_logger])

Class for creating pairs of trajectory fragments from a set of trajectories.

LossAndMetrics(loss, metrics)

Loss and auxiliary metrics for reward network training.

PreferenceComparisons(trajectory_generator, ...)

Main interface for reward learning using preference comparisons.

PreferenceDataset([max_size])

A PyTorch Dataset for preference comparisons.

PreferenceGatherer([rng, custom_logger])

Base class for gathering preference comparisons between trajectory fragments.

PreferenceModel(model[, noise_prob, ...])

Class to convert two fragments' rewards into preference probability.

RandomFragmenter(rng[, warning_threshold, ...])

Sample fragments of trajectories uniformly at random with replacement.

RewardLoss(*args, **kwargs)

A loss function over preferences.

RewardTrainer(preference_model[, custom_logger])

Abstract base class for training reward models using preference comparisons.

SyntheticGatherer([temperature, ...])

Computes synthetic preferences using ground-truth environment rewards.

TrajectoryDataset(trajectories, rng[, ...])

A fixed dataset of trajectories.

TrajectoryGenerator([custom_logger])

Generator of trajectories with optional training logic.

class imitation.algorithms.preference_comparisons.ActiveSelectionFragmenter(preference_model, base_fragmenter, fragment_sample_factor, uncertainty_on='logit', custom_logger=None)[source]#

Bases: Fragmenter

Sample fragments of trajectories based on active selection.

Actively picks the fragment pairs with the highest uncertainty (variance) of rewards/probabilities/predictions from the ensemble model.

__init__(preference_model, base_fragmenter, fragment_sample_factor, uncertainty_on='logit', custom_logger=None)[source]#

Initialize the active selection fragmenter.

Parameters
  • preference_model (PreferenceModel) – an ensemble model that predicts the preference of the first fragment over the other.

  • base_fragmenter (Fragmenter) – fragmenter instance to get fragment pairs from trajectories

  • fragment_sample_factor (float) – the factor of the number of fragment pairs to sample from the base_fragmenter

  • uncertainty_on (str) – the variable to calculate the variance on. Can be logit|probability|label.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

Raises

ValueError – if the preference model is not wrapped over an ensemble of networks.

raise_uncertainty_on_not_supported()[source]#
Return type

NoReturn

property uncertainty_on: str#
Return type

str

variance_estimate(rews1, rews2)[source]#

Gets the variance estimate from the rewards of a fragment pair.

Parameters
  • rews1 (Tensor) – rewards obtained by all the ensemble models for the first fragment. Shape - (fragment_length, num_ensemble_members)

  • rews2 (Tensor) – rewards obtained by all the ensemble models for the second fragment. Shape - (fragment_length, num_ensemble_members)

Return type

float

Returns

the variance estimate based on the uncertainty_on flag.
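
Example (a minimal sketch of active selection over a reward ensemble; ensemble_reward_net, trajectories, and rng are assumed to exist, and the commented fragmenter call reflects my understanding of the Fragmenter interface):

from imitation.algorithms import preference_comparisons

base_fragmenter = preference_comparisons.RandomFragmenter(rng=rng)
# The preference model must wrap an ensemble of reward networks,
# otherwise __init__ raises ValueError.
preference_model = preference_comparisons.PreferenceModel(ensemble_reward_net)
fragmenter = preference_comparisons.ActiveSelectionFragmenter(
    preference_model=preference_model,
    base_fragmenter=base_fragmenter,
    fragment_sample_factor=5.0,  # draw 5x as many candidate pairs from base_fragmenter
    uncertainty_on="logit",      # rank pairs by the variance of the ensemble logits
)
# pairs = fragmenter(trajectories, fragment_length=100, num_pairs=50)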

class imitation.algorithms.preference_comparisons.AgentTrainer(algorithm, reward_fn, venv, rng, exploration_frac=0.0, switch_prob=0.5, random_prob=0.5, custom_logger=None)[source]#

Bases: TrajectoryGenerator

Wrapper for training an SB3 algorithm on an arbitrary reward function.

__init__(algorithm, reward_fn, venv, rng, exploration_frac=0.0, switch_prob=0.5, random_prob=0.5, custom_logger=None)[source]#

Initialize the agent trainer.

Parameters
  • algorithm (BaseAlgorithm) – the stable-baselines algorithm to use for training.

  • reward_fn (Union[RewardFn, RewardNet]) – either a RewardFn or a RewardNet instance that will supply the rewards used for training the agent.

  • venv (VecEnv) – vectorized environment to train in.

  • rng (Generator) – random number generator used for exploration and for sampling.

  • exploration_frac (float) – fraction of the trajectories that will be generated partially randomly rather than only by the agent when sampling.

  • switch_prob (float) – the probability of switching the current policy at each step for the exploratory samples.

  • random_prob (float) – the probability of picking the random policy when switching during exploration.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

property logger: HierarchicalLogger#
Return type

HierarchicalLogger

sample(steps)[source]#

Sample a batch of trajectories.

Parameters

steps (int) – All trajectories taken together should have at least this many steps.

Return type

Sequence[TrajectoryWithRew]

Returns

A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).

train(steps, **kwargs)[source]#

Train the agent using the reward function specified during instantiation.

Parameters
  • steps (int) – number of environment timesteps to train for

  • **kwargs – other keyword arguments to pass to BaseAlgorithm.train()

Raises

RuntimeError – if transitions are left in self.buffering_wrapper; call self.sample first to clear them.

Return type

None
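
Example (a short sketch of driving an AgentTrainer directly; agent — an SB3 PPO instance — reward_net, venv, and rng are assumed to exist):

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    rng=rng,
    exploration_frac=0.05,  # 5% of sampled trajectories are generated partially at random
)
trajectory_generator.train(steps=10_000)           # optimize the agent on the learned reward
trajectories = trajectory_generator.sample(2_000)  # >= 2000 steps, with environment rewards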

class imitation.algorithms.preference_comparisons.BasicRewardTrainer(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#

Bases: RewardTrainer

Train a basic reward model.

__init__(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#

Initialize the reward model trainer.

Parameters
  • preference_model (PreferenceModel) – the preference model to train the reward network.

  • loss (RewardLoss) – the loss to use

  • rng (Generator) – the random number generator to use for splitting the dataset into training and validation.

  • batch_size (int) – number of fragment pairs per batch

  • minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.

  • epochs (int) – number of epochs in each training iteration (can be adjusted on the fly by specifying an epoch_multiplier in self.train() if longer training is desired in specific cases).

  • lr (float) – the learning rate

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • regularizer_factory (Optional[RegularizerFactory]) – if you would like to apply regularization during training, specify a regularizer factory here. The factory will be used to construct a regularizer. See imitation.regularization.RegularizerFactory for more details.

Raises

ValueError – if the batch size is not a multiple of the minibatch size.

regularizer: Optional[Regularizer]#
property requires_regularizer_update: bool#

Whether the regularizer requires updating.

Return type

bool

Returns

If true, this means that a validation dataset will be used.
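
Example (a sketch of a reward trainer with gradient accumulation, assuming preference_model and rng already exist; each 32-pair batch is processed as four minibatches of 8 before an optimizer step):

reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    rng=rng,
    batch_size=32,
    minibatch_size=8,  # must evenly divide batch_size
    epochs=3,
    lr=1e-3,
)
# reward_trainer.train(dataset, epoch_multiplier=2.0)  # train 2x longer than usual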

class imitation.algorithms.preference_comparisons.CrossEntropyRewardLoss[source]#

Bases: RewardLoss

Compute the cross entropy reward loss.

__init__()[source]#

Create cross entropy reward loss.

forward(fragment_pairs, preferences, preference_model)[source]#

Computes the loss.

Parameters
  • fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – Batch consisting of pairs of trajectory fragments.

  • preferences (ndarray) – The probability that the first fragment is preferred over the second. Typically 0, 1 or 0.5 (tie).

  • preference_model (PreferenceModel) – model to predict the preferred fragment from a pair.

Return type

LossAndMetrics

Returns

The cross-entropy loss between the probability predicted by the reward model and the target probabilities in preferences. Metrics are accuracy, and gt_reward_loss if the ground-truth reward is available.

training: bool#
class imitation.algorithms.preference_comparisons.EnsembleTrainer(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#

Bases: BasicRewardTrainer

Train a reward ensemble.

__init__(preference_model, loss, rng, batch_size=32, minibatch_size=None, epochs=1, lr=0.001, custom_logger=None, regularizer_factory=None)[source]#

Initialize the reward model trainer.

Parameters
  • preference_model (PreferenceModel) – the preference model to train the reward network.

  • loss (RewardLoss) – the loss to use

  • rng (Generator) – random state for the internal RNG used in bagging

  • batch_size (int) – number of fragment pairs per batch

  • minibatch_size (Optional[int]) – size of minibatch to calculate gradients over. The gradients are accumulated until batch_size examples are processed before making an optimization step. This is useful in GPU training to reduce memory usage, since fewer examples are loaded into memory at once, facilitating training with larger batch sizes, but is generally slower. Must be a factor of batch_size. Optional, defaults to batch_size.

  • epochs (int) – number of epochs in each training iteration (can be adjusted on the fly by specifying an epoch_multiplier in self.train() if longer training is desired in specific cases).

  • lr (float) – the learning rate

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • regularizer_factory (Optional[RegularizerFactory]) – A factory for creating a regularizer. If None, no regularization is used.

Raises

TypeError – if model is not a RewardEnsemble.

property logger: HierarchicalLogger#
Return type

HierarchicalLogger

regularizer: Optional[Regularizer]#
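
Example (a sketch of constructing a reward ensemble and its trainer, assuming venv and rng exist; the reward_nets imports and the RewardEnsemble constructor reflect my reading of imitation.rewards and may differ between versions):

from imitation.rewards import reward_nets

members = [
    reward_nets.BasicRewardNet(venv.observation_space, venv.action_space)
    for _ in range(3)
]
ensemble = reward_nets.RewardEnsemble(venv.observation_space, venv.action_space, members)
preference_model = preference_comparisons.PreferenceModel(ensemble)
ensemble_trainer = preference_comparisons.EnsembleTrainer(
    preference_model=preference_model,  # must wrap a RewardEnsemble, else TypeError
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    rng=rng,  # used for bagging the preference dataset across ensemble members
)
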
class imitation.algorithms.preference_comparisons.Fragmenter(custom_logger=None)[source]#

Bases: ABC

Class for creating pairs of trajectory fragments from a set of trajectories.

__init__(custom_logger=None)[source]#

Initialize the fragmenter.

Parameters

custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

class imitation.algorithms.preference_comparisons.LossAndMetrics(loss: Tensor, metrics: Mapping[str, Tensor])[source]#

Bases: tuple

Loss and auxiliary metrics for reward network training.

loss: Tensor#
metrics: Mapping[str, Tensor]#
class imitation.algorithms.preference_comparisons.PreferenceComparisons(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]#

Bases: BaseImitationAlgorithm

Main interface for reward learning using preference comparisons.

__init__(trajectory_generator, reward_model, num_iterations, fragmenter=None, preference_gatherer=None, reward_trainer=None, comparison_queue_size=None, fragment_length=100, transition_oversampling=1, initial_comparison_frac=0.1, initial_epoch_multiplier=200.0, custom_logger=None, allow_variable_horizon=False, rng=None, query_schedule='hyperbolic')[source]#

Initialize the preference comparison trainer.

The loggers of all subcomponents are overridden with the logger used by this class.

Parameters
  • trajectory_generator (TrajectoryGenerator) – generates trajectories while optionally training an RL agent on the learned reward function (can also be a sampler from a static dataset of trajectories though).

  • reward_model (RewardNet) – a RewardNet instance to be used for learning the reward

  • num_iterations (int) – number of times to train the agent against the reward model and then train the reward model against newly gathered preferences.

  • fragmenter (Optional[Fragmenter]) – takes in a set of trajectories and returns pairs of fragments for which preferences will be gathered. These fragments could be random, or they could be selected more deliberately (active learning). Default is a random fragmenter.

  • preference_gatherer (Optional[PreferenceGatherer]) – how to get preferences between trajectory fragments. Default (and currently the only option) is to use synthetic preferences based on ground-truth rewards. Human preferences could be implemented here in the future.

  • reward_trainer (Optional[RewardTrainer]) – trains the reward model based on pairs of fragments and associated preferences. Default is to use the preference model and loss function from DRLHP.

  • comparison_queue_size (Optional[int]) – the maximum number of comparisons to keep in the queue for training the reward model. If None, the queue will grow without bound as new comparisons are added.

  • fragment_length (int) – number of timesteps per fragment that is used to elicit preferences

  • transition_oversampling (float) – factor by which to oversample transitions before creating fragments. Since fragments are sampled with replacement, this is usually chosen > 1 to avoid having the same transition in too many fragments.

  • initial_comparison_frac (float) – fraction of the total_comparisons argument to train() that will be sampled before the rest of training begins (using a randomly initialized agent). This can be used to pretrain the reward model before the agent is trained on the learned reward, to help avoid irreversibly learning a bad policy from an untrained reward. Note that there will often be some additional pretraining comparisons since comparisons_per_iteration won’t exactly divide the total number of comparisons. How many such comparisons there are depends discontinuously on total_comparisons and comparisons_per_iteration.

  • initial_epoch_multiplier (float) – before agent training begins, train the reward model for this many more epochs than usual (on fragments sampled from a random agent).

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

  • allow_variable_horizon (bool) – If False (default), algorithm will raise an exception if it detects trajectories of different length during training. If True, overrides this safety check. WARNING: variable horizon episodes leak information about the reward via termination condition, and can seriously confound evaluation. Read https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html before overriding this.

  • rng (Optional[Generator]) – random number generator to use for initializing subcomponents such as fragmenter. Only used when default components are used; if you instantiate your own fragmenter, preference gatherer, etc., you are responsible for seeding them!

  • query_schedule (Union[str, Callable[[float], float]]) – one of (“constant”, “hyperbolic”, “inverse_quadratic”), or a function that takes in a float between 0 and 1 inclusive, representing a fraction of the total number of timesteps elapsed up to some time T, and returns a potentially unnormalized probability indicating the fraction of total_comparisons that should be queried at that iteration. This function will be called num_iterations times in __init__() with values from np.linspace(0, 1, num_iterations) as input. The outputs will be normalized to sum to 1 and then used to apportion the comparisons among the num_iterations iterations.

Raises

ValueError – if query_schedule is not a valid string or callable.
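
Example (an illustration of the query_schedule argument described above; the schedule can be any callable on [0, 1], and the name below is purely illustrative):

def front_loaded_schedule(t: float) -> float:
    # Unnormalized weight for the iteration at training fraction t.
    # PreferenceComparisons normalizes the outputs to sum to 1 and uses them
    # to apportion total_comparisons, so this queries more heavily early on.
    return 1.0 / (1.0 + 4.0 * t)

# PreferenceComparisons(..., query_schedule=front_loaded_schedule)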

allow_variable_horizon: bool#

If True, allow variable horizon trajectories; otherwise error if detected.

train(total_timesteps, total_comparisons, callback=None)[source]#

Train the reward model and the policy if applicable.

Parameters
  • total_timesteps (int) – number of environment interaction steps

  • total_comparisons (int) – number of preferences to gather in total

  • callback (Optional[Callable[[int], None]]) – callback functions called at the end of each iteration

Return type

Mapping[str, Any]

Returns

A dictionary with final metrics such as loss and accuracy of the reward model.
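
Example (a minimal end-to-end sketch on Pendulum-v1 with synthetic preferences, assuming stable-baselines3 PPO; the helper imports make_vec_env and BasicRewardNet reflect my understanding of the wider imitation API and may differ between versions):

import numpy as np
from stable_baselines3 import PPO

from imitation.algorithms import preference_comparisons
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = make_vec_env("Pendulum-v1", rng=rng)
reward_net = BasicRewardNet(venv.observation_space, venv.action_space)

fragmenter = preference_comparisons.RandomFragmenter(rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=3,
    rng=rng,
)

agent = PPO("MlpPolicy", venv)
trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,  # train the agent on the learned reward, not the env reward
    venv=venv,
    rng=rng,
    exploration_frac=0.05,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=5,
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    fragment_length=100,
    initial_comparison_frac=0.1,
    query_schedule="hyperbolic",
)
pref_comparisons.train(total_timesteps=50_000, total_comparisons=500)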

class imitation.algorithms.preference_comparisons.PreferenceDataset(max_size=None)[source]#

Bases: Dataset

A PyTorch Dataset for preference comparisons.

Each item is a tuple consisting of two trajectory fragments and a probability that fragment 1 is preferred over fragment 2.

This dataset is meant to be generated piece by piece during the training process, which is why data can be added via the .push() method.

__init__(max_size=None)[source]#

Builds an empty PreferenceDataset.

Parameters

max_size (Optional[int]) – Maximum number of preference comparisons to store in the dataset. If None (default), the dataset can grow indefinitely. Otherwise, the dataset acts as a FIFO queue, and the oldest comparisons are evicted when push() is called and the dataset is at max capacity.

static load(path)[source]#
Return type

PreferenceDataset

push(fragments, preferences)[source]#

Add more samples to the dataset.

Parameters
  • fragments (Sequence[Tuple[TrajectoryWithRew, TrajectoryWithRew]]) – list of pairs of trajectory fragments to add

  • preferences (ndarray) – corresponding preference probabilities (probability that fragment 1 is preferred over fragment 2)

Raises

ValueError – if preferences shape does not match fragments or has a non-float32 dtype.

Return type

None

save(path)[source]#
Return type

None
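
Example (a short sketch of growing, saving, and reloading a dataset; fragment_pairs is assumed to come from a fragmenter, and the file path is illustrative):

import numpy as np

dataset = preference_comparisons.PreferenceDataset(max_size=1_000)  # FIFO once full

# One probability per pair that fragment 1 is preferred; dtype must be float32,
# otherwise push() raises ValueError.
preferences = np.full(len(fragment_pairs), 0.5, dtype=np.float32)
dataset.push(fragment_pairs, preferences)

dataset.save("preferences.pkl")
restored = preference_comparisons.PreferenceDataset.load("preferences.pkl")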

class imitation.algorithms.preference_comparisons.PreferenceGatherer(rng=None, custom_logger=None)[source]#

Bases: ABC

Base class for gathering preference comparisons between trajectory fragments.

__init__(rng=None, custom_logger=None)[source]#

Initializes the preference gatherer.

Parameters
  • rng (Optional[Generator]) – random number generator, if applicable.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

class imitation.algorithms.preference_comparisons.PreferenceModel(model, noise_prob=0.0, discount_factor=1.0, threshold=50)[source]#

Bases: Module

Class to convert two fragments’ rewards into preference probability.

__init__(model, noise_prob=0.0, discount_factor=1.0, threshold=50)[source]#

Create Preference Prediction Model.

Parameters
  • model (RewardNet) – base model to compute reward.

  • noise_prob (float) – assumed probability with which the preference is uniformly random (used for the model of preference generation that is used for the loss).

  • discount_factor (float) – the model of preference generation uses a softmax of returns as the probability that a fragment is preferred. This is the discount factor used to calculate those returns. Default is 1, i.e. undiscounted sums of rewards (which is what the DRLHP paper uses).

  • threshold (float) – the preference model used to compute the loss contains a softmax of returns. To avoid overflows, we clip differences in returns that are above this threshold. This threshold is therefore in logspace. The default value of 50 means that probabilities below 2e-22 are rounded up to 2e-22.

Raises

ValueError – if RewardEnsemble is wrapped around a class other than AddSTDRewardWrapper.
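
How these parameters combine (in my reading of the DRLHP-style model): the probability that fragment \sigma^1 is preferred over \sigma^2 is a noise-mixed softmax over (discounted) returns, with the return difference clipped at threshold before exponentiation:

P(\sigma^1 \succ \sigma^2) = \frac{\text{noise\_prob}}{2} + (1 - \text{noise\_prob}) \cdot \frac{\exp\left(\sum_t \gamma^t r_\theta(s^1_t, a^1_t)\right)}{\exp\left(\sum_t \gamma^t r_\theta(s^1_t, a^1_t)\right) + \exp\left(\sum_t \gamma^t r_\theta(s^2_t, a^2_t)\right)}

where \gamma is discount_factor and r_\theta is the reward given by model.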

forward(fragment_pairs)[source]#

Computes the preference probability of the first fragment for all pairs.

Note: This function passes the gradient through for non-ensemble models.

For an ensemble model, this function should not be used for loss calculation. It can be used in cases where passing the gradient is not required, such as during active selection or at inference time. Therefore, the EnsembleTrainer passes each member network through this function individually (selected via ensemble_member_index) instead of passing the EnsembleNetwork object.

Parameters

fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – batch of pair of fragments.

Return type

Tuple[Tensor, Optional[Tensor]]

Returns

A tuple with the first element as the preference probabilities for the first fragment for all fragment pairs given by the network(s). If the ground truth rewards are available, it also returns gt preference probabilities in the second element of the tuple (else None). Reward probability shape - (num_fragment_pairs, ) for non-ensemble reward network and (num_fragment_pairs, num_networks) for an ensemble of networks.

probability(rews1, rews2)[source]#

Computes the Boltzmann-rational probability that the first trajectory is preferred.

Parameters
  • rews1 (Tensor) – array/matrix of rewards for the first trajectory fragment: a matrix for ensemble models, an array for non-ensemble models.

  • rews2 (Tensor) – array/matrix of rewards for the second trajectory fragment: a matrix for ensemble models, an array for non-ensemble models.

Return type

Tensor

Returns

The softmax of the difference between the (discounted) returns of the first and second trajectory. Shape - (num_ensemble_members,) for an ensemble model and () (a torch scalar) for a non-ensemble model.

rewards(transitions)[source]#

Computes the reward for all transitions.

Parameters

transitions (Transitions) – batch of obs-act-obs-done for a fragment of a trajectory.

Return type

Tensor

Returns

The reward given by the network(s) for all the transitions. Shape - (num_transitions,) for a single reward network and (num_transitions, num_networks) for an ensemble of networks.

training: bool#
class imitation.algorithms.preference_comparisons.RandomFragmenter(rng, warning_threshold=10, custom_logger=None)[source]#

Bases: Fragmenter

Sample fragments of trajectories uniformly at random with replacement.

Note that each fragment is part of a single episode and has a fixed length. This leads to a bias: transitions at the beginning and at the end of episodes are less likely to occur as part of fragments (this affects the first and last fragment_length transitions).

An additional bias is that trajectories shorter than the desired fragment length are never used.

__init__(rng, warning_threshold=10, custom_logger=None)[source]#

Initialize the fragmenter.

Parameters
  • rng (Generator) – the random number generator

  • warning_threshold (int) – give a warning if the number of available transitions is less than this many times the number of required samples. Set to 0 to disable this warning.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

class imitation.algorithms.preference_comparisons.RewardLoss(*args, **kwargs)[source]#

Bases: Module, ABC

A loss function over preferences.

abstract forward(fragment_pairs, preferences, preference_model)[source]#

Computes the loss.

Parameters
  • fragment_pairs (Sequence[Tuple[Trajectory, Trajectory]]) – Batch consisting of pairs of trajectory fragments.

  • preferences (ndarray) – The probability that the first fragment is preferred over the second. Typically 0, 1 or 0.5 (tie).

  • preference_model (PreferenceModel) – model to predict the preferred fragment from a pair.

Returns

  • loss: the loss

  • metrics: a dictionary of metrics that can be logged

Return type

LossAndMetrics

training: bool#
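
Example (a sketch of a custom loss that subclasses RewardLoss, using only the interface documented above; the squared-error loss is purely illustrative, not part of the library):

import torch as th

from imitation.algorithms import preference_comparisons


class MSERewardLoss(preference_comparisons.RewardLoss):
    """Illustrative squared-error loss between predicted and target preferences."""

    def forward(self, fragment_pairs, preferences, preference_model):
        # preference_model(fragment_pairs) returns predicted probabilities
        # (and ground-truth probabilities, if available, which we ignore here).
        probs, _ = preference_model(fragment_pairs)
        targets = th.as_tensor(preferences, dtype=probs.dtype, device=probs.device)
        loss = th.mean((probs - targets) ** 2)
        accuracy = ((probs > 0.5) == (targets > 0.5)).float().mean()
        return preference_comparisons.LossAndMetrics(loss=loss, metrics={"accuracy": accuracy})
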
class imitation.algorithms.preference_comparisons.RewardTrainer(preference_model, custom_logger=None)[source]#

Bases: ABC

Abstract base class for training reward models using preference comparisons.

This class contains only the actual reward model training code; it is not responsible for gathering trajectories and preferences or for agent training (see PreferenceComparisons for that).

__init__(preference_model, custom_logger=None)[source]#

Initialize the reward trainer.

Parameters
  • preference_model (PreferenceModel) – the preference model to train the reward network.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

property logger: HierarchicalLogger#
Return type

HierarchicalLogger

train(dataset, epoch_multiplier=1.0)[source]#

Train the reward model on a batch of fragment pairs and preferences.

Parameters
  • dataset (PreferenceDataset) – the dataset of preference comparisons to train on.

  • epoch_multiplier (float) – how much longer to train for than usual (measured relatively).

Return type

None

class imitation.algorithms.preference_comparisons.SyntheticGatherer(temperature=1, discount_factor=1, sample=True, rng=None, threshold=50, custom_logger=None)[source]#

Bases: PreferenceGatherer

Computes synthetic preferences using ground-truth environment rewards.

__init__(temperature=1, discount_factor=1, sample=True, rng=None, threshold=50, custom_logger=None)[source]#

Initialize the synthetic preference gatherer.

Parameters
  • temperature (float) – the preferences are sampled from a softmax; this is the temperature used for sampling. temperature=0 leads to deterministic results (for equal rewards, 0.5 will be returned).

  • discount_factor (float) – discount factor that is used to compute how good a fragment is. Default is to use undiscounted sums of rewards (as in the DRLHP paper).

  • sample (bool) – if True (default), the preferences are 0 or 1, sampled from a Bernoulli distribution (or 0.5 in the case of ties with zero temperature). If False, then the underlying Bernoulli probabilities are returned instead.

  • rng (Optional[Generator]) – random number generator, only used if temperature > 0 and sample=True

  • threshold (float) – preferences are sampled from a softmax of returns. To avoid overflows, we clip differences in returns that are above this threshold (after multiplying with temperature). This threshold is therefore in logspace. The default value of 50 means that probabilities below 2e-22 are rounded up to 2e-22.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

Raises

ValueError – if sample is True and no random state is provided.
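
Example (a brief sketch; with sample=False the gatherer returns the underlying Bernoulli probabilities rather than sampled 0/1 labels; rng is assumed to exist):

gatherer = preference_comparisons.SyntheticGatherer(
    temperature=1.0,      # 0 would make the preferences deterministic
    discount_factor=1.0,  # undiscounted returns, as in the DRLHP paper
    sample=False,         # return soft probabilities instead of 0/1 samples
    rng=rng,
)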

class imitation.algorithms.preference_comparisons.TrajectoryDataset(trajectories, rng, custom_logger=None)[source]#

Bases: TrajectoryGenerator

A fixed dataset of trajectories.

__init__(trajectories, rng, custom_logger=None)[source]#

Creates a fixed dataset from the given trajectories.

Parameters
  • trajectories (Sequence[TrajectoryWithRew]) – the dataset of rollouts.

  • rng (Generator) – RNG used for shuffling dataset.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

sample(steps)[source]#

Sample a batch of trajectories.

Parameters

steps (int) – All trajectories taken together should have at least this many steps.

Return type

Sequence[TrajectoryWithRew]

Returns

A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).
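
Example (a sketch of using a fixed offline dataset as the trajectory source; the imitation.data.serialize.load call and the path are my assumptions):

from imitation.data import serialize

trajectories = serialize.load("rollouts/pendulum")  # Sequence[TrajectoryWithRew]
trajectory_generator = preference_comparisons.TrajectoryDataset(trajectories, rng=rng)
# This generator can be passed to PreferenceComparisons in place of an AgentTrainer;
# its train() is a no-op, so only the reward model is learned.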

class imitation.algorithms.preference_comparisons.TrajectoryGenerator(custom_logger=None)[source]#

Bases: ABC

Generator of trajectories with optional training logic.

__init__(custom_logger=None)[source]#

Builds TrajectoryGenerator.

Parameters

custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

property logger: HierarchicalLogger#
Return type

HierarchicalLogger

abstract sample(steps)[source]#

Sample a batch of trajectories.

Parameters

steps (int) – All trajectories taken together should have at least this many steps.

Return type

Sequence[TrajectoryWithRew]

Returns

A list of sampled trajectories with rewards (which should be the environment rewards, not ones from a reward model).

train(steps, **kwargs)[source]#

Train an agent if the trajectory generator uses one.

By default, this method does nothing and doesn’t need to be overridden in subclasses that don’t require training.

Parameters
  • steps (int) – number of environment steps to train for.

  • **kwargs – additional keyword arguments to pass on to the training procedure.

Return type

None
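
Example (since only sample() is abstract, a custom generator can be quite small; the subclass below is an illustrative sketch assuming TrajectoryWithRew supports len(), not part of the library):

from imitation.algorithms import preference_comparisons


class ShuffledTrajectoryGenerator(preference_comparisons.TrajectoryGenerator):
    """Illustrative generator that serves a fixed list of trajectories in random order."""

    def __init__(self, trajectories, rng, custom_logger=None):
        super().__init__(custom_logger=custom_logger)
        self._trajectories = list(trajectories)
        self._rng = rng

    def sample(self, steps):
        # Return whole trajectories until at least `steps` transitions are covered.
        order = self._rng.permutation(len(self._trajectories))
        sampled, n_steps = [], 0
        for idx in order:
            traj = self._trajectories[idx]
            sampled.append(traj)
            n_steps += len(traj)
            if n_steps >= steps:
                break
        return sampled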

imitation.algorithms.preference_comparisons.get_base_model(reward_model)[source]#
Return type

RewardNet

imitation.algorithms.preference_comparisons.preference_collate_fn(batch)[source]#
Return type

Tuple[Sequence[Tuple[TrajectoryWithRew, TrajectoryWithRew]], ndarray]