imitation.rewards.reward_nets#

Constructs deep network reward models.

Functions

cnn_transpose(tens)

Transpose a (b,h,w,c)-formatted tensor to (b,c,h,w) format.

Classes

AddSTDRewardWrapper(base[, default_alpha])

Adds a multiple of the estimated standard deviation to mean reward.

BasicPotentialCNN(observation_space, hid_sizes)

Simple implementation of a potential using a CNN.

BasicPotentialMLP(observation_space, ...)

Simple implementation of a potential using an MLP.

BasicRewardNet(observation_space, action_space)

MLP that takes as input the state, action, next state and done flag.

BasicShapedRewardNet(observation_space, ...)

Shaped reward net based on MLPs.

CnnRewardNet(observation_space, action_space)

CNN that takes as input the state, action, next state and done flag.

ForwardWrapper(base)

An abstract RewardNetWrapper that changes the behavior of forward.

NormalizedRewardNet(base, normalize_output_layer)

A reward net that normalizes the output of its base network.

PredictProcessedWrapper(base)

An abstract RewardNetWrapper that changes the behavior of predict_processed.

RewardEnsemble(observation_space, ...)

A mean ensemble of reward networks.

RewardNet(observation_space, action_space[, ...])

Minimal abstract reward network.

RewardNetWithVariance(observation_space, ...)

A reward net that keeps track of its epistemic uncertainty through variance.

RewardNetWrapper(base)

Abstract class representing a wrapper modifying a RewardNet's functionality.

ShapedRewardNet(base, potential, discount_factor)

A RewardNet consisting of a base network and a potential shaping.
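
The reward networks below share a common interface: construct one from an observation and action space, then query it with NumPy batches via predict (or with tensors via forward). A minimal sketch, assuming gymnasium-style spaces (older versions of imitation use gym) and an installed imitation package; the spaces and batch size here are arbitrary:

    import numpy as np
    import gymnasium as gym

    from imitation.rewards import reward_nets

    # Arbitrary example spaces: 4-dimensional continuous observations, 2 discrete actions.
    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
    act_space = gym.spaces.Discrete(2)

    reward_net = reward_nets.BasicRewardNet(obs_space, act_space)

    # A random batch of transitions in the (state, action, next_state, done)
    # layout used throughout this module.
    batch_size = 8
    obs = np.stack([obs_space.sample() for _ in range(batch_size)])
    acts = np.array([act_space.sample() for _ in range(batch_size)])
    next_obs = np.stack([obs_space.sample() for _ in range(batch_size)])
    dones = np.zeros(batch_size, dtype=bool)

    rewards = reward_net.predict(obs, acts, next_obs, dones)  # NumPy array, shape (8,)

Later sketches on this page reuse obs_space, act_space and this transition batch.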

class imitation.rewards.reward_nets.AddSTDRewardWrapper(base, default_alpha=0.0)[source]#

Bases: PredictProcessedWrapper

Adds a multiple of the estimated standard deviation to mean reward.

__init__(base, default_alpha=0.0)[source]#

Create a reward network that adds a multiple of the standard deviation.

Parameters
  • base (RewardNetWithVariance) – A reward network that keeps track of its epistemic variance. This is used to compute the standard deviation.

  • default_alpha (float) – multiple of standard deviation to add to the reward mean. Defaults to 0.0.

Raises

TypeError – if base is not an instance of RewardNetWithVariance

predict_processed(state, action, next_state, done, alpha=None, **kwargs)[source]#

Compute a lower/upper confidence bound on the reward without gradients.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • alpha (Optional[float]) – multiple of standard deviation to add to the reward mean. Defaults to the value provided at initialization.

  • **kwargs – are not used

Return type

ndarray

Returns

Estimated lower confidence bounds on rewards of shape (batch_size,).
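
A sketch of how this wrapper might be used for pessimistic (or optimistic) reward estimates, reusing the spaces and transition batch from the first sketch on this page; the five-member ensemble is an arbitrary choice:

    from imitation.rewards import reward_nets

    # AddSTDRewardWrapper needs a RewardNetWithVariance, e.g. a RewardEnsemble.
    members = [reward_nets.BasicRewardNet(obs_space, act_space) for _ in range(5)]
    ensemble = reward_nets.RewardEnsemble(obs_space, act_space, members)

    # A negative alpha subtracts standard deviations, giving a lower confidence bound.
    lcb_net = reward_nets.AddSTDRewardWrapper(ensemble, default_alpha=-1.0)
    pessimistic = lcb_net.predict_processed(obs, acts, next_obs, dones)

    # The default can be overridden per call, e.g. for an upper confidence bound.
    optimistic = lcb_net.predict_processed(obs, acts, next_obs, dones, alpha=1.0)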

class imitation.rewards.reward_nets.BasicPotentialCNN(observation_space, hid_sizes, hwc_format=True, **kwargs)[source]#

Bases: Module

Simple implementation of a potential using a CNN.

__init__(observation_space, hid_sizes, hwc_format=True, **kwargs)[source]#

Initialize the potential.

Parameters
  • observation_space (Space) – observation space of the environment.

  • hid_sizes (Iterable[int]) – number of channels in hidden layers of the CNN.

  • hwc_format (bool) – format of the observation. True if channel dimension is last, False if channel dimension is first.

  • kwargs – passed straight through to build_cnn.

Raises

ValueError – if observations are not images.

forward(state)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool#
class imitation.rewards.reward_nets.BasicPotentialMLP(observation_space, hid_sizes, **kwargs)[source]#

Bases: Module

Simple implementation of a potential using an MLP.

__init__(observation_space, hid_sizes, **kwargs)[source]#

Initialize the potential.

Parameters
  • observation_space (Space) – observation space of the environment.

  • hid_sizes (Iterable[int]) – widths of the hidden layers of the MLP.

  • kwargs – passed straight through to build_mlp.

forward(state)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool#
class imitation.rewards.reward_nets.BasicRewardNet(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, **kwargs)[source]#

Bases: RewardNet

MLP that takes as input the state, action, next state and done flag.

These inputs are flattened and then concatenated to one another. Each input can be enabled or disabled by the use_* constructor keyword arguments.

__init__(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, **kwargs)[source]#

Builds reward MLP.

Parameters
  • observation_space (Space) – The observation space.

  • action_space (Space) – The action space.

  • use_state (bool) – should the current state be included as an input to the MLP?

  • use_action (bool) – should the current action be included as an input to the MLP?

  • use_next_state (bool) – should the next state be included as an input to the MLP?

  • use_done (bool) – should the “done” flag be included as an input to the MLP?

  • kwargs – passed straight through to build_mlp.

forward(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions and keep gradients.

training: bool#
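
Unlike predict, forward keeps gradients, so it is what a training loop would typically differentiate through. A sketch, reusing the reward_net and transition batch from the first sketch on this page; the loss is made up for illustration:

    # preprocess converts the NumPy batch to tensors on the net's device and dtype;
    # calling the module then runs forward with gradients enabled.
    state_th, action_th, next_state_th, done_th = reward_net.preprocess(
        obs, acts, next_obs, dones,
    )
    rewards_th = reward_net(state_th, action_th, next_state_th, done_th)  # shape (8,)

    # Example (made-up) loss that backpropagates into the reward parameters.
    loss = -rewards_th.mean()
    loss.backward()
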
class imitation.rewards.reward_nets.BasicShapedRewardNet(observation_space, action_space, *, reward_hid_sizes=(32,), potential_hid_sizes=(32, 32), use_state=True, use_action=True, use_next_state=False, use_done=False, discount_factor=0.99, **kwargs)[source]#

Bases: ShapedRewardNet

Shaped reward net based on MLPs.

This is just a very simple convenience class for instantiating a BasicRewardNet and a BasicPotentialMLP and wrapping them inside a ShapedRewardNet. Mainly exists for backwards compatibility after https://github.com/HumanCompatibleAI/imitation/pull/311 to keep the scripts working.

TODO(ejnnr): if we ever modify AIRL so that it takes in a RewardNet instance directly (instead of a class and kwargs) and instead instantiate the RewardNet inside the scripts, then it probably makes sense to get rid of this class.

__init__(observation_space, action_space, *, reward_hid_sizes=(32,), potential_hid_sizes=(32, 32), use_state=True, use_action=True, use_next_state=False, use_done=False, discount_factor=0.99, **kwargs)[source]#

Builds a simple shaped reward network.

Parameters
  • observation_space (Space) – The observation space.

  • action_space (Space) – The action space.

  • reward_hid_sizes (Sequence[int]) – sequence of widths for the hidden layers of the base reward MLP.

  • potential_hid_sizes (Sequence[int]) – sequence of widths for the hidden layers of the potential MLP.

  • use_state (bool) – should the current state be included as an input to the reward MLP?

  • use_action (bool) – should the current action be included as an input to the reward MLP?

  • use_next_state (bool) – should the next state be included as an input to the reward MLP?

  • use_done (bool) – should the “done” flag be included as an input to the reward MLP?

  • discount_factor (float) – discount factor for the potential shaping.

  • kwargs – passed straight through to BasicRewardNet and BasicPotentialMLP.

training: bool#
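
A construction sketch, reusing the spaces and transition batch from the first sketch on this page; the hidden sizes shown simply restate the defaults:

    from imitation.rewards import reward_nets

    shaped_net = reward_nets.BasicShapedRewardNet(
        obs_space,
        act_space,
        reward_hid_sizes=(32,),        # base reward MLP
        potential_hid_sizes=(32, 32),  # potential MLP
        discount_factor=0.99,
    )
    rewards = shaped_net.predict(obs, acts, next_obs, dones)
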
class imitation.rewards.reward_nets.CnnRewardNet(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, hwc_format=True, **kwargs)[source]#

Bases: RewardNet

CNN that takes as input the state, action, next state and done flag.

Inputs are boosted to tensors with channel, height, and width dimensions, and then concatenated. Image inputs are assumed to be in (h,w,c) format, unless the argument hwc_format=False is passed in. Each input can be enabled or disabled by the use_* constructor keyword arguments, but either use_state or use_next_state must be True.

__init__(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, hwc_format=True, **kwargs)[source]#

Builds reward CNN.

Parameters
  • observation_space (Space) – The observation space.

  • action_space (Space) – The action space.

  • use_state (bool) – Should the current state be included as an input to the CNN?

  • use_action (bool) – Should the current action be included as an input to the CNN?

  • use_next_state (bool) – Should the next state be included as an input to the CNN?

  • use_done (bool) – Should the “done” flag be included as an input to the CNN?

  • hwc_format (bool) – Are image inputs in (h,w,c) format (True), or (c,h,w) (False)? If hwc_format is False, image inputs are not transposed.

  • kwargs – Passed straight through to build_cnn.

Raises

ValueError – if observation or action space is not easily massaged into a CNN input.

forward(state, action, next_state, done)[source]#

Computes the reward net's value on the input state, action, next_state, and done flag.

Takes inputs that will be used, transposes image states to (c,h,w) format if needed, reshapes inputs to have compatible dimensions, concatenates them, and inputs them into the CNN.

Parameters
  • state (Tensor) – current state.

  • action (Tensor) – current action.

  • next_state (Tensor) – next state.

  • done (Tensor) – flag for whether the episode is over.

Returns

reward of the transition.

Return type

th.Tensor

get_num_channels_obs(space)[source]#

Gets number of channels for the observation.

Return type

int

training: bool#
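
A sketch with an Atari-like image observation space; the shapes and discrete action count are arbitrary, and passing use_action=True assumes a discrete action space:

    import numpy as np
    import gymnasium as gym

    from imitation.rewards import reward_nets

    image_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
    discrete_acts = gym.spaces.Discrete(6)

    cnn_net = reward_nets.CnnRewardNet(
        image_space,
        discrete_acts,
        hwc_format=True,  # observations arrive channel-last and are transposed internally
    )

    img_obs = np.stack([image_space.sample() for _ in range(4)])
    img_acts = np.array([discrete_acts.sample() for _ in range(4)])
    img_next_obs = np.stack([image_space.sample() for _ in range(4)])
    img_dones = np.zeros(4, dtype=bool)
    rewards = cnn_net.predict(img_obs, img_acts, img_next_obs, img_dones)  # shape (4,)
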
class imitation.rewards.reward_nets.ForwardWrapper(base)[source]#

Bases: RewardNetWrapper

An abstract RewardNetWrapper that changes the behavior of forward.

Note that all forward wrappers must be placed before all predict processed wrappers.

__init__(base)[source]#

Create a forward wrapper.

Parameters

base (RewardNet) – The base reward network

Raises

ValueError – if the base network is a PredictProcessedWrapper.

training: bool#
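
ForwardWrapper is abstract; a hypothetical minimal subclass only needs to call the base net inside forward. A sketch (the scaling behaviour is made up for illustration):

    from imitation.rewards import reward_nets

    class ScaledForwardWrapper(reward_nets.ForwardWrapper):
        """Hypothetical wrapper that scales the base reward inside forward."""

        def __init__(self, base, scale=0.1):
            super().__init__(base)
            self.scale = scale

        def forward(self, state, action, next_state, done):
            # Delegate to the wrapped net, then rescale; gradients are preserved.
            return self.scale * self.base(state, action, next_state, done)
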
class imitation.rewards.reward_nets.NormalizedRewardNet(base, normalize_output_layer)[source]#

Bases: PredictProcessedWrapper

A reward net that normalizes the output of its base network.

__init__(base, normalize_output_layer)[source]#

Initialize the NormalizedRewardNet.

Parameters
  • base (RewardNet) – a base RewardNet

  • normalize_output_layer (Type[BaseNorm]) – The class to use to normalize rewards. This can be any nn.Module that preserves the shape; e.g. nn.Identity, nn.LayerNorm, or networks.RunningNorm.

predict_processed(state, action, next_state, done, update_stats=True, **kwargs)[source]#

Compute normalized rewards for a batch of transitions without gradients.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • update_stats (bool) – Whether to update the running stats of the normalization layer.

  • **kwargs – kwargs passed to base predict_processed call.

Return type

ndarray

Returns

Computed normalized rewards of shape (batch_size,).

training: bool#
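
A sketch wrapping a base net with the running-normalization layer mentioned above (networks.RunningNorm), reusing the spaces and transition batch from the first sketch on this page:

    from imitation.rewards import reward_nets
    from imitation.util import networks

    base_net = reward_nets.BasicRewardNet(obs_space, act_space)
    normalized_net = reward_nets.NormalizedRewardNet(base_net, networks.RunningNorm)

    # During training, each call updates the running statistics by default...
    train_rewards = normalized_net.predict_processed(obs, acts, next_obs, dones)
    # ...while evaluation calls can freeze them.
    eval_rewards = normalized_net.predict_processed(
        obs, acts, next_obs, dones, update_stats=False,
    )
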
class imitation.rewards.reward_nets.PredictProcessedWrapper(base)[source]#

Bases: RewardNetWrapper

An abstract RewardNetWrapper that changes the behavior of predict_processed.

Subclasses should override predict_processed. Implementations should pass along kwargs to the base reward net’s predict_processed method.

Note: The wrapper will default to forwarding calls to device, forward, preprocess and predict to the base reward net unless explicitly overridden in a subclass.

forward(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions and keep gradients.

Return type

Tensor

predict(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions without gradients.

Converts the th.Tensor rewards from predict_th to NumPy arrays.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

Return type

ndarray

Returns

Computed rewards of shape (batch_size,).

abstract predict_processed(state, action, next_state, done, **kwargs)[source]#

Predict processed must be overridden in subclasses.

Return type

ndarray

predict_th(state, action, next_state, done)[source]#

Compute th.Tensor rewards for a batch of transitions without gradients.

Preprocesses the inputs and outputs th.Tensor rewards.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

Return type

Tensor

Returns

Computed th.Tensor rewards of shape (batch_size,).

training: bool#
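
A hypothetical subclass sketch that clips the processed reward while passing kwargs through to the base net, as recommended above:

    import numpy as np

    from imitation.rewards import reward_nets

    class ClippedRewardWrapper(reward_nets.PredictProcessedWrapper):
        """Hypothetical wrapper clipping processed rewards to a fixed range."""

        def __init__(self, base, low=-10.0, high=10.0):
            super().__init__(base)
            self.low = low
            self.high = high

        def predict_processed(self, state, action, next_state, done, **kwargs):
            # Forward kwargs so wrappers further down the stack still receive them.
            rewards = self.base.predict_processed(state, action, next_state, done, **kwargs)
            return np.clip(rewards, self.low, self.high)
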
class imitation.rewards.reward_nets.RewardEnsemble(observation_space, action_space, members)[source]#

Bases: RewardNetWithVariance

A mean ensemble of reward networks.

A reward ensemble is made up of individual reward networks. To maintain consistency, the “output” of a reward network is defined as the result of its predict_processed call. Thus, for example, the mean of the ensemble is the mean of its members’ predict_processed outputs.

__init__(observation_space, action_space, members)[source]#

Initialize the RewardEnsemble.

Parameters
  • observation_space (Space) – the observation space of the environment

  • action_space (Space) – the action space of the environment

  • members (Iterable[RewardNet]) – the member networks that will make up the ensemble.

Raises

ValueError – if num_members is less than 1

forward(*args)[source]#

The forward method of the ensemble should in general not be used directly.

Return type

Tensor

members: ModuleList#
property num_members#

The number of members in the ensemble.

predict(state, action, next_state, done, **kwargs)[source]#

Return the mean of the ensemble members.

predict_processed(state, action, next_state, done, **kwargs)[source]#

Return the mean of the ensemble members.

Return type

ndarray

predict_processed_all(state, action, next_state, done, **kwargs)[source]#

Get the results of predict_processed on all of the members.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • kwargs – passed along to ensemble members.

Return type

ndarray

Returns

The result of predict_processed for each member in the ensemble, of shape (batch_size, num_members).

predict_reward_moments(state, action, next_state, done, **kwargs)[source]#

Compute the mean and variance of the reward distribution for a batch.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • **kwargs – passed along to predict_processed.

Return type

Tuple[ndarray, ndarray]

Returns

  • Reward mean of shape (batch_size,).

  • Reward variance of shape (batch_size,).
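
Continuing the ensemble built in the AddSTDRewardWrapper sketch above, the per-member outputs and the first two moments are available directly:

    mean, var = ensemble.predict_reward_moments(obs, acts, next_obs, dones)
    assert mean.shape == var.shape == (len(obs),)

    all_rewards = ensemble.predict_processed_all(obs, acts, next_obs, dones)
    assert all_rewards.shape == (len(obs), ensemble.num_members)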

class imitation.rewards.reward_nets.RewardNet(observation_space, action_space, normalize_images=True)[source]#

Bases: Module, ABC

Minimal abstract reward network.

Only requires the implementation of a forward pass (calculating rewards given a batch of states, actions, next states and dones).

__init__(observation_space, action_space, normalize_images=True)[source]#

Initialize the RewardNet.

Parameters
  • observation_space (Space) – the observation space of the environment

  • action_space (Space) – the action space of the environment

  • normalize_images (bool) – whether to automatically normalize image observations to [0, 1] (from 0 to 255). Defaults to True.

property device: device#

Heuristic to determine which device this module is on.

Return type

device

property dtype: dtype#

Heuristic to determine dtype of module.

Return type

dtype

abstract forward(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions and keep gradients.

Return type

Tensor

predict(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions without gradients.

Converts the th.Tensor rewards from predict_th to NumPy arrays.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

Return type

ndarray

Returns

Computed rewards of shape (batch_size,).

predict_processed(state, action, next_state, done, **kwargs)[source]#

Compute the processed rewards for a batch of transitions without gradients.

Defaults to calling predict. Subclasses can override this to normalize or otherwise modify the rewards in ways that may help RL training or other applications of the reward function.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • kwargs – additional kwargs may be passed to change the functionality of subclasses.

Return type

ndarray

Returns

Computed processed rewards of shape (batch_size,).

predict_th(state, action, next_state, done)[source]#

Compute th.Tensor rewards for a batch of transitions without gradients.

Preprocesses the inputs and outputs th.Tensor rewards.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

Return type

Tensor

Returns

Computed th.Tensor rewards of shape (batch_size,).

preprocess(state, action, next_state, done)[source]#

Preprocess a batch of input transitions and convert it to PyTorch tensors.

The output of this function is suitable for its forward pass, so a typical usage would be model(*model.preprocess(transitions)).

Parameters
  • state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.

  • action (ndarray) – The action input. Its shape is (batch_size,) + action_space.shape. The batch dimension is expected to match that of state.

  • next_state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.

  • done (ndarray) – Whether the episode has terminated. Its shape is (batch_size,).

Returns

a Tuple of tensors containing observations, actions, next observations and dones.

Return type

Preprocessed transitions

training: bool#
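
A hypothetical minimal subclass sketch: a single linear layer over the flattened next state. Only forward is required; predict, predict_th and preprocess are inherited. Reuses the spaces and transition batch from the first sketch on this page:

    import numpy as np
    import torch as th

    from imitation.rewards import reward_nets

    class NextStateRewardNet(reward_nets.RewardNet):
        """Hypothetical reward net: one linear layer over the next state."""

        def __init__(self, observation_space, action_space):
            super().__init__(observation_space, action_space)
            obs_dim = int(np.prod(observation_space.shape))
            self.linear = th.nn.Linear(obs_dim, 1)

        def forward(self, state, action, next_state, done):
            # Inputs arrive preprocessed as tensors; output must have shape (batch_size,).
            return self.linear(th.flatten(next_state, 1)).squeeze(1)

    custom_net = NextStateRewardNet(obs_space, act_space)
    rewards = custom_net.predict(obs, acts, next_obs, dones)  # NumPy, shape (batch_size,)
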
class imitation.rewards.reward_nets.RewardNetWithVariance(observation_space, action_space, normalize_images=True)[source]#

Bases: RewardNet

A reward net that keeps track of its epistemic uncertainty through variance.

abstract predict_reward_moments(state, action, next_state, done, **kwargs)[source]#

Compute the mean and variance of the reward distribution.

Parameters
  • state (ndarray) – Current states of shape (batch_size,) + state_shape.

  • action (ndarray) – Actions of shape (batch_size,) + action_shape.

  • next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.

  • done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).

  • **kwargs – may modify the behavior of subclasses

Return type

Tuple[ndarray, ndarray]

Returns

  • Estimated reward mean of shape (batch_size,).

  • Estimated reward variance of shape (batch_size,). # noqa: DAR202

training: bool#
class imitation.rewards.reward_nets.RewardNetWrapper(base)[source]#

Bases: RewardNet

Abstract class representing a wrapper modifying a RewardNet’s functionality.

In general, RewardNetWrappers should subclass either ForwardWrapper or PredictProcessedWrapper.

__init__(base)[source]#

Initialize a RewardNet wrapper.

Parameters

base (RewardNet) – the base RewardNet to wrap.

property base: RewardNet#
Return type

RewardNet

property device: device#

Heuristic to determine which device this module is on.

Return type

device

property dtype: dtype#

Heuristic to determine dtype of module.

Return type

dtype

preprocess(state, action, next_state, done)[source]#

Preprocess a batch of input transitions and convert it to PyTorch tensors.

The output of this function is suitable for its forward pass, so a typical usage would be model(*model.preprocess(transitions)).

Parameters
  • state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.

  • action (ndarray) – The action input. Its shape is (batch_size,) + action_space.shape. The batch dimension is expected to match that of state.

  • next_state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.

  • done (ndarray) – Whether the episode has terminated. Its shape is (batch_size,).

Returns

a Tuple of tensors containing observations, actions, next observations and dones.

Return type

Preprocessed transitions

training: bool#
class imitation.rewards.reward_nets.ShapedRewardNet(base, potential, discount_factor)[source]#

Bases: ForwardWrapper

A RewardNet consisting of a base network and a potential shaping.

__init__(base, potential, discount_factor)[source]#

Setup a ShapedRewardNet instance.

Parameters
  • base (RewardNet) – the base reward net to which the potential shaping will be added. Shaping must be applied directly to the raw reward net. See error below.

  • potential (Callable[[Tensor], Tensor]) – A callable which takes a batch of states (as a PyTorch tensor) and returns a batch of potentials for these states. If this is a PyTorch Module, it becomes a submodule of the ShapedRewardNet instance.

  • discount_factor (float) – discount factor to use for the potential shaping.

forward(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions and keep gradients.

training: bool#
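
A manual composition sketch, roughly what BasicShapedRewardNet does for you; reuses the spaces and transition batch from the first sketch on this page, and the hidden sizes are arbitrary:

    from imitation.rewards import reward_nets

    base = reward_nets.BasicRewardNet(obs_space, act_space)
    potential = reward_nets.BasicPotentialMLP(obs_space, hid_sizes=(32, 32))
    shaped = reward_nets.ShapedRewardNet(base, potential, discount_factor=0.99)

    # forward returns the base reward plus the potential shaping term,
    # roughly base(s, a, s', d) + gamma * potential(s') - potential(s)
    # (with the next-state potential zeroed out on terminal transitions).
    rewards = shaped.predict(obs, acts, next_obs, dones)
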
imitation.rewards.reward_nets.cnn_transpose(tens)[source]#

Transpose a (b,h,w,c)-formatted tensor to (b,c,h,w) format.

Return type

Tensor
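
A quick usage sketch:

    import torch as th

    from imitation.rewards import reward_nets

    hwc_batch = th.zeros((2, 8, 8, 3))                # (b, h, w, c): two 8x8 RGB images
    chw_batch = reward_nets.cnn_transpose(hwc_batch)  # (b, c, h, w)
    assert chw_batch.shape == (2, 3, 8, 8)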