imitation.rewards.reward_nets#
Constructs deep network reward models.
Functions
|
Transpose a (b,h,w,c)-formatted tensor to (b,c,h,w) format. |
Classes
| AddSTDRewardWrapper | Adds a multiple of the estimated standard deviation to mean reward. |
| BasicPotentialCNN | Simple implementation of a potential using a CNN. |
| BasicPotentialMLP | Simple implementation of a potential using an MLP. |
| BasicRewardNet | MLP that takes as input the state, action, next state and done flag. |
| BasicShapedRewardNet | Shaped reward net based on MLPs. |
| CnnRewardNet | CNN that takes as input the state, action, next state and done flag. |
| ForwardWrapper | An abstract RewardNetWrapper that changes the behavior of forward. |
| NormalizedRewardNet | A reward net that normalizes the output of its base network. |
| PredictProcessedWrapper | An abstract RewardNetWrapper that changes the behavior of predict_processed. |
| RewardEnsemble | A mean ensemble of reward networks. |
| RewardNet | Minimal abstract reward network. |
| RewardNetWithVariance | A reward net that keeps track of its epistemic uncertainty through variance. |
| RewardNetWrapper | Abstract class representing a wrapper modifying a RewardNet’s functionality. |
| ShapedRewardNet | A RewardNet consisting of a base network and a potential shaping. |
- class imitation.rewards.reward_nets.AddSTDRewardWrapper(base, default_alpha=0.0)[source]#
Bases:
PredictProcessedWrapper
Adds a multiple of the estimated standard deviation to mean reward.
- __init__(base, default_alpha=0.0)[source]#
Create a reward network that adds a multiple of the standard deviation.
- Parameters
base (RewardNetWithVariance) – A reward network that keeps track of its epistemic variance. This is used to compute the standard deviation.
default_alpha (float) – multiple of standard deviation to add to the reward mean. Defaults to 0.0.
- Raises
TypeError – if base is not an instance of RewardNetWithVariance
- predict_processed(state, action, next_state, done, alpha=None, **kwargs)[source]#
Compute a lower/upper confidence bound on the reward without gradients.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
alpha (Optional[float]) – multiple of standard deviation to add to the reward mean. Defaults to the value provided at initialization.
**kwargs – are not used
- Return type
ndarray
- Returns
Estimated lower confidence bounds on rewards of shape (batch_size,).
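The confidence bound can be illustrated with plain NumPy. This is a minimal sketch, assuming the wrapper adds alpha times the square root of the base network's variance estimate to the mean; the helper name add_std and the sample numbers are hypothetical and not part of the library.

```python
import numpy as np


def add_std(mean: np.ndarray, variance: np.ndarray, alpha: float = 0.0) -> np.ndarray:
    """Return mean + alpha * std for each transition in the batch.

    Hypothetical helper: a negative alpha gives a lower confidence bound,
    a positive alpha an upper one.
    """
    return mean + alpha * np.sqrt(variance)


# Made-up reward moments for a batch of three transitions.
mean = np.array([1.0, 2.0, 3.0])
variance = np.array([4.0, 0.0, 1.0])

lower = add_std(mean, variance, alpha=-1.0)
upper = add_std(mean, variance, alpha=1.0)
```

With the default alpha of 0.0 the sketch returns the mean unchanged.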
- class imitation.rewards.reward_nets.BasicPotentialCNN(observation_space, hid_sizes, hwc_format=True, **kwargs)[source]#
Bases:
Module
Simple implementation of a potential using a CNN.
- __init__(observation_space, hid_sizes, hwc_format=True, **kwargs)[source]#
Initialize the potential.
- Parameters
observation_space (Space) – observation space of the environment.
hid_sizes (Iterable[int]) – number of channels in hidden layers of the CNN.
hwc_format (bool) – format of the observation. True if channel dimension is last, False if channel dimension is first.
kwargs – passed straight through to build_cnn.
- Raises
ValueError – if observations are not images.
- forward(state)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool#
- class imitation.rewards.reward_nets.BasicPotentialMLP(observation_space, hid_sizes, **kwargs)[source]#
Bases:
Module
Simple implementation of a potential using an MLP.
- __init__(observation_space, hid_sizes, **kwargs)[source]#
Initialize the potential.
- Parameters
observation_space (Space) – observation space of the environment.
hid_sizes (Iterable[int]) – widths of the hidden layers of the MLP.
kwargs – passed straight through to build_mlp.
- forward(state)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool#
- class imitation.rewards.reward_nets.BasicRewardNet(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, **kwargs)[source]#
Bases:
RewardNet
MLP that takes as input the state, action, next state and done flag.
These inputs are flattened and then concatenated to one another. Each input can be enabled or disabled by the use_* constructor keyword arguments.
- __init__(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, **kwargs)[source]#
Builds reward MLP.
- Parameters
observation_space (Space) – The observation space.
action_space (Space) – The action space.
use_state (bool) – should the current state be included as an input to the MLP?
use_action (bool) – should the current action be included as an input to the MLP?
use_next_state (bool) – should the next state be included as an input to the MLP?
use_done (bool) – should the “done” flag be included as an input to the MLP?
kwargs – passed straight through to build_mlp.
- forward(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions and keep gradients.
- training: bool#
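The flatten-and-concatenate step described for BasicRewardNet can be sketched with NumPy. This is an illustration only, assuming each enabled input is flattened per transition and joined along the feature axis; concat_inputs is a hypothetical helper, not a library function, and the shapes are made up.

```python
import numpy as np


def concat_inputs(state, action, next_state, done,
                  use_state=True, use_action=True,
                  use_next_state=False, use_done=False):
    """Flatten each enabled input per transition and concatenate them."""
    batch_size = state.shape[0]
    parts = []
    if use_state:
        parts.append(state.reshape(batch_size, -1))
    if use_action:
        parts.append(action.reshape(batch_size, -1))
    if use_next_state:
        parts.append(next_state.reshape(batch_size, -1))
    if use_done:
        # The done flag contributes one feature per transition.
        parts.append(done.reshape(batch_size, 1).astype(state.dtype))
    return np.concatenate(parts, axis=1)


state = np.zeros((5, 2, 3))        # batch of 5, state_shape (2, 3)
action = np.ones((5, 4))           # action_shape (4,)
next_state = np.zeros((5, 2, 3))
done = np.zeros(5, dtype=bool)

default_inputs = concat_inputs(state, action, next_state, done)
all_inputs = concat_inputs(state, action, next_state, done,
                           use_next_state=True, use_done=True)
```

The width of the concatenated array is what the first layer of an MLP would need to accept, so it changes with the use_* flags.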
- class imitation.rewards.reward_nets.BasicShapedRewardNet(observation_space, action_space, *, reward_hid_sizes=(32,), potential_hid_sizes=(32, 32), use_state=True, use_action=True, use_next_state=False, use_done=False, discount_factor=0.99, **kwargs)[source]#
Bases:
ShapedRewardNet
Shaped reward net based on MLPs.
This is just a very simple convenience class for instantiating a BasicRewardNet and a BasicPotentialMLP and wrapping them inside a ShapedRewardNet. Mainly exists for backwards compatibility after https://github.com/HumanCompatibleAI/imitation/pull/311 to keep the scripts working.
- TODO(ejnnr): if we ever modify AIRL so that it takes in a RewardNet instance directly (instead of a class and kwargs) and instead instantiate the RewardNet inside the scripts, then it probably makes sense to get rid of this class.
- __init__(observation_space, action_space, *, reward_hid_sizes=(32,), potential_hid_sizes=(32, 32), use_state=True, use_action=True, use_next_state=False, use_done=False, discount_factor=0.99, **kwargs)[source]#
Builds a simple shaped reward network.
- Parameters
observation_space (Space) – The observation space.
action_space (Space) – The action space.
reward_hid_sizes (Sequence[int]) – sequence of widths for the hidden layers of the base reward MLP.
potential_hid_sizes (Sequence[int]) – sequence of widths for the hidden layers of the potential MLP.
use_state (bool) – should the current state be included as an input to the reward MLP?
use_action (bool) – should the current action be included as an input to the reward MLP?
use_next_state (bool) – should the next state be included as an input to the reward MLP?
use_done (bool) – should the “done” flag be included as an input to the reward MLP?
discount_factor (float) – discount factor for the potential shaping.
kwargs – passed straight through to BasicRewardNet and BasicPotentialMLP.
- training: bool#
- class imitation.rewards.reward_nets.CnnRewardNet(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, hwc_format=True, **kwargs)[source]#
Bases:
RewardNet
CNN that takes as input the state, action, next state and done flag.
Inputs are boosted to tensors with channel, height, and width dimensions, and then concatenated. Image inputs are assumed to be in (h,w,c) format, unless the argument hwc_format=False is passed in. Each input can be enabled or disabled by the use_* constructor keyword arguments, but either use_state or use_next_state must be True.
- __init__(observation_space, action_space, use_state=True, use_action=True, use_next_state=False, use_done=False, hwc_format=True, **kwargs)[source]#
Builds reward CNN.
- Parameters
observation_space (Space) – The observation space.
action_space (Space) – The action space.
use_state (bool) – Should the current state be included as an input to the CNN?
use_action (bool) – Should the current action be included as an input to the CNN?
use_next_state (bool) – Should the next state be included as an input to the CNN?
use_done (bool) – Should the “done” flag be included as an input to the CNN?
hwc_format (bool) – Are image inputs in (h,w,c) format (True), or (c,h,w) (False)? If hwc_format is False, image inputs are not transposed.
kwargs – Passed straight through to build_cnn.
- Raises
ValueError – if observation or action space is not easily massaged into a CNN input.
- forward(state, action, next_state, done)[source]#
Computes rewardNet value on input state, action, next_state, and done flag.
Takes inputs that will be used, transposes image states to (c,h,w) format if needed, reshapes inputs to have compatible dimensions, concatenates them, and inputs them into the CNN.
- Parameters
state (Tensor) – current state.
action (Tensor) – current action.
next_state (Tensor) – next state.
done (Tensor) – flag for whether the episode is over.
- Returns
reward of the transition.
- Return type
th.Tensor
- training: bool#
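The transposition of image batches from (b,h,w,c) to (b,c,h,w) format can be sketched as follows. The sketch uses NumPy rather than PyTorch tensors, and hwc_to_chw is a hypothetical name chosen for illustration.

```python
import numpy as np


def hwc_to_chw(batch: np.ndarray) -> np.ndarray:
    """Move the channel axis of a (b, h, w, c) batch to position 1."""
    if batch.ndim != 4:
        raise ValueError(f"expected a 4-D batch, got {batch.ndim} dimensions")
    return np.transpose(batch, (0, 3, 1, 2))


# Two made-up 4x5 images with 3 channels each, in (b, h, w, c) layout.
images = np.arange(2 * 4 * 5 * 3).reshape(2, 4, 5, 3)
transposed = hwc_to_chw(images)
```

Element [b, h, w, c] of the input ends up at [b, c, h, w] in the output; no values are changed.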
- class imitation.rewards.reward_nets.ForwardWrapper(base)[source]#
Bases:
RewardNetWrapper
An abstract RewardNetWrapper that changes the behavior of forward.
Note that all forward wrappers must be placed before all predict processed wrappers.
- __init__(base)[source]#
Create a forward wrapper.
- Parameters
base (RewardNet) – The base reward network.
- Raises
ValueError – if the base network is a PredictProcessedWrapper.
- training: bool#
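The ordering rule for wrappers can be illustrated with toy classes. This is a sketch of the check described above, not the library's code: the classes below only mirror the names in this module and carry no reward logic.

```python
class RewardNet:
    """Stand-in for a base reward network (toy class)."""


class RewardNetWrapper(RewardNet):
    def __init__(self, base: RewardNet):
        self.base = base


class PredictProcessedWrapper(RewardNetWrapper):
    """Toy wrapper that would change predict_processed."""


class ForwardWrapper(RewardNetWrapper):
    """Toy wrapper that would change forward."""

    def __init__(self, base: RewardNet):
        # Forward wrappers must sit closer to the raw network than any
        # predict-processed wrapper, so reject the wrong nesting order.
        if isinstance(base, PredictProcessedWrapper):
            raise ValueError("ForwardWrapper cannot wrap a PredictProcessedWrapper")
        super().__init__(base)


raw = RewardNet()
ok = PredictProcessedWrapper(ForwardWrapper(raw))  # valid order

try:
    ForwardWrapper(PredictProcessedWrapper(raw))  # invalid order
    rejected = False
except ValueError:
    rejected = True
```

In other words, forward wrappers are applied first (innermost) and predict-processed wrappers last (outermost).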
- class imitation.rewards.reward_nets.NormalizedRewardNet(base, normalize_output_layer)[source]#
Bases:
PredictProcessedWrapper
A reward net that normalizes the output of its base network.
- predict_processed(state, action, next_state, done, update_stats=True, **kwargs)[source]#
Compute normalized rewards for a batch of transitions without gradients.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
update_stats (bool) – Whether to update the running stats of the normalization layer.
**kwargs – kwargs passed to base predict_processed call.
- Return type
ndarray
- Returns
Computed normalized rewards of shape (batch_size,).
- training: bool#
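The effect of update_stats can be sketched with a toy running normalizer. Everything here is an assumption made for illustration: RunningNorm is a hypothetical NumPy class, not the normalization layer the library uses, and the order of operations (update the statistics first, then normalize) is assumed rather than taken from this page.

```python
import numpy as np


class RunningNorm:
    """Toy running mean/variance normalizer (hypothetical, NumPy only)."""

    def __init__(self, eps: float = 1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update_stats(self, batch: np.ndarray) -> None:
        # Welford-style update, one sample at a time.
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def __call__(self, batch: np.ndarray) -> np.ndarray:
        var = self.m2 / self.count if self.count > 0 else 1.0
        return (batch - self.mean) / np.sqrt(var + self.eps)


def predict_processed(rewards, norm, update_stats=True):
    """Optionally update the running statistics, then normalize."""
    if update_stats:
        norm.update_stats(rewards)
    return norm(rewards)


norm = RunningNorm()
train_out = predict_processed(np.array([1.0, 2.0, 3.0]), norm)
count_after_train = norm.count
eval_out = predict_processed(np.array([2.0]), norm, update_stats=False)
```

Passing update_stats=False leaves the running statistics untouched, which is typically what one wants when evaluating rather than training.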
- class imitation.rewards.reward_nets.PredictProcessedWrapper(base)[source]#
Bases:
RewardNetWrapper
An abstract RewardNetWrapper that changes the behavior of predict_processed.
Subclasses should override predict_processed. Implementations should pass along kwargs to the base reward net’s predict_processed method.
- Note: The wrapper will default to forwarding calls to device, forward, preprocess and predict to the base reward net unless explicitly overridden in a subclass.
- forward(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions and keep gradients.
- Return type
Tensor
- predict(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions without gradients.
Converts the th.Tensor rewards from predict_th to NumPy arrays.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
- Return type
ndarray
- Returns
Computed rewards of shape (batch_size,).
- abstract predict_processed(state, action, next_state, done, **kwargs)[source]#
Predict processed must be overridden in subclasses.
- Return type
ndarray
- predict_th(state, action, next_state, done)[source]#
Compute th.Tensor rewards for a batch of transitions without gradients.
Preprocesses the inputs and outputs th.Tensor reward arrays.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
- Return type
Tensor
- Returns
Computed th.Tensor rewards of shape (batch_size,).
- training: bool#
- class imitation.rewards.reward_nets.RewardEnsemble(observation_space, action_space, members)[source]#
Bases:
RewardNetWithVariance
A mean ensemble of reward networks.
A reward ensemble is made up of individual reward networks. To maintain consistency, the “output” of a reward network is defined as the result of its predict_processed method. Thus, for example, the mean of the ensemble is the mean of the results of its members’ predict_processed methods.
- __init__(observation_space, action_space, members)[source]#
Initialize the RewardEnsemble.
- Parameters
observation_space (Space) – the observation space of the environment
action_space (Space) – the action space of the environment
members (Iterable[RewardNet]) – the member networks that will make up the ensemble.
- Raises
ValueError – if num_members is less than 1
- forward(*args)[source]#
The forward method of the ensemble should in general not be used directly.
- Return type
Tensor
- members: ModuleList#
- property num_members#
The number of members in the ensemble.
- predict(state, action, next_state, done, **kwargs)[source]#
Return the mean of the ensemble members.
- predict_processed(state, action, next_state, done, **kwargs)[source]#
Return the mean of the ensemble members.
- Return type
ndarray
- predict_processed_all(state, action, next_state, done, **kwargs)[source]#
Get the results of predict processed on all of the members.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
kwargs – passed along to ensemble members.
- Return type
ndarray
- Returns
The result of predict processed for each member in the ensemble of shape (batch_size, num_members).
- predict_reward_moments(state, action, next_state, done, **kwargs)[source]#
Compute the mean and variance of the reward distribution for a batch.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
**kwargs – passed along to predict processed.
- Return type
Tuple[ndarray, ndarray]
- Returns
Reward mean of shape (batch_size,).
Reward variance of shape (batch_size,).
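The two moments can be computed from the per-member results along the member axis. A minimal NumPy sketch, assuming an array shaped like the output of predict_processed_all; the numbers are made up, and the use of the sample variance (ddof=1) is an assumption rather than something this page states.

```python
import numpy as np

# Made-up per-member rewards of shape (batch_size, num_members).
all_results = np.array([
    [1.0, 3.0],
    [2.0, 2.0],
    [0.0, 4.0],
])

# Reduce over the member axis to get one value per transition.
mean = all_results.mean(axis=-1)              # shape (batch_size,)
variance = all_results.var(axis=-1, ddof=1)   # sample variance (assumption)
```

Members that agree (second row) give zero variance; the more they disagree, the larger the estimated epistemic uncertainty.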
- class imitation.rewards.reward_nets.RewardNet(observation_space, action_space, normalize_images=True)[source]#
Bases:
Module, ABC
Minimal abstract reward network.
Only requires the implementation of a forward pass (calculating rewards given a batch of states, actions, next states and dones).
- __init__(observation_space, action_space, normalize_images=True)[source]#
Initialize the RewardNet.
- Parameters
observation_space (Space) – the observation space of the environment
action_space (Space) – the action space of the environment
normalize_images (bool) – whether to automatically normalize image observations to [0, 1] (from 0 to 255). Defaults to True.
- property device: device#
Heuristic to determine which device this module is on.
- Return type
device
- property dtype: dtype#
Heuristic to determine dtype of module.
- Return type
dtype
- abstract forward(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions and keep gradients.
- Return type
Tensor
- predict(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions without gradients.
Converts the th.Tensor rewards from predict_th to NumPy arrays.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
- Return type
ndarray
- Returns
Computed rewards of shape (batch_size,).
- predict_processed(state, action, next_state, done, **kwargs)[source]#
Compute the processed rewards for a batch of transitions without gradients.
Defaults to calling predict. Subclasses can override this to normalize or otherwise modify the rewards in ways that may help RL training or other applications of the reward function.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
kwargs – additional kwargs may be passed to change the functionality of subclasses.
- Return type
ndarray
- Returns
Computed processed rewards of shape (batch_size,).
- predict_th(state, action, next_state, done)[source]#
Compute th.Tensor rewards for a batch of transitions without gradients.
Preprocesses the inputs and outputs th.Tensor reward arrays.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
- Return type
Tensor
- Returns
Computed th.Tensor rewards of shape (batch_size,).
- preprocess(state, action, next_state, done)[source]#
Preprocess a batch of input transitions and convert it to PyTorch tensors.
The output of this function is suitable for its forward pass, so a typical usage would be model(*model.preprocess(transitions)).
- Parameters
state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.
action (ndarray) – The action input. Its shape is (batch_size,) + action_space.shape. The batch dimension is expected to match the batch dimension of state.
next_state (ndarray) – The next observation input. Its shape is (batch_size,) + observation_space.shape.
done (ndarray) – Whether the episode has terminated. Its shape is (batch_size,).
- Returns
Preprocessed transitions: a tuple of tensors containing observations, actions, next observations and dones.
- Return type
Tuple[Tensor, Tensor, Tensor, Tensor]
- training: bool#
- class imitation.rewards.reward_nets.RewardNetWithVariance(observation_space, action_space, normalize_images=True)[source]#
Bases:
RewardNet
A reward net that keeps track of its epistemic uncertainty through variance.
- abstract predict_reward_moments(state, action, next_state, done, **kwargs)[source]#
Compute the mean and variance of the reward distribution.
- Parameters
state (ndarray) – Current states of shape (batch_size,) + state_shape.
action (ndarray) – Actions of shape (batch_size,) + action_shape.
next_state (ndarray) – Successor states of shape (batch_size,) + state_shape.
done (ndarray) – End-of-episode (terminal state) indicator of shape (batch_size,).
**kwargs – may modify the behavior of subclasses
- Return type
Tuple[ndarray, ndarray]
- Returns
Estimated reward mean of shape (batch_size,).
Estimated reward variance of shape (batch_size,).
- training: bool#
- class imitation.rewards.reward_nets.RewardNetWrapper(base)[source]#
Bases:
RewardNet
Abstract class representing a wrapper modifying a RewardNet’s functionality.
In general RewardNetWrappers should either subclass ForwardWrapper or PredictProcessedWrapper.
- __init__(base)[source]#
Initialize a RewardNet wrapper.
- Parameters
base (RewardNet) – the base RewardNet to wrap.
- property device: device#
Heuristic to determine which device this module is on.
- Return type
device
- property dtype: dtype#
Heuristic to determine dtype of module.
- Return type
dtype
- preprocess(state, action, next_state, done)[source]#
Preprocess a batch of input transitions and convert it to PyTorch tensors.
The output of this function is suitable for its forward pass, so a typical usage would be model(*model.preprocess(transitions)).
- Parameters
state (ndarray) – The observation input. Its shape is (batch_size,) + observation_space.shape.
action (ndarray) – The action input. Its shape is (batch_size,) + action_space.shape. The batch dimension is expected to match the batch dimension of state.
next_state (ndarray) – The next observation input. Its shape is (batch_size,) + observation_space.shape.
done (ndarray) – Whether the episode has terminated. Its shape is (batch_size,).
- Returns
Preprocessed transitions: a tuple of tensors containing observations, actions, next observations and dones.
- Return type
Tuple[Tensor, Tensor, Tensor, Tensor]
- training: bool#
- class imitation.rewards.reward_nets.ShapedRewardNet(base, potential, discount_factor)[source]#
Bases:
ForwardWrapper
A RewardNet consisting of a base network and a potential shaping.
- __init__(base, potential, discount_factor)[source]#
Setup a ShapedRewardNet instance.
- Parameters
base (RewardNet) – the base reward net to which the potential shaping will be added. Shaping must be applied directly to the raw reward net. See error below.
potential (Callable[[Tensor], Tensor]) – A callable which takes a batch of states (as a PyTorch tensor) and returns a batch of potentials for these states. If this is a PyTorch Module, it becomes a submodule of the ShapedRewardNet instance.
discount_factor (float) – discount factor to use for the potential shaping.
- forward(state, action, next_state, done)[source]#
Compute rewards for a batch of transitions and keep gradients.
- training: bool#
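Potential shaping adds the discounted potential of the successor state and subtracts the potential of the current state. The NumPy sketch below shows that arithmetic under two assumptions: it follows the standard potential-based shaping formula r + gamma * phi(s') - phi(s), and it treats the successor potential as zero when done is set. The function name shaped_reward and all values are hypothetical.

```python
import numpy as np


def shaped_reward(base_reward, old_potential, new_potential, done,
                  discount_factor=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    Assumption: the potential of the successor state counts as zero
    once the episode has ended.
    """
    new_potential = (1.0 - done.astype(float)) * new_potential
    return base_reward + discount_factor * new_potential - old_potential


base_reward = np.array([1.0, 1.0])
old_potential = np.array([0.5, 0.5])   # phi(s)
new_potential = np.array([2.0, 2.0])   # phi(s')
done = np.array([False, True])

shaped = shaped_reward(base_reward, old_potential, new_potential, done,
                       discount_factor=0.5)
```

The first transition is mid-episode and receives the full shaping term; the second is terminal, so only the current-state potential is subtracted.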