imitation.algorithms.mce_irl
Finite-horizon tabular Maximum Causal Entropy IRL.
Follows the description in chapters 9 and 10 of Brian Ziebart’s PhD thesis.
Functions
mce_occupancy_measures | Calculate state visitation frequency Ds for each state s under a given policy pi.
mce_partition_fh | Performs the soft Bellman backup for a finite-horizon MDP.
squeeze_r | Squeeze a reward output tensor down to one dimension, if necessary.
Classes
MCEIRL | Tabular MCE IRL.
TabularPolicy | A tabular policy.
- class imitation.algorithms.mce_irl.MCEIRL(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)
Bases: DemonstrationAlgorithm[TransitionsMinimal]
Tabular MCE IRL.
Reward is a function of observations, but policy is a function of states.
The “observations” effectively exist just to let MCE IRL learn a reward in a reasonable feature space, giving a helpful inductive bias, e.g. that similar states have similar reward.
Since we are performing planning to compute the policy, there is no need for function approximation in the policy.
- __init__(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)
Creates MCE IRL.
- Parameters
demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Can be a sequence of trajectories or transitions, an iterable over mappings that each represent a batch of transitions, or a state occupancy measure. The demonstrations must have one-hot encoded observations unless demonstrations is a state occupancy measure.
env (TabularModelPOMDP) – a tabular MDP.
rng (Generator) – random state used for sampling from policy.
reward_net (RewardNet) – a neural network that computes rewards for the supplied observations.
optimizer_cls (Type[Optimizer]) – optimizer to use for supervised training.
optimizer_kwargs (Optional[Mapping[str, Any]]) – keyword arguments for optimizer construction.
discount (float) – the discount factor to use when computing occupancy measure. If not 1.0 (undiscounted), then demonstrations must either be a (discounted) state occupancy measure, or trajectories. Transitions are not allowed as we cannot discount them appropriately without knowing the timestep they were drawn from.
linf_eps (float) – optimisation terminates if the $\ell_\infty$ distance between the demonstrator’s state occupancy measure and the state occupancy measure for the current reward falls below this value.
grad_l2_eps (float) – optimisation also terminates if the $\ell_2$ norm of the MCE IRL gradient falls below this value.
log_interval (Optional[int]) – how often to log current loss stats (using logging). None to disable.
custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.
- Raises
ValueError – if the env horizon is not finite (or an integer).
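The restriction on the discount parameter is easier to see with a small example. The sketch below is not the library's code: demo_state_occupancy is a hypothetical helper, trajectories are given as plain lists of state indices rather than one-hot observations, and averaging over trajectories is an assumed normalisation. It shows that the discount weight for each visit comes from the state's position within its trajectory, which is exactly the information that unordered transitions lack.

```python
def demo_state_occupancy(trajectories, n_states, discount=1.0):
    """Discounted state occupancy averaged over trajectories (illustrative).

    Each trajectory is a list of state indices ordered by timestep, so the
    position within the list supplies the exponent for the discount.
    """
    om = [0.0] * n_states
    for states in trajectories:
        for t, s in enumerate(states):
            # A visit at timestep t contributes discount**t.
            om[s] += discount**t
    # Assumed normalisation: average over the number of trajectories.
    return [total / len(trajectories) for total in om]


# Two hypothetical expert trajectories over a two-state MDP.
trajectories = [[0, 1, 1], [0, 0, 1]]
om = demo_state_occupancy(trajectories, n_states=2, discount=0.5)
```

With discount=1.0 the same function reduces to average visit counts, which is why undiscounted training can accept transitions whose timesteps are unknown.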
- demo_state_om: Optional[ndarray]
- property policy: BasePolicy
Returns a policy imitating the demonstration data.
- Return type
BasePolicy
- set_demonstrations(demonstrations)
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Parameters
demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.
- Return type
None
- train(max_iter=1000)
Runs MCE IRL.
- Parameters
max_iter (int) – The maximum number of iterations to train for. May terminate earlier if self.linf_eps or self.grad_l2_eps thresholds are reached.
- Return type
ndarray
- Returns
State occupancy measure for the final reward function. self.reward_net and self.optimizer will be updated in-place during optimisation.
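The two early-termination thresholds can be illustrated with a short check. This is a minimal sketch under stated assumptions, not the library's training loop: should_stop is a hypothetical helper, occupancy measures and the gradient are passed as flat lists of floats, and strict inequalities are used to mirror the phrase "falls below this value".

```python
import math


def should_stop(demo_om, current_om, grad, linf_eps=1e-3, grad_l2_eps=1e-4):
    """Return True if either early-termination condition holds (illustrative)."""
    # l-infinity distance: the largest absolute per-state difference.
    linf_delta = max(abs(d - c) for d, c in zip(demo_om, current_om))
    # l2 norm of the flattened gradient.
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return linf_delta < linf_eps or grad_norm < grad_l2_eps


demo_om = [1.5, 0.5, 1.0]

# Occupancy measures nearly match, so training would stop.
close_match = should_stop(demo_om, [1.4995, 0.5002, 1.0003], grad=[0.3, 0.4])

# Occupancy measures differ and the gradient is large, so training continues.
keep_going = should_stop(demo_om, [1.0, 1.0, 1.0], grad=[0.3, 0.4])

# Occupancy measures differ but the gradient has vanished, so training stops.
flat_gradient = should_stop(demo_om, [1.0, 1.0, 1.0], grad=[3e-5, 4e-5])
```

Either condition alone is enough to stop, so a vanishing gradient ends training even when the occupancy measures still disagree.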
- class imitation.algorithms.mce_irl.TabularPolicy(state_space, action_space, pi, rng)
Bases: BasePolicy
A tabular policy. Cannot be trained – prediction only.
- __init__(state_space, action_space, pi, rng)
Builds TabularPolicy.
- Parameters
state_space (Space) – The state space of the environment.
action_space (Space) – The action space of the environment.
pi (ndarray) – A tabular policy. Three-dimensional array, where pi[t,s,a] is the probability of taking action a at state s at timestep t.
rng (Generator) – Random state, used for sampling when predict is called with deterministic=False.
- forward(observation, deterministic=False)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
NoReturn
- pi: ndarray
- predict(observation, state=None, episode_start=None, deterministic=False)
Predict action to take in given state.
Arguments follow SB3 naming convention as this is an SB3 policy. In this convention, observations are returned by the environment, and state is a hidden state used by the policy (used by us to keep track of timesteps).
What is called observation here is a state in the underlying MDP, and would be called state elsewhere in this file.
- Parameters
observation (Union[ndarray, Mapping[str, ndarray]]) – States in the underlying MDP.
state (Optional[Tuple[ndarray, ...]]) – Hidden states of the policy – used to represent timesteps by us.
episode_start (Optional[ndarray]) – Has episode completed?
deterministic (bool) – If true, pick action with highest probability; otherwise, sample.
- Return type
Tuple[ndarray, Optional[Tuple[ndarray, ...]]]
- Returns
Tuple of the actions and new hidden states.
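The interplay between the time-indexed table pi[t,s,a] and the hidden state can be sketched as follows. This is an illustration only, not the TabularPolicy implementation: select_actions is a hypothetical function, nested lists stand in for ndarrays, Python's random.Random stands in for a NumPy Generator, and incrementing each timestep by one per call is an assumption about how the hidden state evolves.

```python
import random


def select_actions(pi, states, timesteps, rng, deterministic=False):
    """Pick one action per state from a time-indexed tabular policy (illustrative).

    pi[t][s][a] is the probability of action a in state s at timestep t.
    Returns (actions, next_timesteps); the timesteps play the role of the
    hidden state that is threaded through successive calls.
    """
    actions = []
    for s, t in zip(states, timesteps):
        probs = pi[t][s]
        if deterministic:
            # Greedy choice: the most probable action.
            action = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Sample an action in proportion to its probability.
            action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        actions.append(action)
    # Assumption: each call advances every environment by one timestep.
    next_timesteps = [t + 1 for t in timesteps]
    return actions, next_timesteps


# Horizon 2, two states, two actions.
pi = [
    [[0.9, 0.1], [0.2, 0.8]],  # t = 0
    [[0.0, 1.0], [1.0, 0.0]],  # t = 1
]
rng = random.Random(0)
actions, timesteps = select_actions(pi, [0, 1], [0, 0], rng, deterministic=True)
sampled, _ = select_actions(pi, [0, 1], timesteps, rng)
```

Because the table is indexed by timestep, the same MDP state can map to different action distributions on successive calls, which is why the timestep has to travel with the caller as hidden state.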
- rng: Generator
- imitation.algorithms.mce_irl.mce_occupancy_measures(env, *, reward=None, pi=None, discount=1.0)
Calculate state visitation frequency Ds for each state s under a given policy pi.
You can get pi from mce_partition_fh.
- Parameters
env (TabularModelPOMDP) – a tabular MDP.
reward (Optional[ndarray]) – reward matrix. Defaults to env.reward_matrix.
pi (Optional[ndarray]) – policy to simulate. Defaults to the soft-optimal policy w.r.t. the reward matrix.
discount (float) – rate to discount the cumulative occupancy measure D.
- Return type
Tuple[ndarray, ndarray]
- Returns
Tuple of D (ndarray) and Dcum (ndarray). D is of shape (env.horizon, env.n_states) and records the probability of being in a given state at a given timestep. Dcum is of shape (env.n_states,) and records the expected discounted number of times each state is visited.
- Raises
ValueError – if env.horizon is None (infinite horizon).
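The relationship between D and Dcum can be sketched with a forward pass over timesteps. The code below is a dependency-free illustration rather than the library's implementation: occupancy_measures, transitions, and initial_dist are hypothetical names, nested lists stand in for ndarrays, and D is given one row per timestep to follow the shape stated above.

```python
def occupancy_measures(transitions, initial_dist, pi, discount=1.0):
    """Forward pass for state occupancy under a time-dependent policy (illustrative).

    transitions[s][a][s2] is P(s2 | s, a); initial_dist[s] is P(s_0 = s);
    pi[t][s][a] is the probability of action a in state s at timestep t.
    Returns (D, Dcum) with D[t][s] = P(s_t = s) and
    Dcum[s] = sum over t of discount**t * D[t][s].
    """
    horizon = len(pi)
    n_states = len(initial_dist)
    n_actions = len(transitions[0])
    D = [[0.0] * n_states for _ in range(horizon)]
    D[0] = list(initial_dist)
    for t in range(horizon - 1):
        for s in range(n_states):
            for a in range(n_actions):
                # Probability of being in s and choosing a at timestep t.
                mass = D[t][s] * pi[t][s][a]
                for s2 in range(n_states):
                    D[t + 1][s2] += mass * transitions[s][a][s2]
    Dcum = [
        sum(discount**t * D[t][s] for t in range(horizon))
        for s in range(n_states)
    ]
    return D, Dcum


# Two states, two actions: action 0 stays put, action 1 switches state.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
horizon = 3
# Uniform-random policy at every timestep and state.
pi = [[[0.5, 0.5] for _ in range(2)] for _ in range(horizon)]
D, Dcum = occupancy_measures(transitions, [1.0, 0.0], pi, discount=0.9)
```

Each row of D is a probability distribution over states, so the entries of Dcum sum to 1 + 0.9 + 0.81 in this example.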
- imitation.algorithms.mce_irl.mce_partition_fh(env, *, reward=None, discount=1.0)
Performs the soft Bellman backup for a finite-horizon MDP.
Calculates V^{soft}, Q^{soft}, and pi using recurrences (9.1), (9.2), and (9.3) from Ziebart (2010).
- Parameters
env (TabularModelPOMDP) – a tabular, known-dynamics MDP.
reward (Optional[ndarray]) – a reward matrix. Defaults to env.reward_matrix.
discount (float) – discount rate.
- Return type
Tuple[ndarray, ndarray, ndarray]
- Returns
(V, Q, pi) corresponding to the soft values, Q-values and MCE policy. V is a 2d array, indexed V[t,s]. Q is a 3d array, indexed Q[t,s,a]. pi is a 3d array, indexed pi[t,s,a].
- Raises
ValueError – if env.horizon is None (infinite horizon).
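The backward recursion behind the soft Bellman backup can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: soft_backup, transitions, and rewards are hypothetical names, nested lists stand in for ndarrays, the reward depends only on the state, and Q at the final timestep is taken to be the reward alone.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def soft_backup(transitions, rewards, horizon, discount=1.0):
    """Soft Bellman backup for a finite-horizon tabular MDP (illustrative).

    transitions[s][a][s2] is P(s2 | s, a); rewards[s] is the state reward.
    Returns (V, Q, pi) indexed V[t][s], Q[t][s][a], pi[t][s][a].
    """
    n_states, n_actions = len(transitions), len(transitions[0])
    V = [[0.0] * n_states for _ in range(horizon)]
    Q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon)]
    pi = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon)]
    for t in reversed(range(horizon)):
        for s in range(n_states):
            for a in range(n_actions):
                # No future value beyond the last timestep.
                future = 0.0
                if t + 1 < horizon:
                    future = sum(
                        p * v for p, v in zip(transitions[s][a], V[t + 1])
                    )
                Q[t][s][a] = rewards[s] + discount * future
            # Soft value: log-sum-exp of the Q-values over actions.
            V[t][s] = logsumexp(Q[t][s])
            # MCE policy: pi(a | s) = exp(Q - V).
            pi[t][s] = [math.exp(q - V[t][s]) for q in Q[t][s]]
    return V, Q, pi


# Two states, two actions: action 0 stays put, action 1 switches state.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
rewards = [0.0, 1.0]
V, Q, pi = soft_backup(transitions, rewards, horizon=3)
```

At the final timestep the policy is uniform because every action yields the same Q-value; at earlier timesteps it favours actions that lead to the rewarding state without ever becoming fully deterministic.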
- imitation.algorithms.mce_irl.squeeze_r(r_output)
Squeeze a reward output tensor down to one dimension, if necessary.
- Parameters
r_output (th.Tensor) – output of reward model. Can be either 1D ([n_states]) or 2D ([n_states, 1]).
- Return type
Tensor
- Returns
squeezed reward of shape [n_states].