imitation.algorithms.mce_irl#

Finite-horizon tabular Maximum Causal Entropy IRL.

Follows the description in chapters 9 and 10 of Brian Ziebart’s PhD thesis.

Functions

mce_occupancy_measures(env, *[, reward, pi, ...])

Calculate state visitation frequency Ds for each state s under a given policy pi.

mce_partition_fh(env, *[, reward, discount])

Performs the soft Bellman backup for a finite-horizon MDP.

squeeze_r(r_output)

Squeeze a reward output tensor down to one dimension, if necessary.

Classes

MCEIRL(demonstrations, env, reward_net, rng)

Tabular MCE IRL.

TabularPolicy(state_space, action_space, pi, rng)

A tabular policy.

class imitation.algorithms.mce_irl.MCEIRL(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)[source]#

Bases: DemonstrationAlgorithm[TransitionsMinimal]

Tabular MCE IRL.

Reward is a function of observations, but policy is a function of states.

The “observations” effectively exist just to let MCE IRL learn a reward in a reasonable feature space, giving a helpful inductive bias, e.g. that similar states have similar reward.

Since we are performing planning to compute the policy, there is no need for function approximation in the policy.

__init__(demonstrations, env, reward_net, rng, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, discount=1.0, linf_eps=0.001, grad_l2_eps=0.0001, log_interval=100, *, custom_logger=None)[source]#

Creates MCE IRL.

Parameters
  • demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal, None]) – Demonstrations from an expert (optional). Can be a sequence of trajectories or transitions, an iterable over mappings that represent a batch of transitions, or a state-occupancy measure. The demonstrations must have one-hot encoded observations unless they are given as a state-occupancy measure.

  • env (TabularModelPOMDP) – a tabular MDP.

  • rng (Generator) – random state used for sampling from policy.

  • reward_net (RewardNet) – a neural network that computes rewards for the supplied observations.

  • optimizer_cls (Type[Optimizer]) – optimizer to use for supervised training.

  • optimizer_kwargs (Optional[Mapping[str, Any]]) – keyword arguments for optimizer construction.

  • discount (float) – the discount factor to use when computing occupancy measure. If not 1.0 (undiscounted), then demonstrations must either be a (discounted) state-occupancy measure, or trajectories. Transitions are not allowed as we cannot discount them appropriately without knowing the timestep they were drawn from.

  • linf_eps (float) – optimisation terminates if the $\ell_\infty$ distance between the demonstrator’s state occupancy measure and the state occupancy measure for the current reward falls below this value.

  • grad_l2_eps (float) – optimisation also terminates if the $\ell_2$ norm of the MCE IRL gradient falls below this value.

  • log_interval (Optional[int]) – how often to log current loss stats (using logging). None to disable.

  • custom_logger (Optional[HierarchicalLogger]) – Where to log to; if None (default), creates a new logger.

Raises

ValueError – if the env horizon is not finite or not an integer.
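A rough construction sketch, not taken from the library’s docs: it assumes TabularModelPOMDP can be imported from seals.base_envs and that reward_nets.BasicRewardNet accepts the keyword arguments shown; module paths and defaults may differ across versions, and demo_om is a hypothetical stand-in for real demonstration data.

```python
import numpy as np
from seals import base_envs  # assumed location of TabularModelPOMDP
from imitation.algorithms.mce_irl import MCEIRL
from imitation.rewards import reward_nets

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 2, 10

# Tiny random tabular MDP; observations are one-hot encodings of states.
transition_matrix = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
env = base_envs.TabularModelPOMDP(
    transition_matrix=transition_matrix,
    observation_matrix=np.eye(n_states, dtype=np.float32),
    reward_matrix=rng.standard_normal(n_states),
    horizon=horizon,
)

# Linear reward over the one-hot observations (hid_sizes=[] means no hidden layers).
reward_net = reward_nets.BasicRewardNet(
    env.observation_space,
    env.action_space,
    use_action=False,
    use_next_state=False,
    use_done=False,
    hid_sizes=[],
)

# demo_om stands in for demonstration data, here a state-occupancy measure
# over the n_states states (in practice, computed from expert behaviour).
demo_om = np.full(n_states, 1.0 / n_states)
mce_irl = MCEIRL(demo_om, env, reward_net, rng=rng, optimizer_kwargs={"lr": 0.01})
```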

demo_state_om: Optional[ndarray]#
property policy: BasePolicy#

Returns a policy imitating the demonstration data.

Return type

BasePolicy

set_demonstrations(demonstrations)[source]#

Sets the demonstration data.

Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.

Parameters

demonstrations (Union[ndarray, Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.

Return type

None
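A small sketch of swapping in new demonstration data after construction (continuing the hedged construction sketch above; mce_irl and env are assumed from there, and expert_trajectories is a hypothetical Sequence of Trajectory objects):

```python
from imitation.algorithms.mce_irl import mce_occupancy_measures

# Use a state-occupancy measure computed under the env's own reward as "demos".
_, expert_om = mce_occupancy_measures(env)  # defaults to the soft-optimal policy
mce_irl.set_demonstrations(expert_om)

# Trajectories with one-hot observations would also be accepted:
# mce_irl.set_demonstrations(expert_trajectories)
```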

train(max_iter=1000)[source]#

Runs MCE IRL.

Parameters

max_iter (int) – The maximum number of iterations to train for. May terminate earlier if self.linf_eps or self.grad_l2_eps thresholds are reached.

Return type

ndarray

Returns

State occupancy measure for the final reward function. self.reward_net and self.optimizer will be updated in-place during optimisation.
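A short sketch of running the optimisation, continuing the construction sketch above:

```python
# Optimise the reward until the l_inf / l_2 thresholds or max_iter are reached.
final_om = mce_irl.train(max_iter=500)

print("state occupancy measure:", final_om.shape)  # (n_states,)

# The imitating policy (an SB3 BasePolicy) is available once training has run.
imitating_policy = mce_irl.policy
```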

class imitation.algorithms.mce_irl.TabularPolicy(state_space, action_space, pi, rng)[source]#

Bases: BasePolicy

A tabular policy. Cannot be trained – prediction only.

__init__(state_space, action_space, pi, rng)[source]#

Builds TabularPolicy.

Parameters
  • state_space (Space) – The state space of the environment.

  • action_space (Space) – The action space of the environment.

  • pi (ndarray) – A tabular policy. Three-dimensional array, where pi[t,s,a] is the probability of taking action a at state s at timestep t.

  • rng (Generator) – Random state, used for sampling when predict is called with deterministic=False.
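A minimal sketch of building a TabularPolicy from the soft-optimal policy returned by mce_partition_fh. It assumes env from the construction sketch above and gymnasium spaces; older stacks may need gym.spaces instead.

```python
import gymnasium as gym
import numpy as np
from imitation.algorithms.mce_irl import TabularPolicy, mce_partition_fh

_, _, pi = mce_partition_fh(env)   # pi[t, s, a], shape (horizon, n_states, n_actions)
_, n_states, n_actions = pi.shape

expert = TabularPolicy(
    state_space=gym.spaces.Discrete(n_states),
    action_space=gym.spaces.Discrete(n_actions),
    pi=pi,
    rng=np.random.default_rng(0),
)
```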

forward(observation, deterministic=False)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling this function directly, since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

NoReturn

pi: ndarray#
predict(observation, state=None, episode_start=None, deterministic=False)[source]#

Predict the action to take in a given state.

Arguments follow SB3 naming convention as this is an SB3 policy. In this convention, observations are returned by the environment, and state is a hidden state used by the policy (used by us to keep track of timesteps).

What is called observation here is a state in the underlying MDP, and would be called state elsewhere in this file.

Parameters
  • observation (Union[ndarray, Mapping[str, ndarray]]) – States in the underlying MDP.

  • state (Optional[Tuple[ndarray, ...]]) – Hidden states of the policy – used to represent timesteps by us.

  • episode_start (Optional[ndarray]) – Has episode completed?

  • deterministic (bool) – If true, pick action with highest probability; otherwise, sample.

Return type

Tuple[ndarray, Optional[Tuple[ndarray, ...]]]

Returns

Tuple of the actions and new hidden states.
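A hedged sketch of querying the policy while threading the SB3-style hidden state (which tracks timesteps) between calls. It assumes the TabularPolicy expert from the sketch above and that observations are passed as a batch of state indices; exact batching behaviour may differ between versions.

```python
import numpy as np

obs = np.array([0])               # batch of one: state index 0 of the underlying MDP
state = None                      # policy initialises its own timestep counter
episode_start = np.array([True])  # start of a new episode

for _ in range(3):
    action, state = expert.predict(
        obs, state=state, episode_start=episode_start, deterministic=False
    )
    episode_start = np.array([False])
    # ...step the environment with `action` and update `obs` here...
```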

rng: Generator#
set_pi(pi)[source]#

Sets tabular policy to pi.

Return type

None

imitation.algorithms.mce_irl.mce_occupancy_measures(env, *, reward=None, pi=None, discount=1.0)[source]#

Calculate state visitation frequency Ds for each state s under a given policy pi.

You can get pi from mce_partition_fh.

Parameters
  • env (TabularModelPOMDP) – a tabular MDP.

  • reward (Optional[ndarray]) – reward matrix. Defaults to env.reward_matrix.

  • pi (Optional[ndarray]) – policy to simulate. Defaults to the soft-optimal policy w.r.t. the reward matrix.

  • discount (float) – rate to discount the cumulative occupancy measure D.

Return type

Tuple[ndarray, ndarray]

Returns

Tuple of D (ndarray) and Dcum (ndarray). D is of shape (env.horizon, env.n_states) and records the probability of being in a given state at a given timestep. Dcum is of shape (env.n_states,) and records the expected discounted number of times each state is visited.

Raises

ValueError – if env.horizon is None (infinite horizon).
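A brief usage sketch, assuming env is the finite-horizon TabularModelPOMDP from the construction example above:

```python
from imitation.algorithms.mce_irl import mce_occupancy_measures, mce_partition_fh

# Soft-optimal policy w.r.t. the environment's own reward matrix.
_, _, pi = mce_partition_fh(env)

# D[t, s]: probability of being in state s at timestep t.
# Dcum[s]: expected (discounted) number of visits to state s.
D, Dcum = mce_occupancy_measures(env, pi=pi, discount=1.0)
assert D.shape == (env.horizon, env.n_states)
assert Dcum.shape == (env.n_states,)
```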

imitation.algorithms.mce_irl.mce_partition_fh(env, *, reward=None, discount=1.0)[source]#

Performs the soft Bellman backup for a finite-horizon MDP.

Calculates V^{soft}, Q^{soft}, and pi using recurrences (9.1), (9.2), and (9.3) from Ziebart (2010).

Parameters
  • env (TabularModelPOMDP) – a tabular, known-dynamics MDP.

  • reward (Optional[ndarray]) – a reward matrix. Defaults to env.reward_matrix.

  • discount (float) – discount rate.

Return type

Tuple[ndarray, ndarray, ndarray]

Returns

(V, Q, pi) corresponding to the soft values, Q-values and MCE policy. V is a 2d array, indexed V[t,s]. Q is a 3d array, indexed Q[t,s,a]. pi is a 3d array, indexed pi[t,s,a].

Raises

ValueError – if env.horizon is None (infinite horizon).
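A short sketch of the output shapes of the soft Bellman backup (same env assumption as above):

```python
import numpy as np
from imitation.algorithms.mce_irl import mce_partition_fh

V, Q, pi = mce_partition_fh(env, discount=1.0)
horizon, n_states, n_actions = Q.shape
assert V.shape == (horizon, n_states)
assert pi.shape == (horizon, n_states, n_actions)
# Each pi[t, s, :] is a probability distribution over actions.
assert np.allclose(pi.sum(axis=-1), 1.0)
```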

imitation.algorithms.mce_irl.squeeze_r(r_output)[source]#

Squeeze a reward output tensor down to one dimension, if necessary.

Parameters

r_output (th.Tensor) – output of reward model. Can be either 1D ([n_states]) or 2D ([n_states, 1]).

Return type

Tensor

Returns

squeezed reward of shape [n_states].
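For illustration, a tiny sketch of the squeeze behaviour described above:

```python
import torch as th
from imitation.algorithms.mce_irl import squeeze_r

r_2d = th.zeros(5, 1)  # reward model output of shape [n_states, 1]
r_1d = th.zeros(5)     # already of shape [n_states]
assert squeeze_r(r_2d).shape == (5,)
assert squeeze_r(r_1d).shape == (5,)
```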