imitation.algorithms.adversarial.airl#

Adversarial Inverse Reinforcement Learning (AIRL).

Classes

AIRL(*, demonstrations, demo_batch_size, ...)

Adversarial Inverse Reinforcement Learning (AIRL).

class imitation.algorithms.adversarial.airl.AIRL(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]#

Bases: AdversarialTrainer

Adversarial Inverse Reinforcement Learning (AIRL).

__init__(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]#

Builds an AIRL trainer.

Parameters
  • demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Demonstrations from an expert (optional). These may be given directly as a types.TransitionsMinimal object, as a sequence of trajectories, or as an iterable of transition batches (mappings from keywords to arrays containing observations, etc.).

  • demo_batch_size (int) – The number of samples in each batch of expert data. The discriminator batch size is twice this number because each discriminator batch contains a generator sample for every expert sample.

  • venv (VecEnv) – The vectorized environment to train in.

  • gen_algo (BaseAlgorithm) – The generator RL algorithm that is trained to maximize discriminator confusion. Its environment and logger will be set to venv and custom_logger.

  • reward_net (RewardNet) – Reward network; used as part of AIRL discriminator.

  • **kwargs – Passed through to AdversarialTrainer.__init__.

Raises

TypeError – If gen_algo.policy does not have an evaluate_actions attribute (present in ActorCriticPolicy), which is needed to compute the log-probability of actions.
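
A minimal construction sketch. The environment, hyperparameter values, and the name rollouts (standing in for expert demonstrations collected elsewhere) are illustrative assumptions, not part of this API:

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    from imitation.algorithms.adversarial.airl import AIRL
    from imitation.rewards.reward_nets import BasicShapedRewardNet

    venv = make_vec_env("CartPole-v1", n_envs=8)

    # The generator policy must expose `evaluate_actions` (true of the
    # ActorCriticPolicy behind PPO's "MlpPolicy"); otherwise __init__ raises TypeError.
    gen_algo = PPO("MlpPolicy", venv, verbose=0)

    # Reward network used as part of the AIRL discriminator.
    reward_net = BasicShapedRewardNet(venv.observation_space, venv.action_space)

    # `rollouts` is assumed to be expert demonstrations, e.g. a sequence of
    # imitation.data.types.Trajectory collected from an expert policy.
    airl_trainer = AIRL(
        demonstrations=rollouts,
        demo_batch_size=1024,  # each discriminator batch holds 2 * 1024 samples
        venv=venv,
        gen_algo=gen_algo,
        reward_net=reward_net,
    )
    airl_trainer.train(total_timesteps=20_000)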

logits_expert_is_high(state, action, next_state, done, log_policy_act_prob=None)[source]#

Compute the discriminator’s logits for each state-action sample.

In Fu’s AIRL paper (https://arxiv.org/pdf/1710.11248.pdf), the discriminator output was given as

\[D_{\theta}(s,a) = \frac{ \exp(r_{\theta}(s,a)) }{ \exp(r_{\theta}(s,a)) + \pi(a|s) }\]

with a high value corresponding to the expert and a low value corresponding to the generator.

In other words, the discriminator output is the probability that a given state-action pair came from the expert rather than the generator.

The logit of the above is given as

\[\operatorname{logit}(D_{\theta}(s,a)) = r_{\theta}(s,a) - \log{ \pi(a|s) }\]

which is what is returned by this function.
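
This follows directly from the definition of \(D_{\theta}\) above: the complementary probability and the resulting odds ratio are

\[1 - D_{\theta}(s,a) = \frac{ \pi(a|s) }{ \exp(r_{\theta}(s,a)) + \pi(a|s) }, \qquad \frac{ D_{\theta}(s,a) }{ 1 - D_{\theta}(s,a) } = \frac{ \exp(r_{\theta}(s,a)) }{ \pi(a|s) },\]

so taking the logarithm of the odds ratio gives \(r_{\theta}(s,a) - \log{ \pi(a|s) }\).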

Parameters
  • state (Tensor) – The state of the environment at the time of the action.

  • action (Tensor) – The action taken by the expert or generator.

  • next_state (Tensor) – The state of the environment after the action.

  • done (Tensor) – Whether a terminal state (as defined under the MDP of the task) has been reached.

  • log_policy_act_prob (Optional[Tensor]) – The log probability of the action taken by the generator, \(\log{ \pi(a|s) }\).

Return type

Tensor

Returns

The logits of the discriminator for each state-action sample.

Raises

TypeError – If log_policy_act_prob is None.
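
A minimal sketch of the computation described by the formula above (not necessarily the library's internal implementation; reward_net is assumed to be the RewardNet passed at construction, and all arguments are preprocessed, batched torch tensors):

    import torch as th

    from imitation.rewards.reward_nets import RewardNet


    def airl_discriminator_logits(
        reward_net: RewardNet,
        state: th.Tensor,
        action: th.Tensor,
        next_state: th.Tensor,
        done: th.Tensor,
        log_policy_act_prob: th.Tensor,
    ) -> th.Tensor:
        """logit(D_theta(s,a)) = r_theta(s,a) - log pi(a|s)."""
        reward = reward_net(state, action, next_state, done)  # r_theta(s,a)
        return reward - log_policy_act_prob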

property reward_test: RewardNet#

Returns the unshaped version of the reward network, which is used for testing.

Return type

RewardNet

property reward_train: RewardNet#

Returns the reward network used to train the generator policy.

Return type

RewardNet
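
A sketch of how these properties might be used after training. It reuses the hypothetical airl_trainer and venv from the construction sketch above and samples random actions purely for illustration; RewardNet.predict takes batched NumPy arrays:

    import numpy as np

    # Shaped reward used during generator training vs. unshaped reward for evaluation.
    train_reward_net = airl_trainer.reward_train
    test_reward_net = airl_trainer.reward_test

    obs = venv.reset()
    actions = np.array([venv.action_space.sample() for _ in range(venv.num_envs)])
    next_obs, _, dones, _ = venv.step(actions)

    # Per-sample rewards under the learned (unshaped) reward network.
    rewards = test_reward_net.predict(obs, actions, next_obs, dones)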

venv: VecEnv#

The original vectorized environment.

venv_train: VecEnv#

Like self.venv, but wrapped with the train reward unless in debug mode.

If debug_use_ground_truth=True was passed into the initializer then self.venv_train is the same as self.venv.

venv_wrapped: VecEnvWrapper#