imitation.algorithms.adversarial.gail#

Generative Adversarial Imitation Learning (GAIL).

Classes

GAIL(*, demonstrations, demo_batch_size, ...)

Generative Adversarial Imitation Learning (GAIL).

RewardNetFromDiscriminatorLogit(base)

Converts raw discriminator logits into a reward signal.

class imitation.algorithms.adversarial.gail.GAIL(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]#

Bases: AdversarialTrainer

Generative Adversarial Imitation Learning (GAIL).

__init__(*, demonstrations, demo_batch_size, venv, gen_algo, reward_net, **kwargs)[source]#

Generative Adversarial Imitation Learning.

Parameters
  • demonstrations (Union[Iterable[Trajectory], Iterable[TransitionMapping], TransitionsMinimal]) – Demonstrations from an expert (optional). May be expressed directly as a types.TransitionsMinimal object, as a sequence of trajectories, or as an iterable of transition batches (mappings from keywords to arrays containing observations, etc.).

  • demo_batch_size (int) – The number of samples in each batch of expert data. The discriminator batch size is twice this number because each discriminator batch contains a generator sample for every expert sample.

  • venv (VecEnv) – The vectorized environment to train in.

  • gen_algo (BaseAlgorithm) – The generator RL algorithm that is trained to maximize discriminator confusion. Environment and logger will be set to venv and custom_logger.

  • reward_net (RewardNet) – A Torch module that takes observation, action, and next-observation tensors as input and computes the discriminator logits. Used as the GAIL discriminator.

  • **kwargs – Passed through to AdversarialTrainer.__init__.
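
Below is a minimal end-to-end sketch of constructing and training a GAIL trainer, assuming a recent imitation / stable-baselines3 / gymnasium setup. The environment, the hyperparameter values, and the quickly trained PPO "expert" are illustrative assumptions, not recommendations; in practice, demonstrations are usually pre-recorded expert trajectories.

    import numpy as np
    from stable_baselines3 import PPO

    from imitation.algorithms.adversarial.gail import GAIL
    from imitation.data import rollout
    from imitation.rewards.reward_nets import BasicRewardNet
    from imitation.util.util import make_vec_env

    rng = np.random.default_rng(0)
    venv = make_vec_env("CartPole-v1", rng=rng, n_envs=8)

    # Illustrative stand-in for an expert: a PPO policy briefly trained on the
    # true reward. Real use cases would load pre-recorded expert trajectories.
    expert = PPO("MlpPolicy", venv, verbose=0).learn(25_000)
    demonstrations = rollout.rollout(
        expert,
        venv,
        rollout.make_sample_until(min_episodes=60),
        rng=rng,
    )

    learner = PPO("MlpPolicy", venv, verbose=0)  # generator policy
    reward_net = BasicRewardNet(venv.observation_space, venv.action_space)  # outputs logits

    gail_trainer = GAIL(
        demonstrations=demonstrations,
        demo_batch_size=512,
        venv=venv,
        gen_algo=learner,
        reward_net=reward_net,
        # CartPole episodes end early on failure, so horizons vary; this flag is
        # passed through to AdversarialTrainer.__init__ to permit that.
        allow_variable_horizon=True,
    )
    gail_trainer.train(total_timesteps=100_000)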

allow_variable_horizon: bool#

If True, allow variable horizon trajectories; otherwise error if detected.

logits_expert_is_high(state, action, next_state, done, log_policy_act_prob=None)[source]#

Compute the discriminator’s logits for each state-action sample.

Parameters
  • state (Tensor) – The state of the environment at the time of the action.

  • action (Tensor) – The action taken by the expert or generator.

  • next_state (Tensor) – The state of the environment after the action.

  • done (Tensor) – Whether a terminal state (as defined under the MDP of the task) has been reached.

  • log_policy_act_prob (Optional[Tensor]) – The log probability of the action taken by the generator, \(\log{P(a|s)}\).

Return type

Tensor

Returns

The logits of the discriminator for each state-action sample.
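
Under the convention used in this implementation (see RewardNetFromDiscriminatorLogit below), a high logit means the discriminator believes the sample came from the expert, and applying a sigmoid yields that probability. An illustrative sketch:

    import torch as th

    logits = th.tensor([-2.0, 0.0, 3.0])  # hypothetical discriminator logits
    p_expert = th.sigmoid(logits)         # tensor([0.1192, 0.5000, 0.9526])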

property reward_test: RewardNet#

Reward used to train policy at “test” time after adversarial training.

Return type

RewardNet

property reward_train: RewardNet#

Reward used to train generator policy.

Return type

RewardNet
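
Reusing the hypothetical gail_trainer and venv from the construction sketch above, the learned reward can also be used outside the adversarial loop; for example, reward_test.predict matches the reward-function callable expected by imitation's RewardVecEnvWrapper. This is a sketch of one possible workflow, not a library-mandated step:

    from imitation.rewards.reward_wrapper import RewardVecEnvWrapper

    # Wrap the environment so each step is scored by the learned test-time
    # reward instead of the task's ground-truth reward.
    learned_reward_venv = RewardVecEnvWrapper(venv, gail_trainer.reward_test.predict)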

venv: VecEnv#

The original vectorized environment.

venv_train: VecEnv#

Like self.venv, but wrapped with train reward unless in debug mode.

If debug_use_ground_truth=True was passed into the initializer then self.venv_train is the same as self.venv.

venv_wrapped: VecEnvWrapper#

class imitation.algorithms.adversarial.gail.RewardNetFromDiscriminatorLogit(base)[source]#

Bases: RewardNet

Converts raw discriminator logits into a reward signal.

A wrapper around a reward network whose raw output is interpreted as discriminator logits; it maps those logits to the corresponding reward for the GAIL algorithm.

Below is the derivation of the transformation that needs to be applied.

The GAIL paper defines the cost function of the generator as:

\[\log{D}\]

as shown on line 5 of Algorithm 1. In the paper, \(D\) is the probability learned by the discriminator, with \(D(X)=1\) if the trajectory comes from the generator and \(D(X)=0\) if it comes from the expert. This implementation uses the opposite convention: \(D(X)=0\) if the trajectory comes from the generator and \(D(X)=1\) if it comes from the expert. The resulting cost function is therefore:

\[\log{(1-D)}\]

Since our algorithm trains using a reward function instead of a loss function, we need to invert the sign to get:

\[R=-\log{(1-D)}=\log{\frac{1}{1-D}}\]

Now, let \(L\) be the output of our reward net, which gives us the logits of D (\(L=\operatorname{logit}{D}\)). We can write:

\[D=\operatorname{sigmoid}{L}=\frac{1}{1+e^{-L}}\]

Since \(1-\operatorname{sigmoid}{(L)}\) is the same as \(\operatorname{sigmoid}{(-L)}\), we can write:

\[R=-\log{\operatorname{sigmoid}{(-L)}}\]

which is a non-decreasing map from the logits of D to the reward.
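
As a quick sanity check of the derivation (an illustrative sketch, not library code), the two expressions for the reward agree numerically and the map is indeed non-decreasing:

    import torch as th
    import torch.nn.functional as F

    L = th.linspace(-5.0, 5.0, steps=11)  # sample logits
    D = th.sigmoid(L)                     # discriminator probability of "expert"
    r_from_d = -th.log(1.0 - D)           # -log(1 - D)
    r_from_logits = -F.logsigmoid(-L)     # -log(sigmoid(-L)), numerically stabler

    assert th.allclose(r_from_d, r_from_logits, atol=1e-4)
    assert th.allclose(r_from_logits, F.softplus(L), atol=1e-5)  # same map as softplus
    assert th.all(r_from_logits[1:] >= r_from_logits[:-1])       # non-decreasing in L

Note that \(-\log{\operatorname{sigmoid}{(-L)}}=\log{(1+e^{L})}\), i.e. the softplus function, so the resulting reward is smooth and strictly positive.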

__init__(base)[source]#

Builds the wrapper around base, a reward network whose raw output is interpreted as discriminator logits.
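
An illustrative construction sketch, assuming a recent imitation/gymnasium setup; the spaces and the BasicRewardNet base are assumptions chosen for the example:

    import numpy as np
    from gymnasium import spaces

    from imitation.algorithms.adversarial.gail import RewardNetFromDiscriminatorLogit
    from imitation.rewards.reward_nets import BasicRewardNet

    obs_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
    act_space = spaces.Discrete(2)
    base = BasicRewardNet(obs_space, act_space)        # raw output = logits of D
    reward_net = RewardNetFromDiscriminatorLogit(base)

    obs = np.zeros((2, 4), dtype=np.float32)   # batch of 2 transitions
    acts = np.zeros((2,), dtype=np.int64)
    dones = np.zeros((2,), dtype=bool)
    rewards = reward_net.predict(obs, acts, obs, dones)
    print(rewards.shape)  # (2,); entries are positive, since -log(1 - D) > 0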

forward(state, action, next_state, done)[source]#

Compute rewards for a batch of transitions and keep gradients.

Return type

Tensor

training: bool#