# Developer Guide

This guide explains the library structure of imitation. The code is organized so that logically similar files are grouped into a subpackage. We maintain the following subpackages in `src/imitation`:

- `algorithms`: the core implementations of imitation and reward learning algorithms.
- `data`: modules to collect, store, and manipulate transitions and trajectories from RL environments.
- `envs`: provides test environments.
- `policies`: provides modules that define policies and methods to manipulate them (e.g., serialization).
- `regularization`: implements a variety of regularization techniques for NN weights.
- `rewards`: modules to build, serialize, and preprocess neural network based reward functions.
- `scripts`: command-line scripts for running experiments through Sacred.
- `util`: provides utility functions such as logging and configuration helpers.
## Algorithms

The `imitation.algorithms.base` module defines the following two classes:

- `BaseImitationAlgorithm`: base class for all imitation algorithms.
- `DemonstrationAlgorithm`: base class for all demonstration-based algorithms like BC, IRL, etc. This class subclasses `BaseImitationAlgorithm`. Demonstration algorithms offer a `policy` property that returns a policy imitating the demonstration data, and a `set_demonstrations` method that sets the demonstration data for learning.
All of the algorithms provide a `train` method for training an agent and/or a reward network. Each available algorithm lives in its own file in `algorithms/`; adversarial algorithms like AIRL and GAIL live in `algorithms/adversarial`.
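As a minimal sketch of how these pieces fit together, the snippet below uses behavioral cloning (`imitation.algorithms.bc.BC`) as the demonstration algorithm. The environment name, episode count, and the availability of a pretrained Hugging Face expert are illustrative assumptions, and loader arguments may differ between versions:

```python
import numpy as np

from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = make_vec_env(
    "CartPole-v1",
    rng=rng,
    # Log original observations/rewards so rollouts can be unwrapped.
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],
)

# Assumption: a pretrained expert for this environment is available
# on the Hugging Face hub.
expert = load_policy("ppo-huggingface", venv, env_name="CartPole-v1")

# Collect demonstration trajectories with the expert policy.
demos = rollout.rollout(
    expert,
    venv,
    rollout.make_sample_until(min_episodes=10),
    rng=rng,
)

# BC is a DemonstrationAlgorithm: demonstrations are passed to the
# constructor (or set later via set_demonstrations), training happens
# in train(), and the result is exposed via the policy property.
trainer = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=rollout.flatten_trajectories(demos),
    rng=rng,
)
trainer.train(n_epochs=1)
learned_policy = trainer.policy
```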
## Data

This subpackage contains modules handling environment data, for example: types for transitions and trajectories, methods to compute rollouts, buffers to store transitions, and helpers for these modules.
- `data.wrappers.BufferingWrapper`: wraps a vectorized environment (`VecEnv`) to save the trajectories from all the environments in a buffer.
- `data.wrappers.RolloutInfoWrapper`: wraps a `gym.Env` to log the original observations and rewards received from the environment. The observations and rewards of the entire episode are logged in the `info` dictionary under the key `"rollout"` at the final time step of the episode. This wrapper is useful for saving rollout trajectories, especially when you want to bypass the reward and/or observation overrides applied by other wrappers. See `data.rollout.unwrap_traj` for details and `scripts/train_rl.py` for an example use case.
- `data.rollout.rollout`: generates rollouts given any policy and an environment.
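For example, `BufferingWrapper` accumulates transitions as you step the environment. This is a minimal sketch, assuming random actions and that `pop_transitions` returns everything collected since the last pop, as in recent versions:

```python
import numpy as np

from imitation.data.wrappers import BufferingWrapper
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = BufferingWrapper(make_vec_env("CartPole-v1", rng=rng, n_envs=2))

# Step the wrapped environment; every transition is saved in the buffer.
venv.reset()
for _ in range(10):
    acts = np.stack([venv.action_space.sample() for _ in range(venv.num_envs)])
    venv.step(acts)

# Retrieve the buffered transitions collected so far.
transitions = venv.pop_transitions()
```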
## Policies

The `imitation.policies` subpackage contains the following modules:

- `policies.base`: defines commonly used policies across the library, like `FeedForward32Policy`, `SAC1024Policy`, and `NormalizeFeaturesExtractor`.
- `policies.exploration_wrapper`: defines the `ExplorationWrapper` class, which wraps a policy to create a partially randomized policy useful for exploration.
- `policies.replay_buffer_wrapper`: defines the `ReplayBufferRewardWrapper`, which wraps a replay buffer so that it returns transitions with rewards specified by a reward function.
- `policies.serialize`: defines various functions to save and load serialized policies from disk or the Hugging Face hub.
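For instance, a stable-baselines3 model can be saved and reloaded through the registry of policy loader types. This is a minimal sketch, assuming the `"ppo"` loader type accepts a `path` keyword pointing at the saved directory, as in recent versions; the paths and training budget are illustrative:

```python
import pathlib

import numpy as np
from stable_baselines3 import PPO

from imitation.policies import serialize
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = make_vec_env("CartPole-v1", rng=rng)

# Train a small PPO model and save it in the directory layout that
# imitation's policy loaders expect.
model = PPO("MlpPolicy", venv)
model.learn(total_timesteps=1_000)
serialize.save_stable_model(pathlib.Path("expert_dir"), model)

# Load it back through the policy registry.
policy = serialize.load_policy("ppo", venv, path="expert_dir")
```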
## Regularization

The `imitation.regularization` subpackage provides an API for creating neural network regularizers. It provides classes such as `regularizers.LpRegularizer` and `regularizers.WeightDecayRegularizer` to regularize the loss function and the weights of a network, respectively. The `updaters.IntervalParamScaler` class scales the lambda hyperparameter of a regularizer up when the ratio of validation to training loss is above an upper bound, and down when the ratio drops below a lower bound.
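As a small illustration of the updater on its own (the interval and scaling factor are arbitrary choices here):

```python
import torch as th

from imitation.regularization import updaters

# Scale lambda whenever the val/train loss ratio leaves [0.9, 1.1].
scaler = updaters.IntervalParamScaler(
    scaling_factor=0.1,
    tolerable_interval=(0.9, 1.1),
)

lam = 1.0
train_loss = th.tensor(1.0)
val_loss = th.tensor(2.0)  # ratio 2.0 > 1.1, so lambda is scaled up
lam = scaler(lam, train_loss, val_loss)
```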
## Rewards

The `imitation.rewards` subpackage contains code related to building, serializing, and loading reward networks. Some of the classes include:

- `rewards.reward_nets.RewardNet`: the base reward network class. Reward networks can take the state, action, and next state as input to predict the reward. The `forward` method is used while training the network, whereas the `predict` method is used during evaluation.
- `rewards.reward_nets.BasicRewardNet`: builds an MLP reward network.
- `rewards.reward_nets.CnnRewardNet`: builds a CNN-based reward network.
- `rewards.reward_nets.RewardEnsemble`: builds an ensemble of reward networks.
- `rewards.reward_wrapper.RewardVecEnvWrapper`: wraps a `VecEnv` with a custom `RewardFn`. The default reward function of the environment is overridden with the passed reward function, and the original rewards are stored in the `info` dictionary under the `original_env_rew` key. This class is used to override the original reward function of an environment with a reward function learned by reward learning algorithms such as preference comparisons.
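For example, a reward network's predictions can replace the environment reward. This is a sketch; the environment choice is arbitrary and the network here is untrained:

```python
import numpy as np

from imitation.rewards.reward_nets import BasicRewardNet
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = make_vec_env("CartPole-v1", rng=rng)

# An MLP reward network; by default it takes (state, action) as input.
reward_net = BasicRewardNet(venv.observation_space, venv.action_space)

# Override the environment reward with the network's predictions; the
# original rewards remain available under info["original_env_rew"].
wrapped_venv = RewardVecEnvWrapper(venv, reward_net.predict)
```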
The `imitation.rewards.serialize` module contains functions to load serialized reward functions. For more, see the Reward Networks Tutorial.
## Scripts

We use Sacred to provide a command-line interface for running experiments. The scripts for running end-to-end experiments are available in `scripts/`. The following Sacred docs explain how to use it:

- Experiment Overview: explains how to create and run experiments. Each script defined in `scripts/` has a corresponding experiment object defined in `scripts/config`, with the experiment object and Python source files named after the algorithm(s) supported. For example, the `train_rl_ex` object is defined in `scripts.config.train_rl` and its main function is in `scripts.train_rl`.
- Ingredients: explains how to use ingredients to avoid code duplication across experiments. The ingredients used in our experiments are defined in `scripts/ingredients/`:
  - `scripts.ingredients.logging`: provides a number of logging utilities.
  - `scripts.ingredients.demonstrations`: provides (expert) demonstrations to learn from.
  - `scripts.ingredients.environment`: provides a vectorized gym environment.
  - `scripts.ingredients.expert`: provides an expert policy.
  - `scripts.ingredients.reward`: provides a reward network.
  - `scripts.ingredients.rl`: provides a reinforcement learning algorithm from stable-baselines3.
  - `scripts.ingredients.policy`: provides a newly constructed stable-baselines3 policy.
  - `scripts.ingredients.wb`: provides Weights & Biases logging.
- Configurations: explains how to use configurations to parametrize runs. The configurations for the different algorithms are defined in their respective files in `scripts/`. Some commonly used configs and ingredients shared across algorithms are defined in `scripts/ingredients/`.
- Command-Line Interface: explains how to run experiments through the command-line interface. Also note the section on printing configs, which is useful for verifying the configuration used for a run.
- Controlling Randomness: explains how to control randomness by seeding experiments through Sacred.
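A Sacred experiment can also be invoked programmatically. This is a minimal sketch; the named config `seals_cartpole` and the `total_timesteps` config key are assumptions about the `train_rl` configuration and may differ between versions:

```python
# Importing imitation.scripts.train_rl attaches the main function to
# the experiment object defined in imitation.scripts.config.train_rl.
import imitation.scripts.train_rl  # noqa: F401
from imitation.scripts.config.train_rl import train_rl_ex

run = train_rl_ex.run(
    named_configs=["seals_cartpole"],          # assumed named config
    config_updates={"total_timesteps": 4096},  # assumed config key
)
print(run.result)
```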
## Util

- `imitation.util.logger.HierarchicalLogger`: a logger that supports contexts for accumulating the mean of values for all the logged keys. Internally, it maintains one separate `stable_baselines3.common.logger.Logger` object for logging mean values, and one `Logger` object for the raw values of each context. An `accumulate_means` context cannot be opened inside an already open `accumulate_means` context. The `imitation.util.logger.configure` function can be used to easily construct a `HierarchicalLogger` object.
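For example (a sketch; the exact key prefixes used for raw and mean values may differ):

```python
from imitation.util import logger as imit_logger

log = imit_logger.configure("output/logs", ["stdout", "csv"])

# Within an accumulate_means context, recorded values are written as
# raw values for that context, and their running mean is tracked.
for step in range(3):
    with log.accumulate_means("train"):
        log.record("loss", 1.0 / (step + 1))
        log.dump(step)

# Outside the context, dumping writes the accumulated means
# (e.g., under "mean/train/..." keys).
log.dump(step=3)
```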
- `imitation.util.networks`: provides additional neural network layers useful for imitation, like `RunningNorm` and `EMANorm`, which normalize their inputs. The module also provides functions like `build_mlp` and `build_cnn` to quickly build neural networks.
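For instance (the layer sizes here are arbitrary):

```python
import torch as th

from imitation.util import networks

# A small MLP; build_mlp assembles the layers for us.
mlp = networks.build_mlp(in_size=4, hid_sizes=[32, 32], out_size=1)

# RunningNorm normalizes its inputs using running statistics, which
# are updated only in training mode.
norm = networks.RunningNorm(4)
out = mlp(norm(th.randn(8, 4)))
```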
- `imitation.util.util`: provides miscellaneous utility functions, like `make_vec_env` to easily construct vectorized environments and `safe_to_tensor`, which converts a NumPy array to a PyTorch tensor.
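For example:

```python
import numpy as np

from imitation.util.util import make_vec_env, safe_to_tensor

# Construct a vectorized environment with two workers.
venv = make_vec_env("CartPole-v1", rng=np.random.default_rng(0), n_envs=2)

# Convert the initial observations to a PyTorch tensor.
obs = venv.reset()
obs_th = safe_to_tensor(obs)
```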
- `imitation.util.video_wrapper.VideoWrapper`: a wrapper to record rendered videos from an environment.