Command Line Interface#

Many features of the core library are accessible via the command line interface built using the Sacred package.

Sacred is used to configure and run the algorithms. It is centered around the concept of experiments, which are composed of reusable ingredients. Each experiment and each ingredient has its own configuration namespace. Named configurations are used to specify a coherent set of configuration values. It is recommended to at least read the Sacred documentation about the command line interface.

The scripts package contains a number of Sacred experiments that either execute algorithms or perform utility tasks. Each experiment is built from reusable ingredients such as the environment, the expert, the demonstrations and the logging setup, and each ingredient has its own configuration namespace (e.g. environment.gym_id, expert.policy_type, demonstrations.n_expert_demos, logging.log_dir).
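
Since each script is an ordinary Sacred experiment, it can also be invoked from Python instead of the command line. The following is a minimal sketch, assuming the train_rl experiment object is exposed as train_rl_ex in imitation.scripts.train_rl (the exact name and location may differ between versions); the named configuration and configuration values mirror the CLI examples below.

# Minimal sketch: run a script's Sacred experiment from Python.
# Assumes the experiment object is exposed as `train_rl_ex` in
# `imitation.scripts.train_rl` (name and location are assumptions).
from imitation.scripts.train_rl import train_rl_ex

run = train_rl_ex.run(
    named_configs=["cartpole"],  # same named configuration as on the CLI
    config_updates={"total_timesteps": 10_000},  # config values as a dict
)
print(run.result)  # the Sacred Run object holds the config and the result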

Usage Examples#

Here we demonstrate some usage examples for the command line interface. You can always find out all the configurable values by running:

python -m imitation.scripts.<script> print_config

Run BC on the CartPole-v1 environment with a pre-trained PPO policy as expert#

Note

Here the cartpole environment is specified via a named configuration.

python -m imitation.scripts.train_imitation bc with \
    cartpole \
    demonstrations.n_expert_demos=50 \
    bc.train_kwargs.n_batches=2000 \
    expert.policy_type=ppo \
    expert.loader_kwargs.path=tests/testdata/expert_models/cartpole_0/policies/final/model.zip

50 expert demonstrations are sampled from the PPO policy that is included in the testdata folder. 2000 batches are enough to train a good policy.
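
If you want to inspect the pre-trained expert outside of the CLI, the checkpoint can be loaded directly. This is a minimal sketch assuming the zip file is a standard Stable Baselines3 PPO save, which is what expert.policy_type=ppo implies.

# Minimal sketch: load the bundled expert checkpoint with Stable Baselines3.
# Assumes the zip is a standard SB3 PPO save, as expert.policy_type=ppo implies.
from stable_baselines3 import PPO

expert = PPO.load(
    "tests/testdata/expert_models/cartpole_0/policies/final/model.zip"
)
print(expert.policy)  # the network architecture of the expert
print(expert.observation_space, expert.action_space)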

Run DAgger on the CartPole-v1 environment with a random policy as expert#

python -m imitation.scripts.train_imitation dagger with \
    cartpole \
    dagger.total_timesteps=2000 \
    demonstrations.n_expert_demos=10 \
    expert.policy_type=random

This will not produce any meaningful results, since a random policy is not a good expert.

Run AIRL on the MountainCar-v0 environment with an expert from the HuggingFace model hub#

python -m imitation.scripts.train_adversarial airl with \
    seals_mountain_car \
    total_timesteps=5000 \
    expert.policy_type=ppo-huggingface \
    demonstrations.n_expert_demos=500

Note

The small number of total timesteps is only for demonstration purposes and will not produce a good policy.

Run GAIL on the seals/Swimmer-v0 environment#

Here we do not use the named configuration for the seals environment, but instead specify the gym_id directly. The seals: prefix ensures that the seals package is imported and the environment is registered.

Note

The Swimmer environment needs mujoco_py to be installed.

python -m imitation.scripts.train_adversarial gail with \
        environment.gym_id="seals:seals/Swimmer-v0" \
        total_timesteps=5000 \
        demonstrations.n_expert_demos=50
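
For reference, the same prefix syntax works when constructing the environment in Python. The sketch below assumes a Gym version that supports the module:env_id prefix and that mujoco_py is installed, as noted above.

# Minimal sketch: the "seals:" prefix makes Gym import the seals package before
# resolving the environment id, mirroring environment.gym_id above.
# Assumes a Gym version supporting the "module:env_id" prefix and mujoco_py.
import gym

env = gym.make("seals:seals/Swimmer-v0")
print(env.observation_space, env.action_space)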

Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts#

First, train an expert and save the demonstrations. By default, this will use PPO and train for 1M time steps. We can set the number of time steps to train for by setting total_timesteps. After training the expert, we generate rollouts using the expert policy and save them to disk. We can set a minimum number of episodes or time steps to be saved by setting one of rollout_save_n_episodes or rollout_save_n_timesteps. Note that the number of episodes or time steps saved may be slightly larger than the specified number.

By default the demonstrations are saved in <log_dir>/rollouts/final (for this script, <log_dir> defaults to output/train_rl/<environment>/<timestamp>). However, we can also pass an explicit logging directory:

python -m imitation.scripts.train_rl with seals_cartpole \
        total_timesteps=40000 \
        logging.log_dir=output/ppo/seals_cartpole/trained \
        rollout_save_n_episodes=50
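
To sanity-check what was saved, the demonstrations can be loaded back in Python. This is a minimal sketch assuming the imitation.data.serialize.load helper and the rollout location under the log directory chosen above.

# Minimal sketch: load the demonstrations that train_rl just saved and
# summarise them. Assumes imitation.data.serialize.load and the default
# rollout location under the log directory used above.
from imitation.data import serialize

trajectories = serialize.load("output/ppo/seals_cartpole/trained/rollouts/final")
print(f"{len(trajectories)} trajectories")
print(f"{sum(len(traj) for traj in trajectories)} transitions in total")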

Instead of training a new expert, we can also load a pre-trained expert policy and generate rollouts from it. This can be achieved using the eval_policy script.

Note that rollout_save_path is interpreted relative to the log_dir of the eval_policy script.

python -m imitation.scripts.eval_policy with seals_cartpole \
        expert.policy_type=ppo-huggingface \
        eval_n_episodes=50 \
        logging.log_dir=output/ppo/seals_cartpole/loaded \
        rollout_save_path=rollouts/final

Now we can run the imitation script (in this case DAgger) and pass the path to the demonstrations we just generated:

python -m imitation.scripts.train_imitation dagger with \
        seals_cartpole \
        dagger.total_timesteps=2000 \
        demonstrations.source=local \
        demonstrations.path=output/ppo/seals_cartpole/loaded/rollouts/final

Visualise saved policies#

We can use the eval_policy script to visualise and render a saved policy. Here we load a policy saved by an earlier train_rl run on Pendulum-v1 (my_run is a placeholder for that run's directory name).

python -m imitation.scripts.eval_policy with \
        expert.policy_type=ppo \
        expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
        environment.num_vec=1 \
        render=True \
        environment.gym_id='Pendulum-v1'
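
A rough Python equivalent of this visualisation is sketched below. It assumes the checkpoint is a Stable Baselines3 PPO model (as expert.policy_type=ppo implies) and uses the older Gym reset/step/render API.

# Minimal sketch: render the saved policy without the CLI.
# Assumes an SB3 PPO checkpoint and the older Gym API in which reset()
# returns only the observation.
import gym
from stable_baselines3 import PPO

model = PPO.load("output/train_rl/Pendulum-v1/my_run/policies/final/model.zip")
env = gym.make("Pendulum-v1")

obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
env.close()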

Comparing algorithms’ performance#

Let’s use the CLI to compare the performance of two policies: an expert and a policy trained with a much smaller budget.

First, let’s train an expert on the CartPole-v1 environment.

python -m imitation.scripts.train_rl with \
        cartpole \
        logging.log_dir=output/train_rl/CartPole-v1/expert \
        total_timesteps=10000

Now let’s train a weaker agent.

python -m imitation.scripts.train_rl with \
    cartpole \
    logging.log_dir=output/train_rl/CartPole-v1/non_expert \
    total_timesteps=1000     # simply training less

We can evaluate each policy using the eval_policy script. For the expert:

python -m imitation.scripts.eval_policy with \
        expert.policy_type=ppo \
        expert.loader_kwargs.path=output/train_rl/CartPole-v1/expert/policies/final/model.zip \
        environment.gym_id='CartPole-v1' \
        environment.num_vec=1 \
        logging.log_dir=output/eval_policy/CartPole-v1/expert

which will return something like

INFO - eval_policy - Result: {
        'n_traj': 74,
        'monitor_return_len': 74,
        'return_min': 26.0,
        'return_mean': 154.21621621621622,
        'return_std': 79.94377589657559,
        'return_max': 500.0,
        'len_min': 26,
        'len_mean': 154.21621621621622,
        'len_std': 79.94377589657559,
        'len_max': 500,
        'monitor_return_min': 26.0,
        'monitor_return_mean': 154.21621621621622,
        'monitor_return_std': 79.94377589657559,
        'monitor_return_max': 500.0
    }
INFO - eval_policy - Completed after 0:00:12

For the non-expert:

python -m imitation.scripts.eval_policy with \
        expert.policy_type=ppo \
        expert.loader_kwargs.path=output/train_rl/CartPole-v1/non_expert/policies/final/model.zip \
        environment.gym_id='CartPole-v1' \
        environment.num_vec=1 \
        logging.log_dir=output/eval_policy/CartPole-v1/non_expert

which will return something like

INFO - eval_policy - Result: {
        'n_traj': 355,
        'monitor_return_len': 355,
        'return_min': 8.0,
        'return_mean': 28.92676056338028,
        'return_std': 15.686012049373561,
        'return_max': 104.0,
        'len_min': 8,
        'len_mean': 28.92676056338028,
        'len_std': 15.686012049373561,
        'len_max': 104,
        'monitor_return_min': 8.0,
        'monitor_return_mean': 28.92676056338028,
        'monitor_return_std': 15.686012049373561,
        'monitor_return_max': 104.0
}
INFO - eval_policy - Completed after 0:00:17

This will save the monitor CSVs (one for each vectorised environment, controlled by environment.num_vec). The monitor CSVs follow the naming convention mon*.monitor.csv. We can load these CSV files with pandas and use the imitation.testing.reward_improvement module to compare the performance of the two policies.

from pathlib import Path
import pandas as pd
from imitation.testing.reward_improvement import is_significant_reward_improvement

# Each monitor CSV starts with a JSON metadata line, hence skiprows=1.
expert_monitor = pd.concat(
    [
        pd.read_csv(f, skiprows=1)
        for f in Path("./output/eval_policy/CartPole-v1/expert/monitor").glob(
            "mon*.monitor.csv"
        )
    ]
)
non_expert_monitor = pd.concat(
    [
        pd.read_csv(f, skiprows=1)
        for f in Path("./output/eval_policy/CartPole-v1/non_expert/monitor").glob(
            "mon*.monitor.csv"
        )
    ]
)

# The "r" column holds the per-episode returns; test whether the expert's
# returns are a significant improvement over the non-expert's at p=0.05.
if is_significant_reward_improvement(
    non_expert_monitor["r"], expert_monitor["r"], 0.05
):
    print("The expert improved over the non-expert with >95% probability")
else:
    print("No significant (p=0.05) reward improvement of expert over non-expert")

Algorithm Scripts#

Call the algorithm scripts like this:

python -m imitation.scripts.<script> [command] with <named_config> <config_values>

| algorithm | script | command |
| --- | --- | --- |
| BC | train_imitation | bc |
| DAgger | train_imitation | dagger |
| AIRL | train_adversarial | airl |
| GAIL | train_adversarial | gail |
| Preference Comparison | train_preference_comparisons | |
| MCE IRL | none | |
| Density Based Reward Estimation | none | |

Utility Scripts#

Call the utility scripts like this:

python -m imitation.scripts.<script>

| Functionality | Script |
| --- | --- |
| Reinforcement Learning | train_rl |
| Evaluating a Policy | eval_policy |
| Parallel Execution of Algorithm Scripts | parallel |
| Converting Trajectory Formats | convert_trajs |
| Analyzing Experimental Results | analyze |

Output Directories#

The results of the script runs are stored in the following directory structure:

output
├── <algo>
│   └── <environment>
│       └── <timestamp>
│           ├── log
│           ├── monitor
│           └── sacred -> ../../../sacred/<script_name>/1
└── sacred
    └── <script_name>
        ├── 1
        └── _sources

Each run directory contains the final model, TensorBoard logs, Sacred logs and the Sacred source files.
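
The numbered directories under sacred are written by Sacred's FileStorageObserver, so each finished run can also be inspected programmatically. The sketch below assumes the standard config.json and run.json files that this observer writes.

# Minimal sketch: read the records Sacred's FileStorageObserver writes into
# output/sacred/<script_name>/<run_id>/. Assumes the standard config.json
# and run.json files produced by that observer.
import json
from pathlib import Path

run_dir = Path("output/sacred/train_rl/1")
config = json.loads((run_dir / "config.json").read_text())
run_info = json.loads((run_dir / "run.json").read_text())

print("configuration:", config)
print("status:", run_info["status"])
print("result:", run_info.get("result"))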