imitation.scripts.train_rl

Uses RL to train a policy from scratch, saving rollouts and policy.

This can be used:
  1. To train a policy on a ground-truth reward function, as a source of synthetic “expert” demonstrations to train IRL or imitation learning algorithms.

  2. To train a policy on a learned reward function, to solve a task or as a way of evaluating the quality of the learned reward function.
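
For example, the same training run can be launched programmatically through the script's Sacred experiment. The following is a minimal sketch, assuming the experiment object train_rl_ex is importable from imitation.scripts.config.train_rl; it only sets configuration keys that appear as documented parameters below.

    # Minimal sketch: run the train_rl Sacred experiment from Python.
    # `train_rl_ex` living in imitation.scripts.config.train_rl is an assumption
    # about the package layout; the config keys mirror the parameters below.
    from imitation.scripts.config.train_rl import train_rl_ex

    run = train_rl_ex.run(
        config_updates=dict(
            total_timesteps=100_000,      # training budget for model.learn()
            policy_save_interval=10_000,  # save intermediate policy checkpoints
            rollout_save_final=True,      # save rollouts once training finishes
        ),
    )
    print(run.result)  # Mapping[str, float] returned by rollout_stats()

When run from the command line instead (see main_console() below), the same keys can be overridden with Sacred's standard with key=value syntax.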

Functions

  • main_console()

  • train_rl(*, total_timesteps, ...) – Trains an expert policy from scratch and saves the rollouts and policy.

imitation.scripts.train_rl.main_console()

imitation.scripts.train_rl.train_rl(*, total_timesteps, normalize_reward, normalize_kwargs, reward_type, reward_path, load_reward_kwargs, rollout_save_final, rollout_save_n_timesteps, rollout_save_n_episodes, policy_save_interval, policy_save_final, agent_path, _rnd)

Trains an expert policy from scratch and saves the rollouts and policy.

Checkpoints:

At each applicable training step, denoted step below (where step is either an integer or “final”):

  • Policies are saved to {log_dir}/policies/{step}/.

  • Rollouts are saved to {log_dir}/rollouts/{step}.npz.
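
A sketch of reloading these checkpoints afterwards is shown below. It assumes the policy directory contains a Stable-Baselines3 model.zip (as saved when the default PPO learner is used) and that the rollout file can be read with imitation.data.serialize.load (older releases expose the equivalent loader as imitation.data.types.load); both are assumptions about the installed version rather than guarantees of this script.

    # Sketch: reload artifacts from a finished run.
    import pathlib

    import stable_baselines3 as sb3
    from imitation.data import serialize  # older versions: imitation.data.types

    log_dir = pathlib.Path("<log_dir>")  # replace with the run's log directory

    # Trajectories written by rollout_save_final / rollout_save_n_*:
    trajectories = serialize.load(log_dir / "rollouts" / "final.npz")

    # Final policy checkpoint (assuming the default PPO learner and that the
    # directory contains a Stable-Baselines3 model.zip):
    policy = sb3.PPO.load(log_dir / "policies" / "final" / "model.zip")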

Parameters
  • total_timesteps (int) – Number of training timesteps in model.learn().

  • normalize_reward (bool) – Applies normalization and clipping to the reward function by keeping a running average of training rewards. Note: this may be redundant if using a learned reward that is already normalized.

  • normalize_kwargs (dict) – kwargs for VecNormalize.

  • reward_type (Optional[str]) – If provided, load the serialized reward of this type and wrap the environment with it. This is useful for testing whether a reward model transfers. For more information, see imitation.rewards.serialize.load_reward.

  • reward_path (Optional[str]) – A specifier, such as a path to a file on disk, used by reward_type to load the reward model. For more information, see imitation.rewards.serialize.load_reward.

  • load_reward_kwargs (Optional[Mapping[str, Any]]) – Additional kwargs to pass to predict_processed. Examples are ‘alpha’ for AddSTDRewardWrapper and ‘update_stats’ for NormalizedRewardNet (see the sketch after this parameter list).

  • rollout_save_final (bool) – If True, then save rollouts right after training is finished.

  • rollout_save_n_timesteps (Optional[int]) – The minimum number of timesteps saved in every file. The actual number saved may exceed this because trajectories are saved by episode rather than by transition. Exactly one of rollout_save_n_timesteps and rollout_save_n_episodes must be set.

  • rollout_save_n_episodes (Optional[int]) – The number of episodes saved in every file. Exactly one of rollout_save_n_timesteps and rollout_save_n_episodes must be set.

  • policy_save_interval (int) – The number of training updates between intermediate policy checkpoints. If nonpositive, intermediate checkpoints are not saved.

  • policy_save_final (bool) – If True, then save the policy right after training is finished.

  • agent_path (Optional[str]) – Path to an agent from which to warm-start training.

  • _rnd (Generator) – Random number generator provided by Sacred.
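
The sketch below illustrates how reward_type, reward_path, and load_reward_kwargs fit together when training on a learned reward. The registry key RewardNet_normalized, the wrapper imitation.rewards.reward_wrapper.RewardVecEnvWrapper, and the use of Stable-Baselines3's make_vec_env are assumptions made for illustration; the reward path is a placeholder.

    # Sketch of the reward-loading path configured by reward_type, reward_path
    # and load_reward_kwargs. The registry key, wrapper location and exact
    # signatures are assumptions about the current imitation API.
    from imitation.rewards.serialize import load_reward
    from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
    from stable_baselines3.common.env_util import make_vec_env

    venv = make_vec_env("CartPole-v1", n_envs=4)

    # load_reward looks up reward_type in a registry and loads the model found
    # at reward_path; extra keyword arguments play the role of load_reward_kwargs.
    reward_fn = load_reward(
        "RewardNet_normalized",       # example reward_type (assumed registry key)
        "<path/to/reward_net.pt>",    # placeholder reward_path
        venv,
        update_stats=False,           # example load_reward_kwargs entry
    )

    # Training against the learned reward amounts to wrapping the environment:
    venv = RewardVecEnvWrapper(venv, reward_fn)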

Return type

Mapping[str, float]

Returns

The return value of rollout_stats() using the final policy.
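
For concreteness, a sketch of consuming this return value, continuing the run object from the first sketch above; the key names return_mean and len_mean are assumed from imitation.data.rollout.rollout_stats and should be checked against that function.

    # Sketch: read statistics returned by train_rl (here via the Sacred `run`
    # object from the first sketch). Key names are assumptions; see
    # imitation.data.rollout.rollout_stats for the authoritative set.
    stats = run.result
    print(f"mean return: {stats['return_mean']:.2f}")
    print(f"mean episode length: {stats['len_mean']:.1f}")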