imitation.scripts.train_rl
Uses RL to train a policy from scratch, saving rollouts and policy.
This can be used:
- To train a policy on a ground-truth reward function, as a source of synthetic “expert” demonstrations for training IRL or imitation learning algorithms.
- To train a policy on a learned reward function, either to solve a task or to evaluate the quality of the learned reward function.
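Since train_rl is exposed as a Sacred experiment, it can also be invoked programmatically. A minimal sketch, assuming the experiment object is named train_rl_ex and importable from imitation.scripts.train_rl, and that a seals_cartpole named config exists (both are assumptions about this particular library version, not guaranteed by this page):

    from imitation.scripts.train_rl import train_rl_ex  # assumed import path

    # Train PPO on the environment's ground-truth reward,
    # saving rollouts and the policy under the run's log_dir.
    run = train_rl_ex.run(
        named_configs=["seals_cartpole"],  # assumed named config selecting the environment
        config_updates=dict(total_timesteps=100_000),
    )
    # Sacred stores the main function's return value on the run object,
    # so this is the rollout_stats() mapping described below.
    print(run.result["return_mean"])

From a shell, the same run would typically be launched with Sacred's with syntax, e.g. python -m imitation.scripts.train_rl with seals_cartpole total_timesteps=100000.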
Functions
- train_rl – Trains an expert policy from scratch and saves the rollouts and policy.
- imitation.scripts.train_rl.train_rl(*, total_timesteps, normalize_reward, normalize_kwargs, reward_type, reward_path, load_reward_kwargs, rollout_save_final, rollout_save_n_timesteps, rollout_save_n_episodes, policy_save_interval, policy_save_final, agent_path, _rnd)
Trains an expert policy from scratch and saves the rollouts and policy.
- Checkpoints:
At each applicable training step step (where step is either an integer or “final”):
- Policies are saved to {log_dir}/policies/{step}/.
- Rollouts are saved to {log_dir}/rollouts/{step}.npz.
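The saved checkpoints can be inspected with standard tools. A minimal sketch, assuming the default agent is a stable-baselines3 PPO model, that the policy directory contains the model.zip file stable-baselines3 writes, and that log_dir is output/ (all three are assumptions for illustration):

    import numpy as np
    from stable_baselines3 import PPO

    # Load the final policy checkpoint (assumes the default PPO agent and
    # that the checkpoint directory contains a stable-baselines3 model.zip).
    policy = PPO.load("output/policies/final/model.zip")

    # Rollouts are stored as a compressed numpy archive; list its arrays.
    rollouts = np.load("output/rollouts/final.npz", allow_pickle=True)
    print(list(rollouts.keys()))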
- Parameters
- total_timesteps (int) – Number of training timesteps in model.learn().
- normalize_reward (bool) – Applies normalization and clipping to the reward function by keeping a running average of training rewards. Note: this may be redundant if using a learned reward that is already normalized.
- normalize_kwargs (dict) – kwargs for VecNormalize.
- reward_type (Optional[str]) – If provided, load the serialized reward of this type and wrap the environment in it. This is useful for testing whether a reward model transfers; see the sketch at the end of this section. For more information, see imitation.rewards.serialize.load_reward.
- reward_path (Optional[str]) – A specifier, such as a path to a file on disk, used by reward_type to load the reward model. For more information, see imitation.rewards.serialize.load_reward.
- load_reward_kwargs (Optional[Mapping[str, Any]]) – Additional kwargs to pass to predict_processed. Examples are alpha for AddSTDRewardWrapper and update_stats for NormalizedRewardNet.
- rollout_save_final (bool) – If True, save rollouts right after training finishes.
- rollout_save_n_timesteps (Optional[int]) – The minimum number of timesteps saved in every file. The actual number may be larger because trajectories are saved by episode rather than by transition. Exactly one of rollout_save_n_timesteps and rollout_save_n_episodes must be set.
- rollout_save_n_episodes (Optional[int]) – The number of episodes saved in every file. Exactly one of rollout_save_n_timesteps and rollout_save_n_episodes must be set.
- policy_save_interval (int) – The number of training updates between intermediate policy saves. If nonpositive, no intermediate saves are made.
- policy_save_final (bool) – If True, save the policy right after training finishes.
- agent_path (Optional[str]) – Path from which to load a warm-started agent.
- _rnd (Generator) – Random number generator provided by Sacred.
- Return type
Mapping[str, float]
- Returns
The return value of rollout_stats() using the final policy.
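Finally, to exercise the reward-transfer path mentioned under reward_type, the learned-reward parameters can be passed as config updates. A minimal sketch, reusing the assumed train_rl_ex experiment object from the first example; the reward type name RewardNet_unshaped and the checkpoint path are placeholders whose exact values depend on how the reward model was saved:

    from imitation.scripts.train_rl import train_rl_ex  # assumed import path, as above

    # Re-train a policy against a previously learned reward model to
    # evaluate how well that reward transfers to policy optimization.
    run = train_rl_ex.run(
        config_updates=dict(
            reward_type="RewardNet_unshaped",  # assumed registered reward type
            reward_path="output/reward_nets/reward_net.pt",  # hypothetical checkpoint path
            total_timesteps=100_000,
        ),
    )
    print(run.result)  # mapping of rollout statistics for the final policy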