Beyond the agent-environment-goal triad, four principal sub-elements characterize reinforcement learning problems.
- Policy. This defines the RL agent’s behavior by mapping perceived environmental states to the actions the agent takes in those states. It can take the form of a simple function or lookup table, or a more involved computational process. For instance, a policy guiding an autonomous vehicle may map pedestrian detection to a stop action (a minimal sketch of such a mapping follows this list).
- Reward signal. This defines the RL problem’s goal. After each action, the environment sends the agent a reward (possibly zero), and the agent’s sole objective is to maximize the cumulative reward it receives over time. For a self-driving vehicle, the reward signal can reflect reduced travel time, avoided collisions, staying on the road and in the proper lane, avoiding abrupt acceleration or deceleration, and so forth. As this example shows, a single reward signal may combine multiple objectives (a composite reward of this kind is sketched after this list).
- Value function. Whereas the reward signal denotes immediate benefit, the value function specifies long-term benefit. The value of a state is its desirability in light of the states, and their associated rewards, that are likely to follow (the standard definitions are given after this list). An autonomous vehicle might shorten its travel time by leaving its lane, driving on the sidewalk, and accelerating sharply, but those actions would lower the long-term value of the states they lead to. The vehicle as an RL agent may therefore accept a marginally longer travel time in exchange for the greater long-term value of staying in its lane, off the sidewalk, and within smooth acceleration limits.
- Model. This is an optional sub-element of a reinforcement learning system. A model lets the agent predict how the environment will respond to candidate actions; the agent then uses those predictions to plan courses of action based on their likely outcomes. In the autonomous-vehicle example, a model might help the vehicle predict the best routes and what to expect from surrounding vehicles given their positions and speeds, and so forth.7 Some model-based approaches use direct human feedback during initial learning and then shift to autonomous learning. (A one-step lookahead sketch using a model also follows this list.)
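To make the policy concrete, here is a minimal Python sketch of a policy as a simple mapping from perceived states to actions. The state labels, action names, and the `choose_action` helper are illustrative assumptions tied to the autonomous-vehicle example, not part of any particular system.

```python
# Minimal sketch of a policy: a lookup table from perceived states to actions.
# State and action labels are hypothetical, echoing the driving example above.

policy = {
    "pedestrian_detected": "stop",
    "clear_road": "maintain_speed",
    "slow_vehicle_ahead": "reduce_speed",
}

def choose_action(state: str) -> str:
    """Return the action the policy prescribes for the given state."""
    return policy.get(state, "reduce_speed")  # fall back to a cautious action

print(choose_action("pedestrian_detected"))  # -> "stop"
```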
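The idea that one scalar reward can reflect several driving objectives can be sketched as a weighted sum. The component names and weights below are illustrative assumptions only; a real reward design would be tuned to the task.

```python
# Sketch of a composite reward signal: several driving objectives are weighted
# and summed into the single scalar the agent tries to maximize.
# All component names and weights are hypothetical.

def reward(travel_time_s: float, collided: bool, off_lane: bool,
           abs_acceleration: float) -> float:
    r = 0.0
    r -= 0.01 * travel_time_s          # penalize longer travel time
    r -= 100.0 if collided else 0.0    # heavily penalize collisions
    r -= 5.0 if off_lane else 0.0      # penalize leaving the lane
    r -= 0.1 * abs_acceleration        # penalize abrupt speed changes
    return r

print(reward(travel_time_s=300, collided=False, off_lane=False,
             abs_acceleration=1.2))
```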
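In the standard notation of the RL literature (not given in the original text), the distinction between immediate reward and long-term value is captured by the discounted return and the state-value function:

```latex
% Discounted return from time t, with discount factor \gamma \in [0, 1):
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

% State-value function under policy \pi:
% the expected return when starting in state s and following \pi thereafter
v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s \right]
```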
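Finally, a model can be sketched as a function that predicts the next state and reward for a candidate action, which the agent uses for one-step lookahead planning. The transition table and helper names below are hypothetical stand-ins for a learned model of the driving environment.

```python
# Sketch of a model used for planning: it predicts the next state and reward
# for each candidate action, and the agent picks the action whose predicted
# outcome looks best. The transition table is a hypothetical stand-in for a
# learned model of the driving environment.

model = {
    ("approaching_intersection", "stop"):           ("waiting_at_light", 0.0),
    ("approaching_intersection", "maintain_speed"): ("ran_red_light", -100.0),
}

def plan(state: str, actions: list[str]) -> str:
    """One-step lookahead: simulate each action with the model, pick the best."""
    def predicted_reward(action: str) -> float:
        _, r = model.get((state, action), (state, 0.0))
        return r
    return max(actions, key=predicted_reward)

print(plan("approaching_intersection", ["stop", "maintain_speed"]))  # -> "stop"
```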