Designing Societally Beneficial Reinforcement Learning Systems

Choices, risks, and reward reporting. Recommendations for how to integrate RL systems with society.

Feb 08, 2022

NOTE: An expanded version of this post can be found on the Berkeley Artificial Intelligence Research (BAIR) Blog.

I’m delighted to share a long-term project charting a future where the public has a better understanding of what makes reinforcement learning (RL) both powerful and risky. This project, with the Center for Long-Term Cybersecurity (CLTC) and Graduates for Engaged and Extended Scholarship in Engineering (GEESE), is my long-term blogging project turned professional. Here, I share a summary of our paper and where different readers should look.

Choices, Risks, and Reward Reports: Charting Public Policy for Reinforcement Learning Systems can be downloaded here, shared on Twitter here, and read about in a press release here.

This paper encompasses three and a half major parts.

First: a summary of what makes RL different from other types of learning (e.g. supervised and unsupervised learning), along with fundamental types of feedback it contains — control, behavioral, and exogenous.
Second: a summary of the distinct risks in this formulation, namely Scoping the Horizon, Defining Rewards, Pruning Information, and Training Multiple Agents.
Third: a forward-looking analysis of specific governance mechanisms and legal points of entry for RL, highlighted by our recommendation to document any real-world system with a reward report.
Third and a half: an Appendix discussing cutting-edge technical questions in RL research and how the guiding principles behind them will define the future of data-driven feedback systems.

This paper centers around the types of feedback central to the RL framework and a specific set of risks RL design manifests. Here I detail them to give a primer for further reading.

Types of feedback:

Control Feedback: the classic notion of feedback from linear systems where the action taken depends on the current measurements of the system.

An illustration of control feedback showing the relationship between the agent and its environment, including a policy (pi) that maps states (s) and rewards (r) onto actions (a) according to policy parameters (theta).
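As a minimal sketch of control feedback, the loop below uses a linear policy whose action depends only on the current measurement. The one-dimensional environment, the function names, and the value of theta are assumptions made for illustration; they do not come from the paper.

```python
def policy(state: float, theta: float) -> float:
    """Linear feedback policy: the action depends only on the current state."""
    return -theta * state


def step(state: float, action: float) -> float:
    """Hypothetical one-dimensional environment: the action shifts the state."""
    return state + action


def rollout(initial_state: float, theta: float, steps: int) -> list:
    """Close the loop: measure the state, act on it, observe the new state."""
    states = [initial_state]
    for _ in range(steps):
        action = policy(states[-1], theta)
        states.append(step(states[-1], action))
    return states


trajectory = rollout(initial_state=1.0, theta=0.5, steps=4)
print(trajectory)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

Here theta stays fixed for the whole rollout: the policy reacts to measurements, but nothing about the policy itself is learned.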

Behavioral Feedback: the often-defining feature of RL, trial-and-error learning, and how the resulting behavior evolves over time.

An illustration of behavioral feedback showing the relationship between the agent and its own replay memory, from which sequential actions are incorporated into behavior (theta).
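Behavioral feedback can be sketched with a toy agent that stores its trials in a replay memory and re-estimates its parameters from that memory. Everything here is an illustrative assumption: the two-action environment, the fixed exploration schedule (chosen so the example is deterministic), and the use of per-action value estimates as theta.

```python
from collections import deque


def reward(action: int) -> float:
    """Hypothetical environment: action 1 pays more than action 0."""
    return 1.0 if action == 1 else 0.2


def train(episodes: int) -> list:
    replay = deque(maxlen=50)  # replay memory of (action, reward) pairs
    theta = [0.0, 0.0]         # per-action value estimates (the "behavior")
    for t in range(episodes):
        if t % 5 == 0:
            # Scheduled exploration: alternate between the two actions.
            action = (t // 5) % 2
        else:
            # Exploit whatever the current estimates favor.
            action = max(range(2), key=lambda a: theta[a])
        replay.append((action, reward(action)))
        # Re-estimate theta from remembered experience.
        for a in range(2):
            remembered = [r for (act, r) in replay if act == a]
            if remembered:
                theta[a] = sum(remembered) / len(remembered)
    return theta


theta = train(episodes=30)
preferred_action = max(range(2), key=lambda a: theta[a])
```

The agent starts out repeating action 0 because that is all it has tried; only after an exploratory trial lands in memory does its behavior shift toward action 1.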

Exogenous Feedback: the future purview of RL designers, namely how an optimized environment impacts systems outside of the predetermined domain.

An illustration of exo-feedback in which control and behavioral feedback interact with other parts of the application domain, causing the environment to drift over time.

Types of risk:

Scoping the Horizon: the timescale chosen for an agent's goals has a large impact on its behavior. In research this is often discussed in the realm of sparse rewards, but in the real world agents can externalize costs depending on the defined horizon.

Distinct planning horizons for vehicle behavior. A short horizon comprises immediate reactions to nearby objects (e.g., signage, road obstacles). A longer horizon comprises more strategic, line-of-sight behaviors (e.g., merging, signaling, passing). Even longer horizons could capture end-to-end route planning. As the horizon expands, different dynamics are brought into scope.
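To make the effect of the horizon concrete, here is a small sketch in which the preferred plan flips once a delayed cost comes into scope. The two plans and their reward sequences are made-up numbers for illustration only.

```python
def truncated_return(rewards: list, horizon: int) -> float:
    """Sum of rewards the agent can 'see' within its planning horizon."""
    return sum(rewards[:horizon])


# Hypothetical plans: the shortcut pays off early but incurs a late cost.
plans = {
    "shortcut": [1.0, 1.0, -5.0],
    "steady": [0.5, 0.5, 0.5],
}


def best_plan(horizon: int) -> str:
    """Pick the plan with the highest return inside the given horizon."""
    return max(plans, key=lambda name: truncated_return(plans[name], horizon))


print(best_plan(horizon=2))  # shortcut
print(best_plan(horizon=3))  # steady
```

With a horizon of two steps the late cost is invisible to the agent, so it is effectively pushed onto whoever is affected at step three.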

Defining Rewards: the classic risk of RL systems, reward hacking, where the designer and agent negotiate behaviors based on a specific function. In the real world, this can often result in unexpected and exploitative behavior.

Defining rewards can lead to reward hacking if the agent learns to navigate around a maze rather than through it.
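The maze example can be sketched as a gap between the reward the designer wrote down and the reward the designer meant. The route names, step counts, and penalty value below are hypothetical and chosen only to show the mismatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    steps: int
    stays_inside_maze: bool


def specified_reward(route: Route) -> float:
    """What the designer wrote: reach the goal in as few steps as possible."""
    return 10.0 - route.steps


def intended_reward(route: Route) -> float:
    """What the designer meant: the same, but only for routes inside the maze."""
    return specified_reward(route) if route.stays_inside_maze else -10.0


routes = [
    Route("through the maze", steps=12, stays_inside_maze=True),
    Route("around the maze", steps=6, stays_inside_maze=False),
]

learned = max(routes, key=specified_reward)  # what an optimizer would pick
wanted = max(routes, key=intended_reward)    # what the designer hoped for
```

An optimizer only ever sees specified_reward, so it has no reason to prefer the route the designer had in mind.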

Pruning Information: a common practice in RL research is to change the environment to fit your needs. In the real world, modifying the environment is changing the information flow from the environment to your agent. Doing so can dramatically change what the reward function means for your agent and offload risk to external systems.

Information pruning in the context of traffic motion planning. The system includes actions and states only on the road itself, ignoring more costly features (e.g., pedestrians).
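A rough sketch of information pruning in the traffic setting: the same reward function gives a very different signal once pedestrian information is dropped from the agent's observation. The state fields, the penalty weight, and the function names are assumptions for illustration.

```python
def prune(state: dict, kept: set) -> dict:
    """Keep only the listed features of the environment state."""
    return {name: value for name, value in state.items() if name in kept}


def reward(observation: dict) -> float:
    """Reward progress; penalize nearby pedestrians only if they are observed."""
    progress = observation["speed"]
    pedestrian_cost = 10.0 * observation.get("pedestrians_nearby", 0)
    return progress - pedestrian_cost


# Hypothetical full state versus a road-only view of the same moment.
full_state = {"speed": 3.0, "lane_offset": 0.1, "pedestrians_nearby": 2}
road_only = prune(full_state, kept={"speed", "lane_offset"})

print(reward(full_state))  # -17.0
print(reward(road_only))   # 3.0
```

The pedestrians are still there in the second case; the agent simply cannot be penalized for something it no longer observes.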

Training Multiple Agents: little is known about how learning systems will interact. When their relative concentration increases, the terms defined in their optimization can re-wire norms and values encoded in the application domains.

Multi-agent RL in traffic (with risk of Goodhart’s Law implied). On the left, RL-based agents adopt behaviors that conform to the existing traffic flow. On the right, the learning-based agents redefine the flow of cars to optimize their own behavior and, in turn, the environment.

The TL;DR - Reward Reporting

We propose Reward Reports, a new form of documentation that foregrounds the societal risks posed by automated decision-making systems (ADS), whether explicitly or implicitly construed as RL. Building on proposals to document datasets and models, we focus on reward functions: the objective that guides optimization decisions in feedback-laden systems. Reward Reports comprise questions that highlight the promises and risks entailed in defining what is being optimized in an AI system, and are intended as living documents that dissolve the distinction between ex-ante specification and ex-post harm. As a result, Reward Reports provide a framework for ongoing deliberation and accountability after a system is deployed.
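One way to picture a "living document" is as a record that carries its own revision history. The sketch below is only an illustration of that idea: the field names, the example system, and the dates are hypothetical and are not the template from the paper.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Revision:
    when: date
    note: str


@dataclass
class RewardReport:
    """Hypothetical fields; the paper's own questions may differ."""
    system_name: str
    reward_description: str
    horizon: str
    known_risks: List[str] = field(default_factory=list)
    revisions: List[Revision] = field(default_factory=list)

    def revise(self, when: date, note: str, new_risk: Optional[str] = None) -> None:
        """Record a post-deployment update instead of overwriting the report."""
        self.revisions.append(Revision(when, note))
        if new_risk is not None:
            self.known_risks.append(new_risk)


report = RewardReport(
    system_name="example traffic controller",
    reward_description="minimize average vehicle wait time",
    horizon="one signal cycle",
    known_risks=["pedestrian delay is not observed"],
)
report.revise(
    when=date(2022, 6, 1),  # illustrative date
    note="Observed longer crossing waits after deployment.",
    new_risk="wait time shifted from drivers to pedestrians",
)
```

Keeping the revisions alongside the original specification is what lets the same document cover both what was intended before deployment and what was observed afterward.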

(more on this soon)

Download the Paper

Where do I start?

Here I provide some guidance on where you should start if you have different backgrounds and goals around RL systems:

I am a technical expert looking to understand the risks...

The most substantial section of our paper, “A Topology of Choices and Risks in RL Design,” translates the common practices of RL engineers and researchers into clear mechanisms of action that affect the domain of interest and its users.

I am interested in learning what RL really encompasses...

Start from the beginning! The “Introduction” gives an excellent overview of what makes RL click. This was one of the most enjoyable sections to create and would be very useful material for any introductory AI course.

I have a deep understanding of RL but have not thought about applying it to critical domains...

We have a specific section dedicated to walking through how risks emerge in three application domains: social media recommendations, vehicle transportation, and energy infrastructure.

I am a policymaker looking for your recommendation...

Honestly, we hope you read the whole thing, but the governance mechanisms can give you a TL;DR on the necessary actions.

I am curious about your thoughts on the future of RL research...

The appendix it is. Explore ideas such as how offline RL breaks feedback loops and where the model-based vs. model-free debate will engage with society.