Notes on Reinforcement Learning 3: Fixing RL Research
Towards Deployable RL, Exploration in Deep Reinforcement Learning and Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
Good morning everyone, this week’s notes come packed with features. As promised in last week’s post, they include a detailed summary of Towards Deployable RL - What’s Broken with RL Research and a Potential Fix, as well as the classification of papers announced on arxiv.org and an overview of my favourites. We’ll begin with this week’s favourites, continue with today’s feature, then the sorted articles on arxiv.org, and finish with some notes.
Subscribe to receive next week’s review of Reinforcement Learning from Human Feedback, and why it is important for the deployment of ChatGPT.
My favourites
Of those announced on arxiv (9th-16th January), these are the ones that captured my attention. I’m sure other researchers would have chosen differently, but I found these to be of great quality.
Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
How to prevent worst-case scenarios or dead-ends in Reinforcement Learning (this ties in with last week’s feature and its chapter on how to ensure agents explore the world in a safe manner)
Learning to Perceive in Deep Model-Free Reinforcement Learning
Exploration in Deep Reinforcement Learning: From Single-Agent to Multi-Agent Domain
This is actually from 2021 but updated in 2023.
A survey on the state of the art in exploration, for both single-agent and multi-agent settings.
Decentralized model-free reinforcement learning in stochastic games with average-reward objective
Score vs. Winrate in Score-Based Games: which Reward for Reinforcement Learning?
At the end of this newsletter I’ll include my notes on Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning and Exploration in Deep Reinforcement Learning: From Single-Agent to Multi-Agent Domain.
Towards Deployable RL - What’s Broken with RL Research and a Potential Fix
This is a paper written by Shie Mannor and Aviv Tamar, both at the Technion, Israel Institute of Technology. I would characterize this paper as a Deployable RL manifesto, as its aim is to encourage research practices that close the current gap between RL and real-world applications. To them, deployable RL means RL that can work at scale, be economically feasible, and can eventually be put in the field.
They heavily criticise the current focus on benchmarks, arguing that it yields very little practical value, and instead advocate for a focus on challenges, which they define as real-world problems sponsored by a group of researchers from academia/industry. In their words: instead of comparing algorithms, solve a real-world problem and bring real-world value.
They end by encouraging people to contribute challenges, frame their own research from this point of view, and criticise research that is devoid of real-world value.
Here I offer you a schematic, heavily summarized version, but I encourage you to also read the full thing at Aviv Tamar’s Substack, link at the end.
Sections
Introduction
Generalist Agents vs Deployable RL Systems
Principles of Deployable RL Research
Actionable Next Steps
1. Introduction
RL is not living up to our expectations.
Here are five popular research practices that were relevant in 2015, but are now causing the field to stagnate:
Overfitting to Specific Benchmarks
Nearly every paper is required to show some improvement on one of the popular benchmarks (Atari, MuJoCo...), which leads to tweaks that only work for those benchmarks.
Progress on benchmarks does not yield tangible real-world value
Wrong Focus
Current benchmarks completely ignore the deployable nature of situated (RL-driven) agents, focusing on algorithms rather than on a system/engineering view.
OpenAI Gym abstracts away all system-design issues
This hampers progress since, in many practical problems, figuring out the useful states, actions, and rewards is a critical component of the development process (see the sketch after this list)
Detached Theory
Useful theory seems to be quite rare:
Regret minimization is overly pessimistic
There is a lot of prior knowledge in practical problems that is not accounted for
Finite states and actions are not a good model for many problems of interest
There is a focus on unimportant qualities
Uneven Playing Grounds
Performance is confounded by resources available to the implementer:
Proficiency in hyper-parameter tuning
Size of NN
Prior knowledge about the problem/solution
The variability in scale, and the current trend at top conferences to prefer massive experimentation over conceptual novelty, can inhibit long-term progress
Lack of Experimental Rigor
Impressive singular experiments give a false sense of progress
We need more rigorous evaluation of difficulty and success: stability, deployment time and cost, testability and life-cycles are critical
The publication standard is such that failure cases are almost never reported, stability is impossible to assess, and software-design issues are not discussed.
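To make the “Wrong Focus” point concrete, here is a minimal sketch of the system design that benchmarks abstract away: a toy environment skeleton written in the Gym reset/step style. The environment, its sensors and its reward numbers are entirely invented for illustration; the point is that every field below is a design decision the popular benchmarks hand to you for free.

```python
import numpy as np

class MachineMaintenanceEnv:
    """Hypothetical environment following the Gym reset/step convention.
    Every attribute below is a design decision that benchmarks make for you."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        # Design decision 1: the observation (here, 3 made-up wear sensors in [0, 1]).
        self.observation_dim = 3
        # Design decision 2: the action set (0 = keep running, 1 = service the machine).
        self.n_actions = 2
        self._state = None

    def reset(self):
        self._state = self.rng.uniform(0.0, 1.0, size=self.observation_dim)
        return self._state

    def step(self, action):
        # Design decision 3: the reward. Servicing costs a little now,
        # a breakdown costs a lot later. All numbers are illustrative.
        wear = float(self._state.mean())
        if action == 1:
            reward, done = -1.0, False
            self._state = np.zeros(self.observation_dim)
        else:
            broke_down = self.rng.random() < wear
            reward, done = (-20.0, True) if broke_down else (0.5, False)
            self._state = np.clip(
                self._state + self.rng.uniform(0.0, 0.1, size=self.observation_dim),
                0.0, 1.0,
            )
        return self._state, reward, done, {}
```

In a benchmark, all three of these choices come pre-made; in a deployed system, getting them right is a large part of the development effort.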
2. Generalist Agents vs Deployable RL Systems
Two views:
Generalist/AI first: Future progress will be made by focusing attention on large-scale training of agents that solve diverse problems, with the hope that along the way a generalist agent will develop, and be a useful component in various real world problems
Deployable RL/RL second: Seek to design RL algorithms that solve concrete real world problems
The five problems from the introduction are relevant to both approaches but, in the current state of the field, we should pursue the second (Deployable RL) approach.
3. Principles of Deployable RL Research
At present, RL is uneconomical to deploy.
To change this we need:
research on how to deploy it effectively
a better understanding of the gains RL brings to practical problems
They propose a constructive model with three general principles around challenges:
Challenges instead of benchmarks:
Challenge: a problem sponsored by a group of researchers from academia/industry: instead of comparing algorithms, solve a real-world problem and bring real-world value
Credit for contributing challenges: A rigorous presentation of a challenge (description, community and supporting platform) should be credited
Measurable progress is the main criterion for publication: Every publication should explain the limitations and issues with the proposed algorithm and how it addresses problems specifically.
Weight class: The computing power available should be reported to assess the significance of the results
Theory papers should address specific challenges:
The goal of the research should be well justified in terms of its potential impact on real-world problems. Theory should also consider problems that have to do with the software life cycle, such as data acquisition, debugging, testability and performance deterioration.
Design Patterns Oriented Research:
Real-world RL-based systems need conceptual solutions in which testability, debuggability, and other system life-cycle issues are addressed.
They foresee that one way to make significant progress on a challenge would be by developing novel design patterns for it.
4. Actionable Next Steps:
To the authors, deployable RL means RL that can work at scale, be economically feasible, and can eventually be put in the field.
Here is what you can do:
Contribute challenges:
Requires a deep understanding of the application domain, with impact across many disciplines (healthcare, engineering...)
Might require industry and academia joining forces
Frame your own research: Frame the research effort within deployable RL principles.
Criticise others' research: Coordinate with researchers, reviewers and area chairs. Ask how a paper gets the field closer to real-world impact.
You can find the whole article over at Aviv Tamar’s Substack or at arxiv.org.
Announced Papers: 9th-16th January
Engineering Applications
From Ember to Blaze: Swift Interactive Video Adaptation via Meta-Reinforcement Learning
Hierarchical Deep Q-Learning Based Handover in Wireless Networks with Dual Connectivity
Traffic Steering for 5G Multi-RAT Deployments using Deep Reinforcement Learning
Reinforcement Learning-based Joint Handover and Beam Tracking in Millimeter-wave Networks
MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors
Long-distance migration with minimal energy consumption in a thermal turbulent environment
A Decentralized Pilot Assignment Methodology for Scalable O-RAN Cell-Free Massive MIMO
Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models
SoK: Adversarial Machine Learning Attacks and Defences in Multi-Agent Reinforcement Learning
Interesting
ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments
Deep Reinforcement Learning for Autonomous Ground Vehicle Exploration Without A-Priori Maps
Hint assisted reinforcement learning: an application in radio astronomy
Why People Skip Music? On Predicting Music Skips using Deep Reinforcement Learning
Imbalanced Classification In Faulty Turbine Data: New Proximal Policy Optimization
Learning-based Design and Control for Quadrupedal Robots with Parallel-Elastic Actuators
Multi-UAV Path Learning for Age and Power Optimization in IoT with UAV Battery Recharge
Tuning Path Tracking Controllers for Autonomous Cars Using Reinforcement Learning
Network Slicing via Transfer Learning aided Distributed Deep Reinforcement Learning
Enabling AI-Generated Content (AIGC) Services in Wireless Edge Networks
Locomotion-Action-Manipulation: Synthesizing Human-Scene Interactions in Complex 3D Environments
Fairness Guaranteed and Auction-based x-haul and Cloud Resource Allocation in Multi-tenant O-RANs
RL-DWA Omnidirectional Motion Planning for Person Following in Domestic Assistance and Monitoring
ECSAS: Exploring Critical Scenarios from Action Sequence in Autonomous Driving
TarGF: Learning Target Gradient Field for Object Rearrangement
MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control
When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning
DROPO: Sim-to-Real Transfer with Offline Domain Randomization
Healthcare Applications
Mathematical Theory
Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm
Approximate Information States for Worst-Case Control and Learning in Uncertain Systems
Safe Policy Improvement for POMDPs via Finite-State Controllers
Sequential Fair Resource Allocation under a Markov Decision Process Framework
IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling
SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration
Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification
An approximate policy iteration viewpoint of actor-critic algorithms
A Deep Reinforcement Learning Framework for Column Generation
Investigating the Properties of Neural Network Representations in Reinforcement Learning
Trajectory Modeling via Random Utility Inverse Reinforcement Learning
Reinforcement Learning Theory
Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
Mutation Testing of Deep Reinforcement Learning Based on Real Faults
Predictive World Models from Real-World Partial Observations
schlably: A Python Framework for Deep Reinforcement Learning Based Scheduling Experiments
Actor-Director-Critic: A Novel Deep Reinforcement Learning Framework
Learning to Perceive in Deep Model-Free Reinforcement Learning
Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum
Universally Expressive Communication in Multi-Agent Reinforcement Learning
When does return-conditioned supervised learning work for offline reinforcement learning?
A Generic Graph Sparsification Framework using Deep Reinforcement Learning
Exploration in Deep Reinforcement Learning: From Single-Agent to Multi-Agent Domain
Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
Reinforcement Learning for Joint Optimization of Multiple Rewards
Financial Applications
Transformer Theory
Game Theory
Decentralized model-free reinforcement learning in stochastic games with average-reward objective
On Reinforcement Learning for the Game of 2048 (PhD dissertation)
Score vs. Winrate in Score-Based Games: which Reward for Reinforcement Learning?
On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games
Notes on Reinforcement Learning 3.1: Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
This paper addresses one of the problems highlighted in The Challenge of AI Value Alignment.
GREAT PAPER
Taylor W. Killian - University of Toronto
Sonali Parbhoo - Imperial College London
Marzyeh Ghassemi - Massachusetts Institute of Technology
Abstract
Identifying worst-case scenarios or dead-ends is crucial in safety-critical scenarios
These situations are rife with uncertainty due to stochastic environments and limited offline training data
Distributional Dead-End Discovery (DistDeD): a framework to identify worst-case decision points based on estimated distributions of the return of a decision.
Used on a toy domain as well as for assessing the risk of death in severely ill patients
Results: improves over prior discovery approaches, increasing detection by 20% and providing indications of risk 10 hours earlier on average
Sections
Introduction
Related Work
Safe and Risk-Sensitive RL
Non-stationary and Uncertainty-Aware RL
RL in safety critical domains
Preliminaries
Distributional RL
Conservatism in Offline RL
Risk Estimation
Dead-end Discovery (DeD)
Risk-sensitive Dead-end Discovery
Illustrative Demonstration of DistDeD
Assessing Medical Dead-ends with DistDeD
Data
State Construction
D- and R- Networks
Training
Experimental Setup
Results
DistDeD Provides Earlier Warning of Patient Risk
DistDeD Allows for a Tunable Assessment of Risk
CQL Enhances DistDeD Performance
Discussion
Limitations
Broader Impact
Author Contributions
Acknowledgements
1. Introduction
In complex, safety-critical scenarios, being able to identify signs of rapid deterioration is critical: think of replacing components within high-value machinery, or of patient evaluation in healthcare.
Quantifying worst-case outcomes is usually challenging as a result of unknown stochasticity in the environment (compounded over a sequence of decisions), potentially changing dynamics, and limited data.
RL is a natural paradigm to address sequential decision-making tasks in safety-critical settings, focusing on maximizing the cumulative effects of decisions over time.
Many approaches rely on a priori knowledge about which states and actions to avoid, but this is not feasible in many real-world tasks, where such knowledge may be unavailable due to unknown interactions between the selected actions and the observed state.
RL in high-risk settings is fully offline and off-policy for ethical and legal reasons. As a consequence, it is heavily affected by the data collected, and confounding information may lead to overestimation of the anticipated return, biased decisions and/or overconfident yet erroneous predictions, as well as to overlooking rare but dangerous situations.
In general, RL has been applied in risk-neutral settings.
DeD (cite) is a framework that takes risk into account by avoiding actions in proportion to their risk of leading to dead-ends.
Recorded negative outcomes are leveraged to identify behaviours that should be avoided
Actions that lead to dead-ends are identified by thresholding point estimates of the expected return of an action, rather than by considering the full distribution
Risk estimation in DeD is therefore limited and too optimistic when determining which actions should be avoided.
By underestimating the risk associated with a particular action, we are unable to determine whether an action could be potentially dangerous.
DistDeD:
Risk-sensitive decision framework positioned to serve as an early-warning system for dead-end discovery
Tool for thinking about risk-sensitivity in data-limited offline settings
Contributions:
Provide distributional estimates of the return to determine whether a certain state is at risk of becoming a dead-end, based on the expected worst-case outcomes over available decisions (see the sketch after this list)
Establish DistDeD as a lower bound on DeD results -> able to detect and provide earlier indication of high-risk scenarios
Modelling the full distribution -> a spectrum of risk-sensitivity when assessing dead-ends, with tunable risk-estimation procedures that can be customized
Empirical evidence that DistDeD enables an earlier determination of high-risk areas of the state space on both a simulated environment and a real-world application.
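To make the contrast with point estimates concrete, here is a minimal sketch of the idea (my own toy code, not the paper’s implementation): risk is read from the lower tail of an estimated return distribution via CVaR, rather than from its mean. The quantile values, the CVaR level alpha and the dead-end threshold below are all placeholders.

```python
import numpy as np

def cvar(return_quantiles, alpha=0.1):
    """Conditional value-at-risk: mean of the worst alpha-fraction of estimated returns."""
    sorted_q = np.sort(np.asarray(return_quantiles))
    k = max(1, int(np.ceil(alpha * len(sorted_q))))
    return sorted_q[:k].mean()

def flag_dead_end(quantiles_per_action, alpha=0.1, threshold=-0.5):
    """Flag a state when even the best available action looks bad in the worst case.

    quantiles_per_action: array of shape [n_actions, n_quantiles] with estimated
    return quantiles per action (e.g. from a distributional Q-network).
    A mean-based (risk-neutral) estimate would be more optimistic than the CVaR used here.
    """
    worst_case = np.array([cvar(q, alpha) for q in quantiles_per_action])
    return bool(worst_case.max() < threshold), worst_case

# Toy example: two actions whose mean returns look fine, but whose lower tails do not.
quantiles = np.array([[-1.0, -0.9, 0.8, 0.9, 1.0],
                      [-1.0, -0.8, 0.7, 0.9, 1.0]])
print(flag_dead_end(quantiles, alpha=0.2, threshold=-0.5))  # (True, array([-1., -1.]))
```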
7. Discussion
Justification, foundational evidence and preliminary findings on DistDeD
Limitations:
Discrete action spaces
Binary reward signal
Dead-ends are derived from a single condition; most real-world scenarios are more complex
Does not make causal claims about the impact of each action
Broader Impact
Intended to assist domain experts, not to be used in isolation
It flags high-risk situations early enough for the human decision maker to act
Misuse could be fatal
Notes on Reinforcement Learning 3.2: Exploration in Deep Reinforcement Learning: From Single-Agent to Multi-Agent Domain
2021 - IEEE Transactions on Neural Networks and Learning Systems
Updated 12 January 2023
Abstract
DRL and multi-agent RL (MARL) are known to be sample-inefficient, which prevents real-world applications
One bottleneck is the exploration challenge: how to efficiently explore the environment and collect informative experiences
Comprehensive survey on existing exploration methods for both single-agent and multi-agent RL.
First: identify challenges
Second: survey organized into two categories, uncertainty-oriented exploration and intrinsic motivation-oriented exploration, plus other notable exploration methods
Both algorithmic analysis and comparison on DRL benchmarks
Summarization and future directions
Sections
Introduction
Preliminaries
Markov Decision Process & Markov Game
Reinforcement Learning Methods
Value-based methods
Policy Gradient Methods
Actor-Critic Methods
MARL Algorithms
Basic Exploration Techniques (see the sketch after this outline)
Epsilon-Greedy
Upper Confidence Bounds
Entropy Regularization
Noise Perturbation
Exploration based on Bayesian Optimization
Gaussian Process-Upper Confidence Bounds (GP-UCB)
Thompson Sampling (TS)
What Makes Exploration Hard in RL
Large State-action Space
Sparse, Delayed Rewards
White-noise Problem
Multi-agent Exploration
Exploration in Single-agent DRL
Uncertainty-oriented Exploration
Exploration under Epistemic Uncertainty
Parametric Posterior-based Exploration
Non-parametric Posterior-based Exploration
Exploration under Aleatoric Uncertainty
Exploration under Both Types of Uncertainty
Intrinsic Motivation-oriented Exploration
Prediction Error
Novelty
Information Gain
Other Advanced Methods for Exploration
Distributed Exploration
Exploration with Parametric Noise
Safe Exploration
Exploration in Multi-Agent DRL
Uncertainty-oriented Exploration
Intrinsic Motivation-oriented Exploration
Other Methods for Multi-Agent Exploration
Discussion
Empirical Analysis
Open Problems
Exploration in Large Open Space
Exploration in Long-horizon Environments with Extremely Sparse, Delayed Rewards
Exploration with White Noise Problem
Convergence
Multi-Agent Exploration
Safe Exploration
Conclusion
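Since much of the survey’s taxonomy builds on the basic exploration techniques listed in the outline above, here is a minimal sketch of two of them, epsilon-greedy and UCB action selection (toy code with illustrative hyperparameters, not taken from the survey):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def ucb(q_values, counts, t, c=2.0):
    """Upper Confidence Bound: prefer actions that look good or have rarely been tried."""
    bonus = c * np.sqrt(np.log(t + 1) / (np.asarray(counts) + 1e-8))
    return int(np.argmax(np.asarray(q_values) + bonus))

# Toy usage: pick among 3 actions given current value estimates and visit counts.
q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q))                  # usually 1, occasionally a random action
print(ucb(q, counts=[10, 10, 0], t=20))   # 2: the untried action gets a large bonus
```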
7. Conclusion
Suggestions and insights:
Current exploration methods are evaluated mainly in terms of cumulative reward and sample efficiency on a handful of well-known environments
The essential connections between different exploration methods remain to be further revealed
Studies of exploration in large action spaces, of exploration in long-horizon environments, and of convergence are relatively lacking
Multi-Agent Exploration can be even more challenging due to complex multi-agent interactions. Coordinated exploration with decentralized execution and exploration under non-stationarity may be the key problems to address.
The end
This has been this week’s post on Notes on Reinforcement Learning. Tune in next week for a review of Reinforcement Learning from Human Feedback, and why it is important for the deployment of ChatGPT.