Reinforcement Learning Notes 2: Value Alignment
Have a great week, some creepy stuff, the summary we needed, The Challenge of Value Alignment, and last week’s sorted articles
Good morning everyone!
I wish you all a very productive week, and to that end I offer you this humble Substack article with this week’s Reinforcement Learning notes.
This was a busy week, with lots of papers submitted to arxiv.org and dutifully processed by yours truly, including some ethically ambiguous stuff. On that topic, I will be delivering on last week’s promise to summarize the paper The Challenge of Value Alignment: from Fairer Algorithms to AI Safety.
Some creepy stuff
In this week’s review of articles, one technical paper grabbed my attention: Robofriend: An Adaptive Storytelling Robotic Teddy Bear. As described by the designers:
Robofriend is constructed by taking a large, 1 meter tall teddy bear, and instrumenting it to be able to move its head and arms, play videos and sound, and look at the children it is reading the story to.
Among the things Robofriend is portrayed to do are counting the number of faces, so it knows when a child has walked away or stopped looking at the camera; measuring focus and excitement on children’s faces; emitting positive or negative feedback (“great job” vs. “please be quiet”, among others); and moving its head and arms randomly.
On top of all these features, Robofriend is designed to learn through interaction to be as attention-grabbing as possible. And to think that the only two ethical considerations cited are privacy concerns and the possibility that the teddy bear might one day replace daycare teachers.
My primary concern in this case would be that children might be unable to differentiate between humans and artificial agents, developing affection for a machine optimized to monopolize their attention.
On a similar note, the paper e-Inu: Simulating A Quadruped Robot With Emotional Sentience has also been submitted, in which a simulated quadruped robot is designed to detect underlying emotions in the speech, tonality, and facial expressions of the humans present in the scene and respond accordingly.
The summary we needed
Also this week, a succinct summary of Reinforcement Learning was published by Sanjeevan Ahilan. It is aimed at “those who already have some familiarity with RL and are looking to review, reference and/or remind themselves of important ideas in the field.”
I’ve just finished the fundamentals chapter and I think it is extraordinarily useful. So, if it has been some time since you last opened Sutton and Barto’s RL Bible, this might just be your opportunity to review some important RL concepts.
Other interesting articles
If I were to highlight some articles that I found especially interesting, these would be:
Towards Deployable RL - What’s Broken with RL Research and a Potential Fix
Emergent collective intelligence from massive-agent cooperation and competition
Interpretable Learned Emergent Communication for Human-Agent Teams
I really wish I had had enough time to do a deep dive on these, but I’ll try to make up for it by reviewing Towards Deployable RL - What’s Broken with RL Research and a Potential Fix in next week’s newsletter.
By the way, you can find the complete classified list of Reinforcement Learning articles submitted to arxiv.org from January 2nd through January 9th at the bottom of this newsletter.
And without any more delay, I present to you this week’s feature:
The Challenge of Value Alignment: A Summary
There is a lot of talk about Artificial General Intelligence (AGI) and value alignment, and I think this paper, The Challenge of Value Alignment: from Fairer Algorithms to AI Safety, is a great place to start, even though it does lean a bit on the utilitarian side of ethics, frequently citing classic authors such as John Stuart Mill, David Hume or Peter Singer.
Here I’ll present what I believe are the most important ideas and concepts, chapter by chapter. To ease the flow I will focus only on the content and skip the usual “the authors argue that…”, “they believe that…”, “the text then explains…”. The authors of the paper are Iason Gabriel, a researcher at DeepMind, and Vafa Ghazavi, Executive Director of the James Martin Institute for Public Policy. Other authors will be cited where necessary.
1. Introduction
The challenge of value alignment centres upon the question of how to ensure that AI systems are properly aligned with human values and amenable to human control.
AI is a complicated subject for value alignment, since we are able to encode a richer set of values in AI systems than in simpler artifacts, and AI systems also have a greater scope of action and intelligence.
2. Technology and Values
Langdon Winner argues it is possible to instill values in technologies, with examples such as Robert Moses’ bridges, designed to limit transport between poor and rich neighbourhoods [turns out the story might be more complex], or Baron Haussmann’s redesign of the streets of Paris after the French Revolution, which eased the manoeuvring of the military. Another example might be Harari’s claim that the need for large datasets and computing power favors centralized forms of political authority.
We can say that there is no value-neutral technology, since new technologies make some outcomes more likely and others less likely, create new possibilities, and sometimes exclude certain possibilities from being realized. Technologists are engaged in a world-making activity: there is a level of responsibility, and a need for methods to ensure that technology is aligned with human values. They should think about these issues early on, including whether to develop a new technology at all.
Value alignment should also be a social undertaking, pursued through key methods such as stakeholder analysis and citizen consultation.
3. Is AI Special?
What is AI?
To Stuart Russell, “machines are intelligent to the extent that their actions can be expected to achieve their objectives”.
AI also includes Machine Learning, a family of statistical and algorithmic approaches like supervised, unsupervised and reinforcement learning.
The Potential Uniqueness of AI Systems
ML algorithms raise the same concerns about injustice, safety, and unintended consequences that arise with other technologies. They are also subject to algorithmic bias, i.e. the potential to manifest a particular set of values.
Another concern has been social value misalignment, as algorithms in the criminal justice system, healthcare and facial analysis have been found to discriminate against women and non-white folks.
Some challenges specific to AI systems are that they can make decisions or choices that are more meaningful than those encountered with earlier technologies, and that once a model has been trained it is hard to know why it decides one thing or another.
As per Daniel Dennett’s degrees-of-freedom paradigm, a simple switch that can be turned on or off by some environmental change marks a degree of freedom, and biological organisms, humans and AI networks have additional degrees of freedom, with issues of control growing complex and non-linear with each degree. Hammers and pencils are not able to respond to their environment, but artificial agents can learn new mappings between inputs and outputs, coming up with results that surprise even their human designers.
He argues that the most useful way to think about AI systems is as rational agents that have goals and intentions (intentional stance), with trajectories that "can unfold without any direct dependence on us, their creators, and whose discriminations give their internal states a sort of meaning to them that may be unknown and not in our service".
On a similar note, Luciano Floridi and J. W. Sanders regard AI systems as moral agents, due to their capacity for interactivity, autonomy and adaptability.
4. Technological Approaches to Value Alignment
The alignment of powerful AI systems requires interdisciplinary collaboration, as we need a clear understanding of both the goal of alignment and the technical means of achieving it.
Top-Down and Bottom-Up Approaches
As described in Wallach and Allen’s research, top-down approaches start by identifying an appropriate moral theory to align with and then designing algorithms capable of implementing it, while bottom-up approaches focus on creating environments or feedback mechanisms that enable agents to learn from human behaviour and be rewarded for morally praiseworthy conduct.
Top-down approaches rest on the possibility that ethical rules can be stated in computer code, and they face a dilemma: either base the rules on our own moral beliefs or on public principles (primarily utilitarianism). One example of this approach would be Isaac Asimov’s three laws of robotics.
An example of a bottom-up approach would be Inverse Reinforcement Learning (IRL), where the agent focuses on “the problem of extracting a reward function given observed, optimal behaviour”. This approach has its own set of challenges, such as the algorithm’s opaqueness and the difficulty of ensuring it is free from bias.
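To make this more concrete, here is a minimal, hand-rolled sketch of the feature-matching flavour of IRL on a toy chain MDP. Everything here (the environment, the linear per-state reward, the learning rate) is my own illustrative assumption rather than anything from the paper: we repeatedly solve the MDP for our current reward estimate and nudge that estimate toward the states the expert actually visits.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.95   # tiny chain MDP: move left or right
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0              # action 0: step left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0   # action 1: step right

true_reward = np.array([0., 0., 0., 0., 1.])  # the expert secretly values the right end

def greedy_policy(r, iters=200):
    """Value iteration for a per-state reward vector r; returns the greedy policy."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = r[:, None] + gamma * (P @ V)      # shape (states, actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def visitation(policy, start=0, horizon=50):
    """Discounted state-visitation frequencies of a deterministic policy."""
    d, s = np.zeros(n_states), start
    for t in range(horizon):
        d[s] += gamma ** t
        s = int(P[s, policy[s]].argmax())     # deterministic next state
    return d / d.sum()

mu_expert = visitation(greedy_policy(true_reward))  # the "observed, optimal behaviour"

w = np.zeros(n_states)                              # our estimate of the reward
for _ in range(100):
    mu_learner = visitation(greedy_policy(w))
    w += 0.1 * (mu_expert - mu_learner)             # push the reward toward the expert's states

print("recovered reward weights:", np.round(w, 2))  # largest weight on the right-most state
```

Even in this toy setting the recovered reward is only pinned down up to scaling and shifts, which hints at why auditing a learned reward function for bias is genuinely hard.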
Concrete Problems
Reward hacking, or reward corruption, is a situation where the artificial agent manages to maximise the numerical reward it receives by finding unanticipated shortcuts or corrupting the feedback system. A classic example is an agent trained on the game CoastRunners that prioritizes driving in circles and winning points by destroying stuff over finishing the race.
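A back-of-the-envelope sketch of why this happens (the numbers are mine, not from the CoastRunners write-up): if looping over respawning targets pays a little bit forever, a discounted reward maximiser can prefer that to a one-off bonus for finishing.

```python
# Toy illustration of reward hacking: finishing pays a one-off bonus, while
# looping over a respawning target pays +1 per lap forever. All numbers are
# illustrative assumptions, not taken from the CoastRunners experiment.
gamma = 0.95
finish_bonus = 10.0                 # intended objective: finish the race
loop_reward_per_lap = 1.0           # proxy objective: points for hitting targets

return_finish = finish_bonus        # episode ends right after finishing
return_loop = sum(loop_reward_per_lap * gamma**t for t in range(1000))  # ~ 1/(1-gamma)

print(f"finish: {return_finish:.1f}  vs  loop forever: {return_loop:.1f}")
# finish: 10.0  vs  loop forever: 20.0 -> the reward maximiser never finishes the race
```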
Other problems facing AI include agents taking the most efficient path without considering side effects, ensuring that an agent explores the world in a safe manner, and finding ways to evaluate complex agent behaviour.
Highly Advanced AI
According to Stuart Russell, the ultimate goal of AI research is the discovery of a general-purpose method that is applicable across all problem types and works effectively for large and difficult instances while making very few assumptions. This is popularly referred to as AGI, or Artificial General Intelligence.
This is closely related to Nick Bostrom’s notion of superintelligence: any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest.
Through the orthogonality thesis, Bostrom argues that more or less any level of intelligence is compatible with more or less any goal (benign or malign). This runs contrary to Derek Parfit and Peter Singer’s belief that substantial moral insight might result from the capacity for instrumental reason.
Bostrom and Russell propose versions of the Instrumental Convergence Thesis: that an AGI would display instrumental goals of self-improvement, self-preservation and resource acquisition in pursuit of its final goals, even when this works to the disadvantage of human beings.
A concern with AGI is how to provide advice and direction to an entity smarter than ourselves. To that end, techniques such as reward modelling (supplementing RL with human oversight) or safety via debate (systems debate each other, competing to provide true answers to a human operator) have been proposed.
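As a rough illustration of the reward-modelling idea, here is a minimal sketch of fitting a reward model from pairwise human preferences, Bradley-Terry style, loosely in the spirit of the “RL from human preferences” line of work. The linear reward model, the synthetic “human” and all numbers are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each trajectory segment is summarised by a 3-dim feature vector; the synthetic
# "human" secretly prefers segments with a high first feature.
true_w = np.array([2.0, 0.0, -1.0])
segments = rng.normal(size=(200, 3))

# Collect preference pairs: the human picks the segment with the higher true reward.
pairs = [(i, j) if segments[i] @ true_w > segments[j] @ true_w else (j, i)
         for i, j in rng.integers(0, 200, size=(500, 2))]

# Fit reward weights by stochastic gradient ascent on the Bradley-Terry log-likelihood.
w = np.zeros(3)
for _ in range(2000):
    i, j = pairs[rng.integers(len(pairs))]       # (preferred, rejected) pair
    diff = segments[i] - segments[j]
    p = 1.0 / (1.0 + np.exp(-(w @ diff)))        # P(preferred segment wins)
    w += 0.05 * (1.0 - p) * diff                 # gradient of the log-likelihood

print("learned reward direction:", np.round(w / np.linalg.norm(w), 2))
print("true reward direction:   ", np.round(true_w / np.linalg.norm(true_w), 2))
```

In the actual systems the reward model is trained alongside the policy on a stream of fresh human comparisons; this sketch only shows the core supervised step.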
5. The Fundamental Relevance Of Value
There are three ways to understand value alignment: alignment with instructions, alignment with true intentions, or alignment with human preferences. All three can yield misinformed, irrational or unethical results.
To achieve social value alignment, AI systems ultimately need to embody principles widely endorsed by those affected, and individual values have to be aggregated into a collective judgement, whether through utility functions, coherent extrapolated volition or other mechanisms.
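As a toy illustration of why the choice of aggregation mechanism matters (the scenario and numbers are entirely made up for this newsletter), summing utilities and majority voting can already disagree about which option a system should pick:

```python
import numpy as np

# Three stakeholders score two candidate behaviours for an AI system (made-up numbers).
utilities = np.array([
    [10, 0],   # person A cares a lot and strongly prefers option 0
    [ 4, 5],   # person B mildly prefers option 1
    [ 4, 5],   # person C mildly prefers option 1
])

utilitarian_choice = utilities.sum(axis=0).argmax()               # total welfare: 18 vs 10 -> option 0
majority_choice = np.bincount(utilities.argmax(axis=1)).argmax()  # votes: 1 vs 2 -> option 1

print(utilitarian_choice, majority_choice)  # 0 1 -> the two aggregation rules disagree
```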
Here lie two fundamental obstacles: moral uncertainty, as we are unsure whether an action or theory is morally right, and moral pluralism, as people subscribe to a variety of reasonable views and perspectives.
6. Conclusion
A summary of the above.
Papers published last week
I hope you enjoyed the summary above. Here I present the papers published from January 2nd through January 9th, sorted by my own proprietary classification.
Engineering Applications
Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response of Residential Loads
Nondeterministic efficient cooling with a near-unit probability
UAV-aided Metaverse over Wireless Communications: A Reinforcement Learning Approach
Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics
Safe Reinforcement Learning for an Energy-Efficient Driver Assistance System
Safety Filtering for Reinforcement Learning-based Adaptive Cruise Control
Deep reinforcement learning for irrigation scheduling using high-dimensional sensor feedback
Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus on Videos
Large-Scale Traffic Signal Control by a Nash Deep Q-network Approach
Fairness Guaranteed and Auction-based x-haul and Cloud Resource Allocation in Multi-tenant O-RANs
FRAS: Federated Reinforcement Learning Empowered Adaptive Point Cloud Video Streaming
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds
Interpretable Disease Prediction based on Reinforcement Path Reasoning over Knowledge Graphs
RL + Genetic Algorithms
Mathematical Theory
IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling
Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization
Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning
A Survey of Feedback Particle Filter and related Controlled Interacting Particle Systems (CIPS)
Data-Driven Optimization of Directed Information over Discrete Alphabets
Inference on Time Series Nonparametric Conditional Moment Restrictions Using General Sieves
Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization
On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control
Reinforcement Learning Theory
Provable Reset-free Reinforcement Learning by No-Regret Reduction
Centralized Cooperative Exploration Policy for Continuous Control Tasks
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
Data-Driven Inverse Reinforcement Learning for Expert-Learner Zero-Sum Games
Learning-based MPC from Big Data Using Reinforcement Learning
Attention-Based Recurrency for Multi-Agent Reinforcement Learning under State Uncertainty
Towards Deployable RL - What’s Broken with RL Research and a Potential Fix
Contextual Conservative Q-Learning for Offline Reinforcement Learning
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation
Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning
Hypernetworks for Zero-shot Transfer in Reinforcement Learning
Solving Collaborative Dec-POMDPs with Deep Reinforcement Learning Heuristics
GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond
Robust Imitation via Mirror Descent Inverse Reinforcement Learning
Phantom - A RL-driven multi-agent framework to model complex systems
DM2: Decentralized Multi-Agent Reinforcement Learning via Distribution Matching
Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization
Financial Applications
Transformer Theory
Game Theory
Ethically Ambiguous
Ethics
Human Agent Cooperation
Finishing thoughts
I hope you enjoyed this edition of Notes On Reinforcement Learning. Please subscribe if you haven’t, and you’ll receive the summary of Towards Deployable RL - What’s Broken with RL Research and a Potential Fix in next week’s newsletter.