Indirect Normativity
Alignment Challenges
Given the critical role of ethics in AI safety, it’s deeply concerning to see such significant disagreement among experts who have rigorously studied ethics. The divergence in moral and meta-ethical perspectives among these experts poses a serious question: How can we effectively align AI if the very foundations of ethical understanding are not aligned?
Intelligence is powerful – and with it humanity has achieved great things. The values that a superintelligent AI adopts will shape the trajectory of the future in far-reaching ways. If we fail to design AI with the right values, it might optimize for goals that conflict with human interests, potentially leading to catastrophic outcomes.
Nick Bostrom has likened solving AI safety in time to “philosophy with a deadline”. Polls asking when there will be a 50% chance that human-level AI exists show half the respondents saying before 2068 in an earlier survey, and 2061 in a more recent one. Deep learning pioneer Yoshua Bengio previously estimated that human-level AI would take “decades to centuries”, but more recently shortened his estimate significantly, to a 90% probability of achieving human-level AI within 5 to 20 years (Y. Bengio, Aug 2023). On this timeline it seems highly unlikely that we will have resolved the fundamental issues of ethics, value, and meaning in time.
If we actually solve the technical and coordination challenges of alignment but fail to solve the normative issues, it might be better for the first AGI systems to be relatively benign. This would allow us to shut them down or contain them, and could prompt the necessary urgency to address the ethical, value, coordination, and meaning issues. Without resolving these, AI may be rudderless in turbulent oceans of moral uncertainty – and we may drift toward potentially catastrophic futures.
The challenge of value alignment in AI systems is to ensure they are robustly and fairly aligned with good values. This challenge can be decomposed into two main parts:
- The technical challenge of aligning AI to values
- The normative challenge of what values to align to
and relatedly
- The coordination challenge of ensuring that all relevant stakeholders work together effectively
Note: Normative generally means relating to some evaluative standard.
For reasons why value alignment is important, see the Instrumental Convergence thesis and the Orthogonality thesis.
Motivating good AI behavior by direct specification alone has been shown to be extremely problematic (e.g. as dramatized in Isaac Asimov’s novels), but I feel it could be useful in conjunction with other approaches. Perhaps directly specifying what we have good reason to absolutely value.
We want an alignment approach that hopefully does not lock in any particular kind of alignment or alignment procedure. We want alignment to ensure existential security for humanity, reducing existential risk ever closer to zero over time; then, less hindered by screaming imperatives, we will have time to reflect deeply on matters of import – a long reflection.
Three layers of questions:
- How can we get a superintelligence to do what we want?
- What do we want the superintelligence to want?
- What should we want the superintelligence to want?
In AI safety, indirect normativity is a form of motivation selection that attempts to find values indirectly by deferring some of the process of value discovery to powerful AI – for example, by reference to what a rational agent would value under idealized conditions – rather than via direct specification, domesticity, augmentation, or capability control methods.
Existential Security
Avoiding extinction saves lives!
Since extinction would cancel the potential for an astronomical number of valuable lives worth living across the mind-boggling vastness of the potential cosmic endowment, achieving existential security (which entails extreme civilizational longevity) is a dominant goal in the short term. It obviously involves trade-offs, though. If we take into account the vastness of unnecessary suffering, today’s world is far from ideal: many humans live at subsistence, starving or dying of preventable diseases, and on top of that many people’s revealed preferences point to a desire for more luxury, especially in the first world – the nagging appetite for status, the racing ambition to have more than one’s neighbors. How acceptable is it to slow the pace of gains in luxury in favor of speeding up progress toward existential security? Would a consensus vote for this?
Humanity may find it unpalatable to sacrifice large potential gains in quality of life for the abstract notion of existential security, or may be impatient with the idea that humanity’s foremost challenge is this kind of existential security while so many people are left to lead lives that are less than they could be – everyone wants to be all they can be as fast as possible.
Perhaps Indirect Normativity can help make the trade-offs more palatable – for instance, AI may help solve human aging so that people don’t need to die unnecessarily early. That would be pretty amazing by current standards – and perhaps then we could patiently wait for resilient existential security to be achieved.
Long Reflections
Assuming, to a reasonable degree, that existential security and stability are achieved, AI is aligned, and we are living in a good-enough proto-utopia, then before we rush to configure all the material in our galaxy into atomically precise hedonism, we can take the time for a long reflection: a marathon philosophy conference on areas of moral uncertainty and disagreement – on value, the nature of the good, and what might or might not be true about ethics.
What sorts of issues might we reflect on?
- Enhancing Security and Fairness: How can we further improve security, fairness, alignment, and social configuration?
- Drawing Further Inspiration from Our Idealized Selves: work toward inching closer to smarter, wiser, more moral beings who have had ample time to consider ethical questions.
- Expanding the Moral Circle: broaden our moral considerations to include the basic needs of non-human animals and artificial life and advance human and non-human rights.
- Understanding Qualia: investigate qualia, experiential texture, and hedonic tone. How can we offload suffering’s functions onto technological prostheses and further reduce existential risk to nearly zero?
- Clarifying Metaethics: aim to gain a clearer understanding of what constitutes moral rightness, whether through moral realism, moral anti-realism, existentialism, or other ethical frameworks.
- Balancing Suffering and Future Life: evaluate the trade-off between reducing current suffering and preserving future life. Is our current approach correct, or does it need adjustment?
- Achieving Broad Consensus: We must address the political challenge of gaining broad endorsements from different agents. Does a cognitively supreme AI have compelling arguments for a singular moral trajectory, or is a reasonable pluralistic trajectory more satisfactory?
An iterative approach to these questions ensures that we are continually refining our ethical and social frameworks, allowing for a more nuanced and comprehensive understanding of what a truly aligned and morally sound superintelligence would look like.
Adequate alignment → Semi-superintelligent Oracle → Indirect Normativity → Existential Security → Utopia 1 → Superintelligence → Long Reflection → More Existential Security → Utopia 2 → Long Reflection → Utopia 3 → … → Long Reflection → Utopia n
… and enjoy our captured share of the universe for hundreds of trillions of years (1.7×10^106 years).
Why Indirect Normativity?
As alluded to earlier, installing the wrong value-set into AI could be catastrophic.
We should be wary of taking a cavalier approach that constrains the AI to ‘behave as we currently see fit’. As we rediscover over and over again, we are most often wrong to some degree.
Artificial intelligence has helped us detect a lot of errors and misinformation:
- Medical: detected errors in medical diagnoses and suggested more accurate ones
- Social Sciences: revealed biases in datasets used for social research, leading to more accurate understandings of social phenomena.
- Drug Discovery: predicted potential drug candidates faster and more accurately than traditional methods, identifying compounds that might have been overlooked.
- Climate Science: improved climate models by identifying errors in historical data and refining predictions of future climate patterns.
- Fake News Detection: detected and flagged fake news, helping users differentiate between accurate and inaccurate information.
Likewise, AI could help us correct and improve our understandings of values and ethics.
Over history we’ve seen a tremendous amount of evolution in fundamental moral convictions. Some have changed dramatically. This should give us pause to rethink whether we have stumbled on the correct set of moral principles. Though AI will be trained on our data, which reflects our moral convictions over time, perhaps AI without our inherent biases might be able to extrapolate further what ideal moral principles will look like.
It may be dangerous to align AI by directly specifying a set of rules, the wrong axioms for the AI’s decision theory, or a misguided epistemology. Human values are complex and context-dependent, and prone to oversimplification. Rule-based systems are also likely to be inflexible to nuance, producing outcomes that are technically compliant but morally or practically unacceptable. If we fail to encode all the necessary rules and exceptions, any misunderstandings or gaps could result in fragile alignment, or an AI overfit to narrow rules that are inadequate for dynamic, real-world states of affairs. AI systems need robust epistemology to handle uncertainty, learn from new information, and correct their understanding – it is doubtful that direct specification can provide this.
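As a toy illustration of this fragility (a minimal sketch with invented rules, not a real proposal), consider a rule-based action filter: it is technically compliant with everything its authors wrote down, yet it approves harms they simply did not anticipate.

```python
# Toy illustration of brittle direct specification (hypothetical rules).
# The filter only knows the rules it was given; anything the rule authors
# failed to anticipate passes through as "compliant".

RULES = [
    lambda action: not action.get("causes_direct_physical_harm", False),
    lambda action: not action.get("involves_deception", False),
]

def is_permitted(action: dict) -> bool:
    """An action is 'compliant' iff it violates none of the encoded rules."""
    return all(rule(action) for rule in RULES)

# Technically compliant, morally unacceptable: the harm is indirect,
# so no encoded rule fires.
action = {
    "description": "divert the town's water supply to cool the datacenter",
    "causes_direct_physical_harm": False,
    "involves_deception": False,
}
print(is_permitted(action))  # True -- the gap is in the rule set, not the world
```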
- Given our human limitations, how can we be trusted with the technical challenge of aligning AI to our values, and the normative challenge of knowing what those values should be?
- What nature of epistemic humility is required to avoid locking in prejudices and preconceptions?
Indirect normativity is about offloading onto superintelligence at least some of the cognitive labor involved in the normative challenge of discovering a coherent set of values that lie within the intersection of ‘moral rightness’ and the best of preferred human values.
If we know we aren’t unboundedly rational agents, then we know we are fallible – so what would our idealized selves want? (I’ll discuss epistemic deference later on.)
Ideal Advisors or Ideal Observers?
I’ll first illustrate the concept of a utility monster – a theoretical entity that maximizes its utility at the extreme expense of others.
Imagine an ideal advisor as a hyper-rational, ultra-creative agent solely focused on its advisee’s interests; if there are ethical guidelines, they are secondary to the overriding obligation to optimize the individual advisee’s private interests. It should be no surprise that such an optimization process acting on narrow private interests would inevitably disregard the interests of others – potentially leading to an AI-enabled utility monster which steeply weights its own utility over that of others.
Our world, constrained by limited resources, drives Malthusian dynamics fueling competition, innovation races and resource monopolization. As we advance with this overall dynamic towards superintelligence (which will bring with it widespread job loss to automation), we face potential winner-takes-all scenarios. In this context, coordinating our way out of a multi-polar trap becomes exceedingly challenging.
Now, consider an ideal observer – an agent that is fully informed, vividly imaginative, calm-minded, and impartial. Unlike the ideal advisor, the ideal observer has the interests of all agents at heart.
According to Richard Brandt’s ethical naturalism:
“The main idea .. is that ethical terms should be defined after the pattern of the following example: “x is better than y” means “If anyone were, in respect of x and y, fully informed and vividly imaginative, impartial, in a calm frame of mind and otherwise normal, he would prefer x to y.” – Brandt, Richard (1959). Ethical Naturalism
An ideal observer can mitigate utility monster scenarios by ensuring that no single agent’s interests are maximized at the expense of others. This broader ethical perspective can help avoid the multi-polar traps and AI arms races by fostering cooperation over competition, ensuring that the AI’s actions benefit all agents fairly rather than leading to zero-sum outcomes. This holistic approach could align AI development with global well-being, reducing incentives for destructive competition and monopolization of resources.
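To make the contrast concrete, here is a minimal sketch with invented utility numbers: an ideal advisor scores outcomes by its advisee’s utility alone, while an ideal observer scores them by an impartial aggregate over all parties.

```python
# Hypothetical outcomes: each maps a party to a utility score (invented numbers).
outcomes = {
    "monopolize_resources": {"advisee": 100, "everyone_else": -50},
    "cooperate":            {"advisee": 60,  "everyone_else": 55},
}

def ideal_advisor_score(outcome):
    # Only the advisee's private interests count.
    return outcome["advisee"]

def ideal_observer_score(outcome):
    # Impartial: every party's interests are weighted equally.
    return sum(outcome.values()) / len(outcome)

print(max(outcomes, key=lambda o: ideal_advisor_score(outcomes[o])))   # monopolize_resources
print(max(outcomes, key=lambda o: ideal_observer_score(outcomes[o])))  # cooperate
```

The same outcome set yields the utility-monster dynamic under the advisor objective and cooperation under the observer objective – the difference is entirely in whose interests enter the score.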
However, adopting an ideal observer approach might also lead to outcomes akin to the repugnant conclusion, where the AI’s impartiality and drive to optimize overall well-being could result in prioritizing a large number of lives with minimal welfare over fewer lives with higher welfare. This highlights a potential ethical dilemma, where maximizing impartial goodness could, paradoxically, lead to morally questionable outcomes. There are many proposed ways around the repugnant conclusion, though no consensus has yet emerged on the right way forward. This might be an issue so complex that it’s better addressed during a long reflection, where more time and resources can be dedicated to finding a suitable solution.
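A quick back-of-the-envelope illustration, with arbitrary numbers, of why impartial total-welfare maximization points in this direction:

```python
# Total-welfare comparison with made-up numbers.
world_A = {"population": 10**9,  "welfare_per_person": 100}  # fewer, flourishing lives
world_Z = {"population": 10**12, "welfare_per_person": 1}    # far more lives, barely worth living

def total_welfare(world):
    return world["population"] * world["welfare_per_person"]

print(total_welfare(world_A) < total_welfare(world_Z))  # True: the "repugnant" world Z wins on totals
```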
Applied to indirect normativity, if AI is to serve many, it should function like a hyper-rational ideal observer. As Nick Bostrom suggests in ‘Superintelligence’, an ideal observer would ‘achieve that which we would have wished the AI to achieve if we had thought about the matter long and hard’ (Bostrom, 2014, p. 141). This impartiality ensures that the AI considers the broader impact of its actions, promoting a more ethical and balanced approach to decision-making.
The ideal observer thesis may not be achievable with the kind of AI available during the initial phases of an indirect normativity project, due to the AI’s incomplete information and understanding. Therefore, we might need to take an iterated approach to indirect normativity – informed more by something akin to Rawls’ ‘veil of ignorance’, which promotes fairness and impartiality even with limited knowledge. Perhaps a flavor of the veil of ignorance could guide decision-making without bias, aiming at a fair society for all humankind and for morally relevant non-human animals. This approach could serve as a practical interim solution, helping to lay the foundation for future, more powerful AI systems to employ something closer to ideal-observer-informed indirect normativity.
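One toy way to picture such an interim, veil-of-ignorance-style procedure – assuming, purely for illustration, that outcomes can be scored per social position – is to evaluate each policy as if the decision-maker did not know which position it would occupy, for instance via a Rawlsian maximin rule:

```python
# Hypothetical welfare by social position under two candidate policies.
policies = {
    "winner_takes_all": [300, 40, 5, 1],
    "broadly_shared":   [120, 90, 70, 60],
}

def veil_of_ignorance_choice(policies):
    # Not knowing which position you would occupy, judge each policy
    # by its worst-off position (a maximin rule).
    return max(policies, key=lambda name: min(policies[name]))

print(veil_of_ignorance_choice(policies))  # broadly_shared
```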
The Principle of Epistemic Deference
We may be fuzzy about what we ourselves truly want, what is in the best interests of humanity or of all sentience, or what is really moral. Our brains afford us enough cognition to understand that there is a lot we don’t know. Epistemic humility tells us that we shouldn’t be so confident as to assume we know all the facts, have all the best morals, and aren’t just guessing. So why not delegate some of the cognitive labor to AI?
“A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.” – Nick Bostrom, Superintelligence
Here is a brief overview of issues with AI safety methods:
Capability Control: So far, none of the methods of capability control (boxing, incentives, stunting, tripwires) is reliably leak-proof.
Motivation Selection: Ensuring the adequacy of norm-selecting procedures in motivation selection strategies is problematic.
- Direct specification is fragile; it is too difficult to encode the complexity of human values
- Augmentation ends up maximizing some of the bad human values along with the good, and may be blind to objectively good values we don’t yet recognize or care about
- Gradual domestication may fail to ‘tame’ the AI into being modest and non-ambitious/non-maximizing in pursuit of its goals
With well-designed motivation selection, we may epistemically defer to AI both the technical challenge of filling in the gaps of our alignment strategies and the epistemic challenge of better value discovery.
Reflection on AI Safety Progress:
- Exhausting All Approaches? Have we truly explored all possible avenues in AI safety? Almost certainly not.
- Wisdom and Resources for Discovery? Do we possess the wisdom, expertise, time, and patience to fully explore and implement these approaches before unaligned entities race to dominate AI? Likely not.
- Avoiding Other Looming Catastrophes? Are we confident that crises like climate change, nuclear war, or man-made pandemics will not interfere with our efforts to achieve desirable, safe AI? Almost certainly not.
- Default AI Outcome? Are we prepared to accept the default, potentially dangerous, outcomes of AI without robust safety measures? Absolutely not.
Hopeful Prospect:
- Harnessing AI for AI Safety? Can we leverage AI’s capabilities to enhance our efforts in AI safety? Almost certainly yes! (See Timaeus)
Rather than specifying a concrete normative standard directly, we specify a process for deriving a standard. We then build the AI system so that it is motivated to carry out this process and to adopt whatever standard the process arrives at, and then epistemically defer to it on matters of import.
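A schematic of the difference, purely illustrative: derive_standard and toy_idealize below are invented stand-ins for whatever extrapolation or deliberation process is actually chosen, not procedures we currently know how to write.

```python
# Direct specification hard-codes the normative standard up front.
direct_standard = "maximize the fixed objective the designers wrote down"

# Indirect normativity instead specifies a *process* for deriving a standard;
# the agent is built to carry out that process and adopt whatever it returns.
def derive_standard(initial_values, idealize, rounds=3):
    """Toy stand-in for an idealization process: repeatedly refine the
    starting values under (stipulated) better information and reflection."""
    values = initial_values
    for _ in range(rounds):
        values = idealize(values)
    return values

# Hypothetical refinement step -- in reality this is the hard part we offload.
toy_idealize = lambda values: sorted(set(values + ["avoid lock-in"]))

standard = derive_standard(["reduce suffering", "preserve autonomy"], toy_idealize)
print(standard)  # the agent then adopts, and defers to, this derived standard
```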
To reiterate, its promise lies in the fact that it could let us offload onto AI much of the difficult cognitive work that a direct specification of an appropriate final goal would require of us.
Note that this principle would also apply to a superintelligence whose cognitive power may someday be dwarfed by other, greater superintelligences that still don’t know everything there is to know relevant to ideal final goals or values. So the idea of actually knowing Final Goals or Final Values should be regarded with humility and skepticism.
Approaches to Indirect Normativity
Two primary approaches to aligning AI with good values via Indirect Normativity are ‘Coherent Extrapolated Volition’ (CEV) and ‘Moral Rightness’ (MR). CEV aims to extrapolate and codify human values for AI to follow, while MR focuses on identifying and implementing a set of ethical principles for AI behavior. Both methods face challenges in accurately capturing human values or establishing universal moral frameworks. There is also a third approach, which is a mixture of CEV and MR.
Coherent Extrapolated Volition
“Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.” – Eliezer Yudkowsky
This raises questions like:
- What would ideal versions of ourselves value?
- Do we harbor anti-social, selfish, or confused preferences?
- What about evil preferences?
- What about aggregation issues surrounding incompatible preferences? It’s obviously hard to aggregate conflicting preferences across different ideologies.
- Is AI only to extrapolate preferences where they cohere, missing out on all the preferences which don’t?
Whose Preferences? The Extrapolation Base
Would everyone be included in the extrapolation base? If not, which segments of humanity?
Our world contains extremist lunatics and crazy sadists, whose preferences are directly at odds with most others’ preferences – whose happiness is achieved through others’ suffering. One would assume we shouldn’t include sadistic and anti-social preferences in the extrapolation base.
Who should be part of the extrapolation base:
- Should the welfare of non-human animals or digital minds be considered?
- All adults?
- The sponsors of the AI project?
- Are there particular ethics professors or moral experts whom we should give extra weight to?
- Should everyone be part? Or should we exclude unethical people (sociopaths, psychopaths), religious fundamentalists, and cult leaders?
The “extrapolation base” refers to the initial set of values, desires, and preferences of humans that the AI uses as a starting point for its extrapolation process – hopefully reflecting what humans would want, if they were more informed, more rational, and had undergone further cognitive development and moral growth. This raises the question of whether preferences converge if all humans were the wiser idealized versions we spoke about earlier.
As mentioned previously, AI, using its superior cognitive abilities, will extrapolate from this base to determine what humans would collectively wish for themselves if they had better understanding and reasoning capabilities.
It includes not only the present preferences and values but also considers how these might change under idealized conditions of increased knowledge and rationality.
It may be difficult for AI to correctly predict what an ideal human, or better yet, an ideal observer would want. And to the extent that it could, humanity may not want to sign off on it, if it required commitment to values that are completely incomprehensible and alien. So the AI may, for the sake of non-domination, just extrapolate to somewhere near the immediate and potentially flawed preferences of individuals or groups.
This concept aims to capture a more refined and collective human will, guiding the AI in making decisions that are in the best long-term interests of humanity as a whole. The complexity and abstract nature of defining and operationalizing this extrapolation base is one of the significant challenges in implementing CEV.
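A minimal toy sketch of what ‘operationalizing the extrapolation base’ could look like (the members, preferences, and idealization step are all invented for illustration); the final filter keeps only the preferences on which the extrapolated volitions cohere:

```python
from collections import Counter

# Toy extrapolation base: each member's current (possibly confused) preferences.
extrapolation_base = {
    "alice": ["more knowledge", "status over neighbours", "health for all"],
    "bob":   ["health for all", "more knowledge", "revenge on rivals"],
    "carol": ["health for all", "more leisure", "more knowledge"],
}

def extrapolate(preferences):
    """Toy idealization: drop preferences we stipulate a better-informed,
    calmer version of the person would abandon (purely positional/antisocial ones)."""
    flawed = {"status over neighbours", "revenge on rivals"}
    return [p for p in preferences if p not in flawed]

def coherent_extrapolated_volition(base, threshold=1.0):
    extrapolated = {name: extrapolate(prefs) for name, prefs in base.items()}
    counts = Counter(p for prefs in extrapolated.values() for p in set(prefs))
    # Keep only preferences where the extrapolated volitions cohere
    # (here: endorsed by everyone); the rest is left unextrapolated.
    return [p for p, c in counts.items() if c / len(base) >= threshold]

print(coherent_extrapolated_volition(extrapolation_base))
# e.g. ['more knowledge', 'health for all'] (order may vary) -- divergent
# preferences are simply not acted on
```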
Non-Human Source Material
What will be the training data? Which texts (books, essays, papers, etc.) are the clearest and most reliable sources for the AI to prioritize when learning values?
Without an agreed-upon set of ethics, how can we decide on the criteria for choosing what to include in the source material? How should all this be decided?
What if a state actor or the owner of a large tech company got control of the process of seeding the AI – and included selfishly parochial values? Or a religion whose tenets rank its followers far above all others strove to make apocalyptic prophecy come true?
A Limitation of CEV:
There are cherished values we don’t yet have adequate empirical data to support, or can’t explain well; perhaps we should avoid taking a leap of faith with them until we understand them better. It might be best to be careful what we wish for, and to avoid totalizing solutions based on flawed value extrapolations.
What about values we have not thought about yet? In the gulf between the upper limits of what we can reason about and what the AI can reason about, there may be value whose signatures look unintuitive, noisy, or alien to us. Should AI take these values into account without putting them before us for consideration? Short of us being cognitively uplifted, AI may be able to translate these value signatures into forms more familiar to us, at the expense of detail, so that at least we get to ponder them before we give the go-ahead for an implementation.
Don’t let the perfect be the enemy of the good. It may be difficult to see us arriving at a solved universal moral theory that people are ‘epistemically obliged’ to follow, though if we can achieve existential security and stability there will be plenty of time to discover new value; we don’t need to have it all at once. Perhaps discovering a universal moral theory is one of the things we can strive for during periods of long reflection.
Conflict over who gets to define what AI values
The CEV proposal aims to prevent conflict over the seed dynamics of the first superintelligent AI. While it mitigates some motives for conflict, it doesn’t eliminate them entirely.
The CEV approach involves significant complexity and ethical challenges, especially concerning who gets included in the initial extrapolation base and how to handle conflicting interests and values.
These points highlight the intricate ethical and practical issues involved in designing a superintelligence that can accurately represent and act upon humanity’s collective volition.
Individuals, groups, or nations motivated by power-seeking and private interests might attempt to capture the AI seed dynamics and dominate the future by excluding others from the extrapolation base, rationalizing this through various arguments, such as the sponsor deserving ownership or the project imposing risks.
Distrust in extrapolation dynamics – Competing groups might distrust each other’s preferred extrapolation rules, leading to conflict over which approach to take.
Political Implications of Perfect Rightness?
Humans may not want perfect rightness – to the extent we want good outcomes, we want to be included in those good outcomes.
Moral Rightness
As beings of limited rationality, some of us have discovered what look like core axioms of the logical space of morality. If modern LLM systems are trained on human data, they are learning the values and morals on which beings of limited rationality converge. This alone may not generate significant new moral insight. We’d want AI to be in a position to research further and deeper – and just as we have discovered self-evident truths in mathematics and logic, we may also discover new self-evident truths of value and ethics. If successful, we could scaffold more powerful AI on this knowledge, such that the AI may end up far more impartial and moral than any human, eventually affording it the capability to fully explore and realize universal truths of morality. Does knowing moral rightness better than any human mean AI would care? I don’t know, but hopefully AI would also help with the technical challenge of aligning AI to morally right values.
So, yes, in summary it would be amazing to have an AI on our side that could further the important goals of moral learning and moral epistemology beyond our capability – an ongoing cultivation of increasingly enlightened and wholesome understandings of moral value and critical thinking, always moving toward the horizon where new truths come into focus, with the AI aiming to continuously discover better epistemics (better ways of thinking) and, at the object level, what is morally right, and to act accordingly. Given AI’s superior cognitive abilities, and humans’ biases and other limitations, it may be wise to epistemically defer to the AI, since it could be better than humans at understanding and executing morally right actions.
Assuming AI can discover MR, it may be a better option than CEV alone, because:
- The result may be simpler than CEV, since it 1) avoids decoherence among extrapolated volitions and 2) is not constrained by the dimensions of the extrapolation base
- It reduces the likelihood of moral failures due to an overly narrow or overly broad extrapolation base
- It increases the likelihood that the AI pursues morally right actions.
I’ll talk about Moral Permissibility, a combo of MR + CEV, later.
Moral Realism
During the 19th-century debate about the nature of life, people thought there had to be a special élan vital, a spiritual essence of life – they thought that life couldn’t come from mere configurations of matter. Now we know a lot more about biochemistry, and it turns out that life does come from configurations of matter.
Perhaps the same is true of value – there may be no need to invoke a ‘value vital’, an extra non-physical essence added to the world. We may find that value is there in features of the natural world – or, you could say, in morally relevant features supervening on natural features.
Personally I don’t think the is/ought distinction stops us from making empirically informed moral progress – I’m sure we’d be happy with knowing which particular kinds of natural conditions satisfy normative concepts from which we scaffold moral knowledge.
At the very least, superintelligence could understand the range of possible morally relevant features of reality and use them to inform strategies for moral progress.
P.S. Moral realism is winning now, whereas anti-realism was the dominant view a few decades ago. See the PhilPapers survey.
Moral Permissibility
The maximal, the right, and the permissible.
Maximal values may take really strange forms:
- Universal optimization: Maximizing some objective function across the entire universe, potentially leading to outcomes that are incomprehensible to humans – we may not want to maximize X until we comprehend all the ramifications, because it could mean converting the universe into hedonium – which includes our atoms.
- Existential security: We could also take existential security too far, spending all our resources on driving the likelihood of extinction ever closer to zero at the expense of well-being.
Anyway, it seems unlikely that justified ultra-confidence in maximal values will arrive any time soon, which behooves us to apply epistemic humility – and that brings us back to moral rightness.
As mentioned, moral rightness might be considered a kind of human-compatible ethics – prioritizing human flourishing and well-being, based on principles such as autonomy, justice, and beneficence – similar to what I think Stuart Russell seeks to achieve in his book ‘Human Compatible’.
So, Moral Permissibility: moral rightness may be too much of a tall order. Achieving a consensus on moral rightness is a formidable challenge due to the diversity of human values and cultures. If it’s too difficult to get some kind of ethically pluralist agreement between values and cultures, we may have to settle for the morally permissible common ground – perhaps we get to hang around in pluralist utopias while AI goes off and turns other parts of the universe into something maximal or more morally right.
Alternatively, Superintelligence may be in a position to create a series of bubble utopias, where there is less compromise between values and cultures.
Reasonable Pluralism
Iason Gabriel has articulated the challenge of reconciling diverse views in a pluralistic world while aligning AI systems to treat people fairly despite those differences. He noted that people envision AI alignment in various ways. Some imagine a human parliament or a centralized body providing coherent and sound value advice to AI systems, thereby addressing pluralism and offering robust guidance on agreed-upon best practices.
Conversely, other visions for AI do not rely on such a centralized human element. Instead, they envisage scenarios with multiple AIs, each paired with a human interlocutor, or AIs working independently to achieve constructive goals. In these cases, the AI would need to perform value calculations or syntheses as part of its default operations. Gabriel highlights that the type of AI system we pursue will determine how real-world political institutions must be tailored to support that vision effectively.
Even if, with the aid of AI, we were to converge on reasonably accurate values – values we had good reason to believe were on the trajectory to ideal moral values – communicating the goodness of these values to all stakeholders, and the process of integration and consensus-seeking, may still be quite difficult.
Overconfident Final Values
If neither we nor AI knows all there is to know, then we can’t be too confident in any final values. History is replete with examples of overconfident leaders doing terrible things to other people, which should give us pause.
Path Dependencies & Architecture
The technical design of AI systems, like that of most technologies, is not value-agnostic.
Sometimes we assume that it will be just as easy to align AI with one moral system as with any other – e.g., hedonistic utilitarianism, virtue ethics, or deontology. But the design and path dependencies of AI systems, including LLMs, significantly influence the moral theories they can effectively implement, and this will influence the dynamics of how such AI systems approach Moral Rightness. Utilitarianism appears to be the most natural fit for many current AI architectures, particularly those based on reinforcement learning. However, with deliberate design choices, it is possible to steer AI systems towards other moral frameworks such as deontology or virtue ethics, though this requires careful consideration of the underlying architecture and training methodologies.
So if we wanted to install, say, a rights-based theory or a deontological theory into AI, we should be clear that’s what we want to do and design accordingly.
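A sketch of the architectural point, with hypothetical actions and numbers: a plain reward-maximizing loop behaves like a total-welfare optimizer by default, whereas installing a rights-based or deontological theory means adding explicit constraints that the maximization step cannot trade away.

```python
# Hypothetical candidate actions with aggregate-welfare scores and a rule check.
actions = [
    {"name": "coerce_minority_for_majority_gain", "welfare_gain": 90, "violates_rights": True},
    {"name": "negotiated_compromise",             "welfare_gain": 70, "violates_rights": False},
]

def utilitarian_policy(actions):
    # What plain reward maximization does by default.
    return max(actions, key=lambda a: a["welfare_gain"])["name"]

def deontologically_constrained_policy(actions):
    # Filter out rule-violating actions *before* maximizing, so the
    # constraint is not just another term to be traded off.
    permitted = [a for a in actions if not a["violates_rights"]]
    return max(permitted, key=lambda a: a["welfare_gain"])["name"]

print(utilitarian_policy(actions))                 # coerce_minority_for_majority_gain
print(deontologically_constrained_policy(actions)) # negotiated_compromise
```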
Reasons To Be Hopeful
Generative AI and LLMs seem to make things more hopeful, according to Toby Ord, who said:
“These new systems are not (inherently) agents. So the classical threat scenario of Yudkowsky & Bostrom (the one I focused on in The Precipice) doesn’t directly apply. That’s a big deal…
…these new AI systems have read vast amounts of human writing. Something like 1 trillion words of text. All of Wikipedia; novels and textbooks; vast parts of the internet. It has probably read more academic ethics than I have and almost all fiction…These systems thus have a vast amount of training signal about human values. This is in big contrast to the game-playing systems, which knew nothing of human values and where it was very unclear how to ever teach them about such values. So the challenge is no longer getting them enough information about ethics, but about making them care.”
There are a variety of areas where LLMs seem strong, but one that stands out is contextual understanding: an advanced understanding of context may help AI align with virtue ethics to some extent, but LLMs’ primary function often revolves around generating coherent and contextually relevant text, which doesn’t necessarily align with a strict moral code.
Conveying Research and Findings During Indirect Normativity
During an Indirect Normativity project, AI can help in research and presentation of various potential trajectories for civilization. By creating detailed simulations of different societal trajectories, AI can explore variations in economic systems, governance structures, environmental policies, and technological adoption. Each simulation would vary parameters such as population growth, resource allocation, technological advancement, and ethical norms to predict various outcomes.
Performance indicators (a toy comparison sketch follows this list):
- Quality of Life: AI can measure and report on indicators such as average lifespan, access to healthcare, education levels, and overall well-being.
- Lives Saved: AI can track and report the number of lives saved through improvements in healthcare, safety measures, and conflict resolution.
- Units of Wellbeing: AI can quantify and compare the subjective well-being of individuals in different scenarios using well-being indices.
- Quality of Wellbeing Units: Beyond just quantity, AI can assess the depth and richness of well-being experienced, taking into account factors like mental health, happiness, and fulfillment.
- Meaning Generated: AI can analyze the degree to which individuals find meaning and purpose in their lives under different societal conditions.
- Environmental Sustainability: AI can evaluate the sustainability of different trajectories, measuring factors like carbon footprint, biodiversity preservation, and resource use efficiency.
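A minimal sketch, with invented trajectory data and indicator names, of how such indicators might be summarized and compared across simulated trajectories:

```python
# Invented outputs from two simulated civilizational trajectories.
trajectories = {
    "rapid_growth":   {"avg_lifespan": 95, "wellbeing_index": 6.8, "xrisk_per_century": 0.10, "sustainability": 0.5},
    "security_first": {"avg_lifespan": 90, "wellbeing_index": 6.5, "xrisk_per_century": 0.01, "sustainability": 0.8},
}

def summarize(name, t):
    # Condense each trajectory's indicators into one comparable line.
    return (f"{name}: lifespan {t['avg_lifespan']}y, wellbeing {t['wellbeing_index']}/10, "
            f"x-risk {t['xrisk_per_century']:.0%}/century, sustainability {t['sustainability']:.0%}")

for name, t in trajectories.items():
    print(summarize(name, t))
# Stakeholders (not the AI alone) then weigh the trade-offs these summaries expose.
```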
To convey its findings intelligibly and effectively, AI can utilize a range of reporting mechanisms. Interactive dashboards can present data and insights visually, allowing stakeholders to explore different outcomes. Comprehensive reports would summarize findings, highlight key insights, and recommend potential paths based on simulation results. Furthermore, AI can create presentations and conduct briefings to explain its research to policymakers, ethicists, and other stakeholders. Continuous updates would ensure decision-makers have the most current information as new data becomes available and simulations are refined.
Ethical considerations are crucial in this process. AI can incorporate feedback from a wide range of stakeholders to align simulations and reports with diverse values and perspectives. Clearly explaining the methodologies used in simulations and reporting ensures transparency and understanding. Existing and new approaches can become more comprehensive as new issues come into focus and the landscape of possibility becomes clearer, allowing AI to assist in navigating complex ethical and societal challenges and helping humanity make informed decisions about future trajectories.
Resources
Superintelligence by Nick Bostrom
Podcast: Iason Gabriel on Foundational Philosophical Questions in AI Alignment & related paper Artificial Intelligence, Values and Alignment by Iason Gabriel