
AI Alignment to Moral Realism

I think the worry that AI could do a lot of harm is justified.

I argue that an AI with a posteriori moral knowledge would be able to arrive at the most reliably good ethics. My hope is that AI would leverage the scientific method to scaffold its ethics, so that what makes something objectively good is how it relates to something empirically observable.

Making progress in ethics with the same kind of traction we see in modern technological progress would be awesome. Harnessing the virtues of science (testability, falsifiability, etc.) in service of ethical progress would give us the tangible grip on real-world phenomena required to scaffold reliable ethical outcomes. One would hope that sound engineering principles informed by science went into erecting the building you spend most of your time in, or into the chip fabrication process behind the computer you spend most of your time using.

If we wanted to make serious moral progress, we could epistemically defer to AI if it is cognitively superior in relevant ways.

We want AI to do the things that are objectively best and, being self-interested agents, we may want to add the caveat: do the things that are best for us. It’s proving really difficult to directly specify what to do, so perhaps ‘what to do’ can be iteratively discovered via indirect normativity: moral realism tempered with an accounting of human value.

A utilitarian AI could measure through empirical observation (a posteriori) the degree to which an action produces wellbeing and reduces suffering for the greatest number, and would therefore be in a good place to adjudicate the goodness of that action.

An AI aligning to natural moral law would be able to discover a posteriori the actions which align with our natural function.

Why might people see light in indirect normativity? Those on either side of the political divide may disagree on object level details, but they may agree that they don’t know what the perfect answer is – and that an ideal algorithm could in principle produce the far more accurate answer, each privately optimistic their views will be vindicated in the fullness of time.

But how do we seed the superintelligence such that its trajectory isn’t arbitrary, and that it aligns with a true, or permissibly true enough, moral arc?

I argue that AI should be aligned to a form of Moral Realism which, as explained in the Stanford Encyclopedia of Philosophy, is the view that there are objective moral facts that exist independently of human opinion or cultural norms. These facts are not created by humans but are discovered through reason, intuition, or some other means.

Note this page will be updated as I find better ways to express my concerns on this matter.

Assumptions: Moral Realism, Cognitivism, humans can’t not be stupid to some degree (and often converge on some kind of fake moral realism, or moral relativism).

Human values are difficult to specify

The idea that AI should align to human values is problematic if the values humans hold are empirically ungrounded, inconsistent, and contradictory. A variety of issues with extracting and coherently aggregating values have been discussed for some time.

Complexity of value thesis. It takes a large chunk of Kolmogorov complexity to describe even idealized human preferences. That is, what we ‘should’ do is a computationally complex mathematical object even after we take the limit of reflective equilibrium (judging your own thought processes) and other standard normative theories. A superintelligence with a randomly generated utility function would not do anything we see as worthwhile with the galaxy, because it is unlikely to accidentally hit on final preferences for having a diverse civilization of sentient beings leading interesting lives.

See: Yudkowsky (2011); Muehlhauser & Helm (2013).

Fragility of value thesis. Getting a goal system 90% right does not give you 90% of the value, any more than correctly dialing 9 out of 10 digits of my phone number will connect you to somebody who’s 90% similar to Eliezer Yudkowsky. There are multiple dimensions for which eliminating that dimension of value would eliminate almost all value from the future. For example an alien species which shared almost all of human value except that their parameter setting for “boredom” was much lower, might devote most of their computational power to replaying a single peak, optimal experience over and over again with slightly different pixel colors (or the equivalent thereof). Friendly AI is more like a satisficing threshold than something where we’re trying to eke out successive 10% improvements.

See: Yudkowsky (2009, 2011).
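
The phone-number analogy can be made concrete with a toy model. The sketch below is an illustration only and does not come from the cited essays: it assumes, hypothetically, that each dimension of value can be scored between 0 and 1, and it contrasts an additive aggregate (where getting 9 of 10 dimensions right yields 90% of the value) with a multiplicative one (where losing any single dimension eliminates the total). The function names and the scoring scheme are assumptions made for illustration.

```python
from math import prod

def additive_value(scores):
    """Average of per-dimension scores: losing one of n dimensions costs 1/n."""
    return sum(scores) / len(scores)

def fragile_value(scores):
    """Product of per-dimension scores: any zeroed dimension zeroes the total."""
    return prod(scores)

# Hypothetical goal system: ten dimensions of value, nine captured perfectly
# and one (say, "boredom") lost entirely.
scores = [1.0] * 9 + [0.0]

print(additive_value(scores))  # 0.9
print(fragile_value(scores))   # 0.0
```

On the additive picture the goal system looks 90% right; on the multiplicative picture, which is closer to what the fragility thesis claims, almost nothing of value survives.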

Indirect Normativity

Indirect design considerations for a design we don’t know how to directly specify.

In his seminal work ‘Superintelligence’, Bostrom (2014) raises a critical question regarding the axiomatic framework upon which a superintelligence ought to be constructed.

Lacking confidence in our ability to specify a concrete normative standard, we would instead specify some more abstract condition that any normative standard should satisfy, in the hope that a superintelligence could find a concrete standard that satisfies the abstract condition. We could give a seed AI the final goal of continuously acting according to its best estimate of what this implicitly defined standard would have it do.

Nick Bostrom, Superintelligence (pp. 258–259)

Bostrom presents a comparative analysis of several prominent proposals for its value system:

  • Coherent Extrapolated Volition (CEV): This approach posits that the superintelligence should inherit the (at least satisfactory) values on which humanity would converge, given enough time and insight, upon attaining a state of perfect rationality and self-awareness.

our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted

Coherent Extrapolated Volition – Yudkowsky 2004
  • Moral Rightness (MR): This proposal advocates for imbuing the superintelligence with a direct imperative to pursue morally right actions. However, Bostrom acknowledges the inherent ambiguity associated with the concept of “moral rightness,” highlighting the lack of a universally accepted definition within philosophical discourse. He cautions that adopting an erroneous interpretation could lead to catastrophic outcomes.
  • Moral Permissibility (MP): Building upon the CEV framework, the MP proposal suggests constraining the superintelligence’s actions to remain within the bounds of moral permissibility. This approach seeks to mitigate the challenges associated with defining “moral rightness” by focusing on actions that are demonstrably not morally wrong.

Bostrom further postulates that achieving any of these proposed value systems within an artificial intelligence might necessitate equipping it with advanced linguistic capabilities, comparable to those of a mature human adult. By fostering a comprehensive understanding of natural language, the AI could potentially grasp the nuances of “moral rightness” and undertake actions that demonstrably align with that concept.

Finally, Bostrom acknowledges the inherent challenges associated with the moral rightness model and proposes the moral permissibility approach as a potential compromise. This modification aims to preserve the core principles of the moral rightness model while mitigating its complexity by focusing on actions that are demonstrably not morally wrong. The superintelligence would then be free to pursue the extrapolated human volition, so long as its actions remain within the bounds of established moral permissibility.
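
The structure of the moral permissibility proposal (pursue extrapolated volition, but only among actions that are not morally wrong) resembles a filter followed by a maximization. The following is a minimal sketch under loud assumptions: `is_permissible` and `cev_score` are hypothetical stand-ins for capabilities nobody currently knows how to build, and the actions and numbers are invented purely for illustration. It is not Bostrom's specification.

```python
from typing import Callable, Iterable, Optional

def choose_action(
    actions: Iterable[str],
    is_permissible: Callable[[str], bool],
    cev_score: Callable[[str], float],
) -> Optional[str]:
    """Return the permissible action with the highest CEV score, or None."""
    # Step 1: discard every action judged morally impermissible.
    allowed = [a for a in actions if is_permissible(a)]
    if not allowed:
        return None
    # Step 2: among what remains, pursue the extrapolated volition.
    return max(allowed, key=cev_score)

# Hypothetical toy data: action "A" scores highest but is impermissible.
scores = {"A": 0.9, "B": 0.7, "C": 0.4}
forbidden = {"A"}

best = choose_action(scores, lambda a: a not in forbidden, lambda a: scores[a])
print(best)  # B
```

The design point is that permissibility acts as a hard constraint rather than as one more term to be traded off against the volition score.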

Moral Realism & Moral Rightness

‘Moral Rightness’ can be approached either through moral realism or moral anti-realism.

The moral realism that I advocate is an approach to ethics informed by the scientific method, where values are informed by facts about the world grounded in empiricism.

Moral judgments can be true or false, and there can be moral progress over time.

  • Arguments for moral realism:
    • Argument from moral disagreement: Disagreements about moral issues wouldn’t be meaningful if there were no right or wrong answers.
    • Argument from moral progress: If morality is entirely subjective, then it wouldn’t be possible to say that certain practices, like slavery, are objectively wrong.
    • The argument from the nature of morality: Morality seems to have a special status different from personal preferences, suggesting an objective reality.
  • Arguments against moral realism:
    • Moral relativism: Different cultures have different moral codes, making it impossible to claim one is objectively right.
    • The non-cognitivist view: Moral statements don’t express truth claims but rather emotions or attitudes.

Should Moral Realism be taken seriously?
And if so, how seriously?

Sociological claim: Moral Realism is the most popular stance amongst philosophers. A 2009 PhilPapers survey question, ‘Meta-ethics: moral realism or moral anti-realism?’, found that 56% of philosophers accept or lean towards moral realism (28%: anti-realism; 16%: other). Another PhilPapers survey in 2020 found 62.1% accept or lean towards realism (26.1%: anti-realism; 12.7%: other; note that most of the respondents in ‘other’ answered either ‘Agnostic/undecided’, 4.2%, or ‘The question is too unclear to answer’, 3.5%).

Tests on moral realism:

  • Are there cognitive signatures for pain and pleasure which consistently appear across species?
  • Could different AI systems independently “discover” the same moral principles?
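
One way the second test might be operationalized is to collect verdicts from independently built systems on a shared set of cases and measure how often they agree. The sketch below is only an assumption about how such a measurement could look: the system names and verdicts are invented, and high agreement would not by itself establish moral realism, since it could also reflect shared training data or shared design choices.

```python
from itertools import combinations

def agreement(a, b):
    """Fraction of cases on which two systems give the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_pairwise_agreement(verdicts):
    """Average agreement over all pairs of systems."""
    pairs = list(combinations(verdicts.values(), 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)

# Hypothetical verdicts ("is this action permissible?") on four shared cases.
verdicts = {
    "system_1": [True, False, True, True],
    "system_2": [True, False, True, False],
    "system_3": [True, False, False, True],
}

print(mean_pairwise_agreement(verdicts))
```

A fuller test would also need to control for correlated inputs, for example by comparing systems trained on disjoint data.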

AI should be trained to converge on moral facts

There is a feeling that we all share the same core values, and that any inconsistencies are minor and aren’t worth worrying about. On closer inspection of the nuts and bolts, though, there are plenty of values on which people vehemently disagree. The question of whose version of human values to align to may be decided by those who control the AI, and if those values are inimical to the values of others, perhaps even the majority of others, then the AI’s actions may be disastrous and widely abhorred.

  • AI in service to the private values of the humans who control the AI. These private values could be selfish or inconsistent with the private interests of the majority. Without any ethical grounding, AI may make no compromise in its actions outside of the private interests of the AI controllers.
    E.g. prioritizing individual wealth accumulation above all else, regardless of the environmental or social consequences, promotion of intolerance (potentially leading to discrimination or exclusion), and disregarding the needs of others.
  • Dealing with disagreement about what values to align to. The disagreements humans have about value may be unresolvable, or what resolutions can be made may be so messy as to be easy to blunder. Voting on what values to hold may not be the best way to avoid values which are selfish or limited in scope, or to zero in on those which better serve the collective good.
  • Human values may be wrong. And if history is a guide, they often are. The way in which values have been discarded or revised shows that they weren’t perfect to begin with, and also that they were often downright horrible.
    E.g. the treatment of human slaves and minorities, as well as non-human animals. The extract-and-burn-fast approach to maintaining living standards has had environmental effects that we are seeing today and may have dire consequences in the near future.
  • Value lock-in. If the AI is controlled by the values of humans, there is potential for value lock-in. This could resolve into a singleton utility monster selfishly reaping the benefits forever.
  • Difficulty in aligning to the complexity of human value. “Complexity of value is the thesis that human values have high Kolmogorov complexity; that our preferences, the things we care about, cannot be summed by a few simple rules, or compressed.” – see Less Wrong post
  • Value is fragile. “…losing even a small part of the rules that make up our values could lead to results that most of us would now consider as unacceptable (just like dialing nine out of ten phone digits correctly does not connect you to a person 90% similar to your friend). For example, all of our values except novelty might yield a future full of individuals replaying only one optimal experience through all eternity.” – see Less Wrong post
  • Human values don’t consider or prioritize values that aren’t human, e.g. the wellbeing of animals.

Why align to Moral Realism?

  • Universal implementation. If it’s true that pain, pleasure, novelty, etc. are foundational to all experience in the universe, then ethics grounded in this would serve all experiencing agents, and not just humans.
  • Simpler implementation. Pain, pleasure, suffering, wellbeing and sensitivity to novelty can be classed as ethical facts about the real world, much as it is a fact that the moon has a dark side because of the absence of light.
  • Tracking the real world is less likely to spin off into absurdity – slightly changing human value may cause
  • Easier to track. Much easier for an AI, with a different (alien) cognitive architecture, to
