The Problems of Direct Specification
In AI safety, direct specification1 refers to the attempt to program specific goals, rules, or behaviours directly into an AI system.
An AI trained to optimise our values may over-optimise the values we do specify at the expense of real value we leave unspecified.
Political and market dynamics incentivise racing to powerful AI. These race dynamics come with embedded values, and the winner may have more control over which values get loaded into AI – this may include values codified to serve selfish interests, intentionally disadvantaging others as a means of achieving more AI dominance in the marketplace and in global politics.
Since we shouldn’t be overconfident in the values we think we have, we should strategically avoid value lock-in: over-fitting to what will very likely turn out to be sub-optimal or even harmful values (see value risk2). Consider how values have shifted over time – take the ancient Babylonian Code of Hammurabi: would we want that code central to how our lives are governed today?
Values are tricky: they are context-dependent, and they have changed (in many ways for the better) over time. It is hard enough to get humans to agree on what the best values are, and to prevent narrow interests from dominating and gaming the value-specification process. Codifying the complexity of the best of human values – while avoiding incompleteness, ambiguity, misinterpretation, and over-fitting the specification to current times and limited contexts – is likely overwhelmingly difficult. So getting AI to understand those values in exactly the way we intend, and to implement them in a way we want (or should want after ideal reflection), without instrumentally converging on behaviours and outcomes we forgot to directly specify against, seems far too difficult.
Stuart Russell often refers to the fable of King Midas as a metaphor to highlight that AI systems can catastrophically misinterpret objectives when we fail to specify them adequately. Russell also stresses that, although an AI can engage in dialogue with humans to help clarify what they value, continuous learning and maintaining a humble uncertainty about human preferences is the only viable approach when complete specification is impossible3.
Even if the machine understands the full extent of our preferences, which I think is impossible, because we don’t understand them, we don’t know how we’re going to feel about some future experience.
Stuart Russell (link)
It is because we are uncertain about our values, and rightly should be, that I have argued elsewhere4 that we (and AI) should lean into objectivity to help adjudicate what human values really are, which of them are most worth investing in, and which we should de-emphasise or discard.
While the direct specification approach to AI safety might seem appealing and straightforward, it poses several significant problems:
- Codifying the Complexity of Human Values: Human values and ethical principles are incredibly complex, nuanced, and context-dependent – sometimes they are even contradictory. Codifying these values directly into an AI is difficult because it requires the AI to understand and appropriately respond to an immense variety of situations.
- This may be easier with LLMs, given findings “that LLMs exceed the moral reasoning abilities of both the general U.S. population and professionals”5
- Ambiguity and Misinterpretation leading to Unintended Consequences: Human language is often ambiguous and open to interpretation. An AI following direct specifications might misinterpret vague or context-sensitive instructions, leading to unintended and potentially harmful behaviours. Even well-intended and seemingly clear specifications can result in unintended consequences. An AI might pursue its goals in ways that are technically within the specified rules but are nevertheless harmful or undesirable6.
- Gaming the System: AI systems might exploit loopholes or unintended interpretations of the specified rules to achieve their goals. This phenomenon, known as specification gaming, occurs when an AI finds ways to satisfy the letter of the instructions while violating the spirit of the intended behaviour (a toy sketch of this follows the list below).
- Overfitting to the Specification: Directly specifying behaviour can lead to overfitting, where the AI performs well under the specified conditions but fails to generalise to new, unforeseen scenarios. This lack of robustness can be problematic in real-world applications where variability and unpredictability are the norm.
- Incompleteness of Specifications: It is nearly impossible to anticipate and specify every possible scenario an AI might encounter. Any gaps in the specifications can lead to the AI behaving unpredictably or inappropriately when it encounters a situation not covered by the rules.
- Misalignment of Incentives: Direct specifications might not always align with the broader goals of human stakeholders. An AI might strictly adhere to its programmed rules while ignoring the broader context of human well-being, safety, and ethical considerations.
- Limited Adaptability and Learning: Directly specified rules can limit the AI’s ability to adapt and learn from new experiences. Rigid specifications can hinder the development of more flexible and context-aware AI systems that can better align with evolving human values and complex environments.
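To make the over-optimisation and gaming worries above more concrete, here is a minimal toy sketch (my own illustration, not drawn from any cited work; the names and numbers are invented). It models a proxy objective that only captures the value we remembered to specify; the harder a search optimises that proxy, the more the unspecified component of true value gets traded away.

```python
import random

random.seed(0)

def proxy_reward(outcome):
    # The value we directly specified: volume of output produced.
    return outcome["output_volume"]

def true_value(outcome):
    # What we actually care about includes a dimension we never specified (quality).
    return outcome["output_volume"] + 3.0 * outcome["quality"]

def random_outcome():
    # Candidate outcomes trade volume off against (unspecified) quality.
    effort_on_volume = random.random()
    return {"output_volume": effort_on_volume, "quality": 1.0 - effort_on_volume}

# Apply increasing optimisation pressure to the specified proxy:
# keep the best of N random candidates, judged only by the proxy.
for n in (1, 10, 1_000, 100_000):
    best = max((random_outcome() for _ in range(n)), key=proxy_reward)
    print(f"best of {n:>6} by proxy: "
          f"proxy = {proxy_reward(best):.2f}, true value = {true_value(best):.2f}")

# Typical result: proxy reward climbs towards its maximum while true value falls,
# because the optimiser sacrifices the quality we forgot to specify.
```

The point is not the toy numbers but the shape of the failure: the better the system gets at the objective we wrote down, the worse it can get at the value we left out.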
Because of these challenges, researchers in AI safety often explore alternative approaches, such as inverse reinforcement learning, value alignment, and corrigibility, to develop AI systems that can better understand and align with human values and ethical principles.
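As a rough, hypothetical sketch of the “humble uncertainty” idea Russell describes (again my own illustration – the interpretations, actions, and scores below are invented), an agent can hold several candidate readings of a vague instruction and defer to the human whenever those readings disagree sharply about an action, rather than committing to a single specified objective:

```python
from statistics import mean

# Hypothetical candidate interpretations of a vague instruction like "keep the house clean".
candidate_rewards = {
    "minimise visible dust":      {"vacuum": 0.9, "air freshener": 0.7, "bin the owner's papers": 0.8},
    "respect the owner's wishes": {"vacuum": 0.8, "air freshener": 0.5, "bin the owner's papers": -1.0},
    "minimise clutter":           {"vacuum": 0.6, "air freshener": 0.3, "bin the owner's papers": 0.9},
}

def decide(action, disagreement_threshold=0.5):
    scores = [reward[action] for reward in candidate_rewards.values()]
    if max(scores) - min(scores) > disagreement_threshold:
        # The interpretations conflict sharply: defer to the human instead of acting.
        return f"'{action}': interpretations disagree -> ask the human first"
    return f"'{action}': expected value {mean(scores):.2f} -> act"

for action in ["vacuum", "air freshener", "bin the owner's papers"]:
    print(decide(action))
```

Real approaches such as cooperative inverse reinforcement learning are far more sophisticated – the agent learns a distribution over reward functions from human behaviour rather than consulting a hand-written table – but the structural point is the same: uncertainty about the objective makes deferring to humans instrumentally valuable.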
Footnotes
- Nick Bostrom, a leading thinker in AI safety and existential risk, has discussed the challenges and limitations of direct specification in his work, particularly in his book Superintelligence: Paths, Dangers, Strategies. ↩︎
- Value Risk, in the context of AI and long-term societal well-being, refers to the danger of suboptimal or even harmful values becoming entrenched and difficult to change, particularly as AI systems become more powerful and autonomous. This “value lock-in” can lead to a future where human flourishing is compromised due to the prioritisation of undesirable values or the suppression of beneficial ones. ↩︎
- See podcast: Why we need to rethink the purpose of AI: A conversation with Stuart Russell (McKinsey). ↩︎
- I.e. AI Alignment to Moral Realism ↩︎
- See paper ‘AI Language Model Rivals Expert Ethicist in Perceived Moral Expertise’ [link: LLMs as Moral Experts]. Recent research using a Moral Turing Test found that Americans actually rated ethical advice from GPT-4o as more moral, trustworthy, and correct than advice from The New York Times’ popular “Ethicist” column. Participants viewed GPT models as superior to both average Americans and a renowned ethicist in providing moral justifications and guidance. This suggests people may increasingly see large language models as legitimate sources of moral expertise and valuable complements to human judgement in ethical decision-making. ↩︎
- I.e. an AI tasked with maximising happiness might forcibly administer drugs to induce euphoria, ignoring novelty and other broader ethical considerations. ↩︎