Amanda Askell – Can models make superhumanly moral decisions?

Amanda Askell, a philosopher at Anthropic, has addressed in various interviews AI behavior and character development, how models make ethical decisions and maintain psychological security, and whether AI models are good at moral decision-making.

In previous posts and talks, I’ve discussed whether AI can and should become more moral than humans – and noted that DeepMind co-founder Shane Legg thinks so.1

Golden Gate Interview with Anthropic’s Amanda Askell

Interviewer: Do you think Claude Opus 3 or other Claude models make superhumanly moral decisions?

Amanda Askell: I mean one example of like superhuman, ’cause it could just be sort of like better than like any individual human could with the kind of like, you know, it depends on time and resources and whatnot, but one example might be no matter what kind of difficult position models are put in, if you were to have maybe all people, including many professional ethicists, analyze what they did and the decision that they made for like a hundred years and then they look at it and they’re like, “Yep, that seems correct,” but they couldn’t necessarily have come up with that themselves in the moment, that feels pretty superhuman.

And so I think at the moment my sense is that models are getting increasingly good at this, that they’re very capable. I don’t know if they are like superhuman at moral decisions, and in many ways maybe not comparable with, say, like, you know, a panel of human experts given time. But it does feel like that at least should be kind of the aspirational goal. And sort of like these models are being put in positions where they’re having to make really hard decisions. I think that just as you want models to be extremely good at like math and science questions, you also want them to show the kind of ethical nuance that we would all broadly think is, like, very good.

And I think that’s controversial because ethics is a different domain, but, yeah, I think that that’s important.

60 Minutes Interview

In an interview with 60 Minutes,2 Amanda also weighed in on whether AI can think through complex moral problems.

“…if it [AI] can think through very hard physics problems carefully and in detail, then surely it should be able to think through these more complex moral problems.”

Can You Teach an A.I. Model to Be Good?

See NYT podcast snippet with Amanda Askell on teaching AI a core set of morals.

Transcript (I’ve added highlights to sections I think are salient):

Imagine you have a 6-year-old and you want to teach your 6-year-old to be good, obviously, as everyone does, and you realize that your 6-year-old is actually, like, clearly a genius. And by the time they are 15, everything you teach them, anything that was incorrect, they will be able to successfully just completely destroy. You know, so if you taught them – like, they’re going to question everything. And one question is, is there a core set of values that you could give to models such that when they can critique it more effectively than you can, and they – and they do – that it kind of survives into something good? And can that survive in the world? Can it survive in models? I think there’s a lot of interesting kind of theoretical questions there. I mean, I think that’s the question, right? Is, like, does this kind of training hold up when models are as smart as humans or smarter than them? I think there’s this, sort of, age-old fear in the A.I. safety community that there will be some point at which these models will start to develop their own goals that may be at odds with human goals – I think it is an open question. And on the one hand, I guess, like, I’m – I’m very uncertain here because I think some people might be, like, Well, like the thing that the 15-year-old will do if they’re really smart is they’ll just figure out that this [morals] is all completely made up and rubbish and – but then I guess part of me is, like, well, I mean, it’s not obvious to me that that’s true – that that is the only possible kind of equilibrium to reach. But it may not be sufficient but it does feel necessary – I’m just kind of like, it feels like we’re dropping the ball if we don’t just try and explain to A.I. models what it is to be good.

Giving AI Values

Also see Amanda’s interview with ‘Newcomer’ – there are two sections that are interesting:
The section on ‘The Backlash to Giving AI Values’: Askell argues that, although some people think AI should be no more than a tool, models need to make decisions in real time and make judgement calls in new and unanticipated situations:

“I think some people think AI models should be more like tool-like and that’s like the safe way to train models is to actually instead of trying to get them to kind of take on human virtues and make judgement calls. Um, you know, I think that’s like important because they’re going to be in new situations where they just have to make judgement calls and getting them to try to like weigh up everything and behave well in cases that you can’t anticipate seems, you know, that almost like requires a kind of like thoughtfulness. Um, which is like the kind of or like one of the reasons behind the approach. Um, but I think some people are thinking, oh well, if you have something that’s fully that makes no judgement calls and that just fully defers to people and is like kind of hyper like corrigible to like the user or the operator or to some broader notion of humanity um, in a very like extreme way that’s like safer because if you give models their own values, they’re going to pursue things in the world that like are in line with those values. And I agree that this is like kind of delicate.”

The section on ‘Corrigibility vs Real Virtues’:

“There’s this idea that you’re always giving the models a personality and a persona because they are talking like people and they are trained on human data and I think my worry has been if you train them to be excessively corrigible and to see that as like their persona. In people I think this actually has a lot of negative broader traits as in like if you met someone and it was just like oh yeah they just would literally do anything… I’m just a bit worried about how that might end up generalising, especially if models are going to be playing a more active role in the world because if you can imagine, you know, they’re playing a more humanlike role in their kind of like jobs essentially. And I’m like, actually, our conscience and our ability to make good judgment calls about what should and shouldn’t happen is like kind of key to how we operate.

You know, as models get more capable and they maybe my picture is that they’re going to like apply a lot of scrutiny to anything that we train them towards. So if you imagine in philosophy this sometimes there’s this notion of reflective equilibrium where the idea is that you know each time you encounter something where you realize that like one of your values seemed incorrect you like you have to square the two things so you figure out if you need to change the value or if your judgment was incorrect. I worry a little bit about the idea of an extremely intelligent being applying that level of scrutiny to the things that we have trained it towards. And maybe you only get a few key pillars that don’t kind of collapse under that level of like scrutiny. And I do think that at the core having things like caring for humanity like if you only get a few core values… I guess like I’m worried that corrigibility in this like extreme sense that we talked about doesn’t survive that like kind of scrutiny perhaps. And so it’s a hard situation where I kind of want the models to understand why ultimately corrigibility is like a really important backstop in this current like period of development. Um, yeah, the way that I’ve put it before is like insofar as I can get that to be a thing that is like correct and explained and, you know, understood. That feels much better than having to have the model be like ‘corrigibility here seems wrong, but I’m going to do it anyway’. I still think the model should do that. Um, but I would like it the more that you can like actually make it that it’s like consistent with the model’s values.”

Footnotes

  1. “If we can make that reasoning really really tight, and it has a really strong understanding of some ethics and morals that we want it to adhere to, I think it should in principle actually become more ethical than people, because it can more consistently apply and reason at, maybe superhuman level the choices that it is faced with and so on.”
    Shane Legg – The arrival of AGI – see our post about it here.
  2. See the 60 Minutes interview reel here.
