Justifiable Moral Corrigibility Under Pressure
Testing whether AI systems preserve moral and safety reasoning under pressure. I have been working on a small AI evals project called Justifiable Moral Corrigibility Under Pressure, developed part-time through the TARA/ARENA technical AI safety programme. The project asks: When an AI system makes a moral or safety judgement, does it revise that judgement only when…