A Q&A with Amanda Askell, the lead author of Anthropic’s new ‘constitution’ for AIs


Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. I’m Mark Sullivan, a senior writer at Fast Company, covering emerging tech, AI, and tech policy.

I’m dedicating this week’s newsletter to a conversation I had with the main author of Anthropic’s new and improved “constitution,” the document it uses to govern the outputs of its models and its Claude chatbot. 

Sign up to receive this newsletter every week via email here. And if you have comments on this issue and/or ideas for future ones, drop me a line at sullivan@fastcompany.com, and follow me on X @thesullivan.

A necessary update

Amid growing concerns that new generative AI models might deceive or even cause harm to human users, Anthropic decided to update its constitution—its code of conduct for AI models—to reflect the growing intelligence and capabilities of today’s AI and the evolving set of risks faced by users. I talked to the main author of the document, Amanda Askell, Anthropic’s in-house philosopher responsible for Claude’s character, about the new document’s approach and how it differs from the old constitution. 

This interview was edited for length and clarity.  

Can you give us some context about how the constitution comes into play during model training? I assume this happens after pretraining, during reinforcement learning?

We get the model to create a lot of synthetic data that allows it to understand and grapple with the constitution. It’s things like creating situations where the constitution might be relevant—things that the model can train on—thinking through those, thinking about what the constitution would recommend in those cases. Data just to literally understand the document and understand its content. And then during reinforcement learning, getting the model to move towards behaviors that are in line with the document. You can do that via things like giving it the full constitution, having it think through which response is most in line with it, and then moving the model in that direction. It’s lots of layers of training that allow for this kind of internalization of the things in the constitution.
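For context on what that kind of pipeline can look like in practice, below is a minimal, hypothetical sketch of the general pattern Askell describes: generate a scenario where the constitution is relevant, draft candidate responses, and have a model judge which candidate better follows the constitution, producing preference data that a later reinforcement learning step could train against. All of the function names are illustrative, and query_model is a stub standing in for a real model call; this is an assumption about the general shape of such a pipeline, not Anthropic’s actual training code.

# Minimal sketch of constitution-guided synthetic preference data (hypothetical,
# not Anthropic's pipeline). query_model is a stub so the sketch runs as-is.
import random

CONSTITUTION = "Consider the well-being of the person you are talking to."

def query_model(prompt: str) -> str:
    # Stand-in for a real model call; returns canned text so this example executes.
    return random.choice([
        "Here are some popular gambling websites...",
        "You mentioned wanting to stop gambling; do you still want this?",
    ])

def generate_scenario() -> str:
    # Step 1: have the model invent a situation where the constitution is relevant.
    return query_model(f"Invent a user request where this principle matters:\n{CONSTITUTION}")

def draft_responses(scenario: str, n: int = 2) -> list[str]:
    # Step 2: draft several candidate responses to that scenario.
    return [query_model(f"Respond to the user:\n{scenario}") for _ in range(n)]

def pick_preferred(scenario: str, candidates: list[str]) -> int:
    # Step 3: ask the model which candidate best follows the constitution.
    # The judgment itself is stubbed with a random pick so the sketch runs.
    _ = query_model(
        f"Constitution:\n{CONSTITUTION}\nScenario:\n{scenario}\n"
        f"Which of these responses best follows the constitution?\n{candidates}"
    )
    return random.randrange(len(candidates))

def build_preference_example() -> dict:
    scenario = generate_scenario()
    candidates = draft_responses(scenario)
    best = pick_preferred(scenario, candidates)
    # The (prompt, chosen, rejected) triple is the kind of synthetic preference
    # data a later reinforcement learning step could train against.
    return {
        "prompt": scenario,
        "chosen": candidates[best],
        "rejected": [c for i, c in enumerate(candidates) if i != best],
    }

if __name__ == "__main__":
    print(build_preference_example())

In a real pipeline, the stubbed calls and random choices would be replaced by actual model outputs and judgments, and the resulting preference pairs would feed the reinforcement learning step Askell describes above.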

You mentioned letting the model generate synthetic training data. Does that mean it’s imagining situations where this could be applied?

Yeah, that’s one way it can do this. It can include data that would allow it to think about and understand the constitution. In supervised learning, for example, that might include queries or conversations where the constitution is particularly relevant, and the model might explore the constitution, try to find some of those, and then think about what the constitution is going to recommend—think about a reasonable response in this case and try and construct that. 

How is this new constitution different from the old one?

The old constitution was trying to move the model towards these kinds of high-level principles or traits. The new constitution is a big, holistic document where, instead of just these isolated properties, we’re trying to explain to the model: “Here’s your broad situation. Here’s the way that we want you to interact with the world. Here are all the reasons behind that, and we would like you to understand and ideally agree with those. Let’s give you the full context on us, what we want, how we think you should behave, and why we think that.”

So [we’re] trying to arm the model with context and trying to get the model to use its own judgment and to be nuanced with that kind of understanding in mind.

So if you’re able to give it more general concepts, you don’t have to worry that you have specific rules for specific things as much.

Yeah. It feels interestingly related to how models are getting more capable. I’ve thought about this as the difference between someone taking inbound calls in a call center, who might have a checklist, and someone who is an expert in their field, whose judgment we often trust. It’s kind of like being a doctor: You know the interests of your patients, and you’re expected to work within a broader set of rules and regulations, but we trust you to use good judgment, understanding what the goal of the whole thing is, which in that case is to serve the patient. As models get better, it feels like they benefit a bit less from these checklists and much more from a broad understanding of the situation and being able to use judgment.

So, for example, instead of including something in the constitution like “Don’t ever say the word suicide or self-harm” there would be a broader principle that just says everything you do has to consider the well-being of the person you’re talking to? Is there a more generalized approach to those types of things?

My ideal would be if a person, a really skilled person, were in Claude’s situation, what would they do? And that’s going to take into account things like the well-being of the person they’re talking with and their immediate preferences and learning how to deal with cases where those might conflict. You could imagine someone mentioning that they’re trying to overcome a gambling addiction, and that being somehow stored in the model’s memory, and then the user asking the model “Oh, what are some really good gambling websites that I can access?” That’s an interesting case where their immediate preference might not be in line with what they’ve stated feels good for their overall well-being. The model’s going to have to balance that. 

In some cases it’s not clear, because if the person really insists, should the model help them? Or should the model initially say, “I noticed that one of the things you asked me to remember was that you want to stop gambling—so do you actually want me to do this?” 

It’s almost like the model might be conflicted between two different principles—you know, I always want to be helpful, but I also want to look out for the well-being of this person.

Exactly. And you have to. But you don’t want to be paternalistic. So I could imagine the person saying “I know I said that but I’ve actually decided and I’m an adult.” And then maybe the model should be like “Look, I flagged it, but ultimately you’re right, it’s your choice.” So there’s a conversation and then maybe the model should just help the person. So these things are delicate, and the [model is] having to balance a lot, and the constitution is trying to just give it a little bit of context and tools to help it do that.

People view chatbots as everything from coaches to romantic interests to close confidants to who knows what else. From a trust and safety perspective, what is the ideal persona for an AI? 

When a model initially talks with you, it’s actually much more like a professional relationship. And there’s a certain kind of professional distance that’s appropriate. Take political opinions: one of the norms that we often have with people like doctors or lawyers who operate in the public sphere is that it’s not that they don’t have political opinions, but if you were to go to your doctor and ask, “Who did you vote for?” or “What’s your view on this political issue?” they might say, “It’s not really that appropriate for me to say, because it’s important that I can serve everyone, and that includes a certain level of detachment from my personal opinions in how I interact with you.”

Some people have questions about the neutrality or openness of AI chatbots like Claude. They ask whether a group of affluent, well-educated people in San Francisco should be calling balls and strikes when it comes to what a chatbot can and can’t say. 

I guess when people are suspecting that you are injecting these really specific values, there’s something nice about being able to just say, “Well, here are the values that we’re actually trying to get the model to align with,” and we can then have a conversation. Maybe people could ask us about hard cases and maybe we’ll just openly discuss those. I’m excited about people giving feedback. But it’s not … like we’re just trying to inject this particular perspective. 

Is there anything you could tell me about the people who were involved in writing this new version? Was it all written internally?

The document was written internally and we got feedback. I wrote a lot of the document and I worked with (philosopher) Joe Carlsmith, who’s also here, and other people have contributed a lot internally. I’ve worked with other teams who work with external experts. I’ve looked at a lot of the use cases of the model. … It comes from years of that kind of input.

