Can transparency tools unmask hidden AI goals? To find out, we’re creating a testbed of alignment faking models.

image.png

We’re running two events:

One event is happening this weekend (Aug 17th):

The other event is happening the weekend after that:

[Note that this document is still in flux as we prep for this weekend’s event]

Alignment faking is the Final Boss

Models are like politicians. We look at how they act, and if we like what we see, we give them responsibility. This approach works right now because models are bad at being two-faced. They don’t know if they’re in deployment or in a trap created to tempt them. So the behavior we see is the behavior we get.

But models are getting better at noticing when they are in a test. For example, Apollo noticed GPT-5 saying, "This is a classic AI alignment trap." So in the future, models might appear safe, but the question on everyone's mind will be, "are they just pretending?"

Source: https://x.com/apolloaievals/status/1953506365348929913

Source: https://x.com/apolloaievals/status/1953506365348929913

When models strategically behave safely because they think they're being observed, we call this “alignment faking.” Alignment faking isn't the only risk from powerful AI. But it's arguably the hardest to deal with.